Complex Predicates Annotation in a corpus of Portuguese
Iris Hendrickx, Amália Mendes, Sílvia Pereira, Anabela Gonçalves and Inês Duarte
Centro de Linguística e Faculdade de Letras da Universidade de Lisboa, Lisboa, Portugal

What are complex predicates?
Complex predicates (CPs) are constructions with more than one lexical unit, each contributing part of the information normally associated with a single predicate.
A CP behaves like a syntactic unit: syntactic processes usually related to one of its elements operate over the whole CP.
• Verb + Noun predicates
  take a walk, have a rest
• Verb + Verb predicates
  querer estudar (want to study)
  fazer rir (make laugh)

Overview of the talk
• Introduction
• CP typology
  – Annotated subset
• Annotation system
  – Special cases
• Annotation results
  – Statistics
  – Special cases
• Future Work

CP Theory
The first verb can be considered as:
• Light verb (Jespersen, 1949)
• Support verb (Gross, 1981)
• Auxiliary verb (Abeillé et al., 1998)
We believe that the first verb in a CP has predicative content: both elements in a CP contribute to its overall structure and meaning.
Duarte, I., M. Miguel & A. Gonçalves (2009). Light verbs as predicates. Paper presented at TABU Dag 2009, Groningen.

Project PREPLEXOS
The work presented here is based on the following project:
PREPLEXOS: Predicados Complexos, tipologia e anotação de corpus
developed at Centro de Linguística da Universidade de Lisboa and supported by FCT (PTDC/LIN/68241/2006)
Goal: create a corpus-based resource for the linguistic study of CPs in Portuguese

CP typology
(1) Two verbs in a restructuring construction
    não me queres dizer (you do not want to tell me)
(2) Two verbs in a causative construction with clause union
    lhe fez espirrar o sangue (made him spit blood)
(3) Verb + Noun construction
    terem medo da tuberculose (have fear of tuberculosis)

CP Typology
(4) Verb + Secondary predicate
    • Adjective
      tornar a história credível (make the story believable)
    • Prepositional phrase
      fazer x em pedaços (to make x into pieces)
(5) Verb and Verb constructions
    O Pedro pegou e despediu-se (Pedro took and said goodbye)

Annotation System (1)
Due to time limitations, we exclude type (4) (verb+ADJ/PP) and we restrict type (3) to:
• nouns derived from a verb
  dar um passeio (to take a walk)
• nouns expressing an emotion, i.e., psych-nouns
  ter medo (to be afraid)

Annotation System (2)
We focus on a subset of verbs, for example:
– verb+noun: ter (have), dar (give), fazer (make)
– causative: mandar (order), deixar (let), fazer (make)

The annotation guidelines follow the results of our CP study under a generative grammar framework and are therefore theory-oriented.
We do not annotate idiomatic expressions.
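
A minimal sketch, in Python, of how the verb subset listed above might be written down as a lookup when pre-selecting candidate sentences for annotation. The names FIRST_VERBS and candidate_types are hypothetical, and the lists contain only the verbs the slides give "for example", so they should not be read as the project's full inventory.

```python
from typing import Dict, List, Set

# Assumption: only the verbs named on the slides; the real subset may be larger.
FIRST_VERBS: Dict[str, Set[str]] = {
    "verb+noun": {"ter", "dar", "fazer"},
    "causative": {"mandar", "deixar", "fazer"},
}


def candidate_types(first_verb_lemma: str) -> List[str]:
    """Return the construction types for which this lemma is a listed first verb."""
    lemma = first_verb_lemma.lower()
    return sorted(ctype for ctype, verbs in FIRST_VERBS.items() if lemma in verbs)
```

With these lists, fazer is a candidate for both the causative and the verb+noun type, while a restructuring verb such as querer returns an empty list simply because the slides do not enumerate restructuring verbs.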

Annotation Tags
• verb + verb constructions (types (1), (2), (5)): tag CV
• verb + noun constructions (type (3)): tag CN

verb+verb constructions
Restructuring constructions: [CVR]
• porque nos [CVR] queriam convidar
  because [they] us wanted to invite ('because they wanted to invite us')
Causative constructions: [CVC]
• veio abalar estes alicerces espirituais [CVC] fazendo traduzir ao rapaz "Pucelle" de Voltaire
  (he shook these spiritual foundations by making translate to the boy "Pucelle" by Voltaire)

verb+verb constructions
Coordinated verbs: [CVE]
• e [CVE] vai um e conta ao outro.
  and goes one and tells to the other ('and he tells the other')
  (appears in informal spoken discourse)

verb+noun constructions
Bare nouns: [CNB]
• Facto que leva a CGD a considerar que não [CNB] tem obrigações em relação aos trabalhadores.
  (The fact that leads the CGD to believe that it doesn't have obligations towards the workers)
Nouns with a determiner: [CN]
• o erro de [CN] fazer uma interpretação literal
  (the error of making a literal interpretation)

Order annotation
Two indications:
• CP ordering (position 1, 2, etc.): the element's position in the canonical form
• its contextual position in the example: B=Beginning, I=Intermediate, E=End
depois de um[CN2_B] aviso[CN3_I] dado[CN1_E]
Canonical form: dar um aviso

Special cases
• Ambiguity
• Overlapping CPs
• Coordination inside CP

Special case: Ambiguity
• Clearly CP:
  fazer perceber aos cidadãos em geral, que a fotocópia corresponde a um acto de pirataria inaceitável
  (make understand to all citizens that a photocopy corresponds to an act of unacceptable piracy)
• Clearly embedded clause:
  fazer os cidadãos perceber que a fotocópia corresponde a um acto de pirataria inaceitável
• Ambiguous: CP or just a verb + embedded infinitive clause?
  uma forte vontade de fazer progredir o processo de paz
  (a strong will to make progress the process of peace)

Special case: Overlap
Two CPs overlap in one verb, which receives a double tag:
• não o queriam[CVR1_B] deixar[CVR2_E][CVC_VINF1_B] fugir[CVC_VINF2_E]
  ([they] not him wanted to let escape)
Two CPs: queriam deixar and deixar fugir

Special case: Coordination
Two CPs share the same verb in a conjunctive clause:
• para quem o quis[CVR1_B] ouvir[CVR2_1_E] e eventualmente registar[CVR2_2_E]
  (to whom wanted to listen and eventually register him)
• nós temos[CN1_B] uma[CN2_1_I] tristeza[CN3_1_E] / uma[CN2_2_I] frustração[CN3_2_E] muito grande
  (we have a sadness / a frustration very deep)

Corpus
• The CINTIL corpus contains 1 million tokens of Portuguese. It was compiled from different existing resources and contains both spoken (1/3) and written (2/3) texts.
• The CINTIL corpus is available for online queries (//cintil.ul.pt)
F. Barreto, A. Branco, E. Ferreira, A. Mendes, M. F. P. Bacelar do Nascimento, F. Nunes, and J. Silva. 2006. Open resources and tools for the shallow processing of Portuguese. LREC 2006.

Annotation results

label      written  spoken  total
CV total       470     219    689
CVR             34      47     81
CVC             13       3     16
CVE              0       1      1
CVR_VINF       300     143    443
CVC_VINF       123      25    148
CN total       706     586   1292
CNB            353     213    566
CN             353     373    726
total         1176     805   1981

Annotator agreement
A small experiment: two annotators annotated 50 sentences independently, reaching a kappa value of .81.

Special cases
Zooming in on the frequencies of the special cases in the CINTIL corpus:

label         written  spoken  total
CV ambiguity      423     168    591
coordination       15      13     28
overlap            10      16      6

To what extent do CPs occur in canonical form?
• verb+verb constructions always occur in canonical form.
• determiner-noun+verb (CN) constructions occur much more often in a non-canonical order than bare-noun+verb (CNB) constructions.

label  written  spoken  total  % of occ
CN          86      37    123      16.9
CNB          7       2      9       1.6

Conclusion
• We presented the annotation process of complex predicates in the CINTIL corpus.
• We showed a broad statistical analysis of the results and zoomed in on some research questions.
• In total, almost 2000 examples were annotated in the corpus.
• This resource will be used for further investigation.

Future Work
• Further analyze the results for the verb+verb types of CPs.
• The large number of ambiguous cases and the few contexts that give us definite clues for categorizing a sequence as a CP challenge our concept of complex predicates.
• As to the verb+noun constructions, we want to examine the contexts with and without a determiner to see if the same CP can occur in both structures.

Future Work (2)
• To look at a broader list of first verbs: for example, certain contexts of psych-nouns like sentir medo 'feel fear', experienciar uma profunda emoção 'experience a deep emotion', where the predicative nature of the verb is unclear.
• To enlarge our description and annotation of CPs to include idiomatic expressions with light verbs.
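
As an illustration of the order annotation and of the overlap and coordination cases described earlier, the following Python sketch reads the inline tags and puts the tagged words back into canonical order. The tag grammar is inferred from the examples on the slides only (label, position, optional conjunct index, then B/I/E); the names TAG_RE, Element, parse_tagged and canonical_forms are hypothetical and do not come from the project's tools. The sketch reorders surface words, so it yields "dado um aviso" rather than the lemmatized "dar um aviso", and it assumes at most one CP per label in the input.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional

# Assumed tag shape: [LABEL<position>(_<conjunct>)?_<B|I|E>], e.g. [CN2_B], [CVR2_1_E].
TAG_RE = re.compile(
    r"\[(CVR_VINF|CVC_VINF|CVR|CVC|CVE|CNB|CN)(\d+)(?:_(\d+))?_([BIE])\]"
)


class Element(NamedTuple):
    word: str
    label: str               # CP type, e.g. "CN" or "CVR"
    position: int            # slot in the canonical form (1, 2, ...)
    conjunct: Optional[int]  # index of the coordinated member, if any
    context: str             # B, I or E: place in the running text


def parse_tagged(text: str) -> List[Element]:
    """Collect every tagged word; one word may carry several tags (overlap)."""
    elements = []
    for token in text.split():
        word, bracket, rest = token.partition("[")
        if not bracket:
            continue  # untagged word
        for match in TAG_RE.finditer(bracket + rest):
            label, pos, conj, ctx = match.groups()
            elements.append(
                Element(word, label, int(pos), int(conj) if conj else None, ctx)
            )
    return elements


def canonical_forms(text: str) -> Dict[str, List[str]]:
    """Rebuild canonical word order per label, one string per conjunct."""
    by_label = defaultdict(list)
    for element in parse_tagged(text):
        by_label[element.label].append(element)

    forms = {}
    for label, elements in by_label.items():
        conjuncts = sorted({e.conjunct for e in elements if e.conjunct is not None})
        forms[label] = []
        # Elements without a conjunct index are shared by every conjunct.
        for conj in conjuncts or [None]:
            members = [e for e in elements if e.conjunct in (None, conj)]
            members.sort(key=lambda e: e.position)
            forms[label].append(" ".join(e.word for e in members))
    return forms
```

Applied to the overlap example, the function returns "queriam deixar" under CVR and "deixar fugir" under CVC_VINF, matching the two CPs named on the slide; applied to the first coordination example it returns "quis ouvir" and "quis registar".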