ExATOlp: extraction of language resources from Portuguese corpora

Lucelene Lopes, Renata Vieira, Paulo Fernandes, and Gabriel Couto

FACIN – PUCRS – Porto Alegre – Brazil
{lucelene.lopes, renata.vieira, paulo.fernandes, gabriel.couto}@pucrs.br

Abstract. This paper presents four main features of the ExATOlp software tool. These features provide the following language resources: relevant corpus terms with their morpho-syntactic and frequency features; a concordancer (term contexts); concept tags; and concept hierarchies. The tool emphasizes the high quality of the extracted terms. The provided resources offer a concise representation of non-obvious characteristics of the extracted terms.

1 Introduction

ExATOlp is a software tool that extracts relevant terms from a domain corpus written in Portuguese. The first version of the tool [5] was a prototype with some basic techniques; a new version was recently released that incorporates the enhanced extraction techniques proposed by Lopes [4]. Among the improvements made for the new version, a set of four features was designed to deliver useful language resources. Section 2 briefly describes the core of ExATOlp, and Section 3 presents a few details and examples of the linguistic resources generated with it.

2 ExATOlp Core

The basic operation of ExATOlp consists of receiving a linguistically annotated corpus written in Portuguese and detecting the noun phrases within it. The current version accepts texts annotated by the PALAVRAS parser [1], but other parsers, e.g., LX-center [8], could also be employed with the required adaptations. An important aspect of the extraction is that noun phrases are defined generically as any single-word or multiword term that bears conceptual significance. This general definition of noun phrases is in accordance with the modern terminological literature [3].
The most important linguistic distinction of the ExATOlp extraction procedure is its use of a set of 11 heuristic rules to improve the quality of the extracted terms. These heuristics were proposed by Lopes [4], and their effectiveness was analyzed by Lopes and Vieira [6], who report precision and recall improvements of approximately 50%.

In addition to the linguistic processing performed by the 11 heuristic rules, the ExATOlp procedure is considerably enhanced by a frequency-based relevance estimate computed with the tf-dcf index (term frequency, disjoint corpora frequency), a novel relevance index proposed by Lopes [4]. Using this index improves extraction precision by approximately 30% over the standard absolute frequency. Furthermore, the precision achieved with the tf-dcf index was observed to be superior to that of other similar relevance indexes, for instance [7, 2].

3 Language Resources

Four features were implemented in the current version of ExATOlp in order to provide language resources based on the extracted relevant terms. The full list of extracted terms may be reduced to only the most relevant terms. The choice of which terms are the most relevant depends on many factors, including the purpose of the generated resources. We do not discuss this choice further in this paper, except to note that ExATOlp offers the user many options of relevance criteria, including a default option originally proposed by Lopes [4].
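To make the role of tf-dcf concrete, the following is a minimal sketch and not the ExATOlp implementation. This paper does not state the formula, so the form used here is an assumption: the frequency of a term in the domain corpus is divided by a product of log-damped frequencies of the same term in a set of contrasting (disjoint) corpora, so that terms that are also common elsewhere are penalized. The base-2 logarithm, the function names tf_dcf and rank_terms, and the toy counts are all hypothetical; the definition in [4] should be consulted for the actual index.

```python
import math


def tf_dcf(term, domain_tf, contrast_tfs):
    """Assumed tf-dcf form: domain frequency divided by a penalty
    that grows with the term's frequency in each contrasting corpus."""
    penalty = 1.0
    for contrast_tf in contrast_tfs:
        # A term absent from a contrasting corpus contributes a factor of 1.
        penalty *= 1.0 + math.log2(1.0 + contrast_tf.get(term, 0))
    return domain_tf.get(term, 0) / penalty


def rank_terms(domain_tf, contrast_tfs):
    """Return (term, score) pairs sorted from most to least relevant."""
    scored = [(t, tf_dcf(t, domain_tf, contrast_tfs)) for t in domain_tf]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy absolute frequencies (hypothetical, not taken from any real corpus).
domain = {"lago": 12, "ano": 12}
contrast = [{"ano": 3}, {"ano": 1}]

for term, score in rank_terms(domain, contrast):
    print(term, score)
```

In this toy example both terms have the same absolute frequency in the domain corpus, but "lago" keeps its full score because it does not occur in the contrasting corpora, whereas "ano" is penalized for occurring in both.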
The language resources currently provided are:

– List of terms and their morpho-syntactic and frequency features - a machine-readable resource listing each relevant term found within the corpus of interest;
– Concordancer - a human-readable list of the sentences in which a given term was found within the corpus of interest;
– Concept cloud - a set of the extracted terms considered the most relevant (i.e., concepts), arranged in a "tag cloud" format in which the terms are written in different font sizes according to their relevance;
– Concept hierarchy - a set of extracted concepts arranged hierarchically in the form of a hyperbolic tree, structured according to their semantic classification and the word playing the role of noun phrase head.

Figure 1(a) shows an extract of a list of bigrams extracted from a Geology corpus, giving each term as written in the corpus (term), its lemmatized version (lemma), its head word (head), its semantic tag (sem tag), its absolute frequency (tf), and its term frequency, disjoint corpora frequency (tf-dcf). In this extract, only terms with the head word "lago" ("lake" in English) were considered.

Figure 1(b) exemplifies the output of the ExATOlp concordancer, applied to locate and print to an HTML file the sentences in which the term "parentes" ("relatives" in English) appears in a Pediatrics corpus. This HTML file allows the user to observe some additional contextual information about each term occurrence.

Figure 1(c) presents an example of a concept cloud with the relevant bigrams of a Pediatrics corpus, i.e., the most relevant extracted bigrams placed randomly, but with font sizes proportional to their relevance.

Figure 1(d) presents an example of a hyperbolic tree showing a partial view (detail of the concept "lugares aquáticos", "aquatic places" in English) of the hierarchy of concepts extracted from a Geology corpus.

Fig. 1. Examples of linguistic resources: (a) list of terms, (b) concordancer output, (c) concept cloud, (d) concept hierarchy

References

1. E. Bick. The parsing system PALAVRAS: automatic grammatical analysis of Portuguese in constraint grammar framework. PhD thesis, Aarhus University, 2000.
2. S. N. Kim, T. Baldwin, and M.-Y. Kan. Extracting domain-specific words - a statistical approach. In L. Pizzato and R. Schwitter, editors, Proc. of the 2009 Australasian Language Technology Association Workshop, pages 94–98, Sydney, Australia, December 2009. Australasian Language Technology Association.
3. H. Kuramoto. Uma abordagem alternativa para o tratamento e a recuperação de informação textual: os sintagmas nominais. Revista Ciência da Informação, 25(2), 1996.
4. L. Lopes. Extração automática de conceitos a partir de textos em língua portuguesa. PhD thesis, PUCRS, 2012.
5. L. Lopes, P. Fernandes, R. Vieira, and G. Fedrizzi. ExATOlp – An Automatic Tool for Term Extraction from Portuguese Language Corpora. In Proc. of the 4th Language & Tech. Conf. (LTC '09), pages 427–431. Adam Mickiewicz Univ., 2009.
6. L. Lopes and R. Vieira. Improving quality of Portuguese term extraction. In PROPOR 2012 – International Conference on Computational Processing of Portuguese Language, 2012.
7. Y. Park, S. Patwardhan, K. Visweswariah, and S. C. Gates. An empirical analysis of word error rate and keyword error rate. In INTERSPEECH, pages 2070–2073, 2008.
8. J. Silva, A. Branco, S. Castro, and R. Reis. Out-of-the-box robust parsing of Portuguese. In PROPOR 2010, pages 75–85, 2010.