ExATOlp: extraction of language resources from
Portuguese corpora
Lucelene Lopes, Renata Vieira, Paulo Fernandes, and Gabriel Couto
FACIN – PUCRS – Porto Alegre – Brazil
{lucelene.lopes, renata.vieira, paulo.fernandes, gabriel.couto}@pucrs.br
Abstract. This paper presents four main features of the ExATOlp software tool. These features provide the following language resources: the relevant
terms of a corpus with their morpho-syntactic and frequency features; a concordancer (term contexts); concept tags; and concept hierarchies. The
tool emphasizes the high quality of the extracted terms. The provided resources offer a concise representation of non-obvious characteristics
of the extracted terms.
1 Introduction
ExATOlp is a software tool that extracts relevant terms from a domain corpus written
in Portuguese. The first version of this tool [5] was a prototype with some basic
techniques, but a new version of ExATOlp was recently released, incorporating
enhanced extraction techniques proposed by Lopes [4]. Among the improvements
made for the new version of ExATOlp, a set of four features was designed to
deliver useful language resources. This paper briefly describes the core of ExATOlp
in Section 2. Section 3 presents a few details and examples of the linguistic resources
generated using ExATOlp.
2 ExATOlp Core
The basic operation of ExATOlp consists of receiving a linguistically annotated corpus written in Portuguese and detecting the noun phrases within it. The current version
of ExATOlp accepts texts annotated by the PALAVRAS parser [1], but other
parsers, e.g., LX-center [8], could also be employed with the required adaptations.
An important aspect of the extraction is that noun phrases are defined generically as any single-word or multiword term that bears conceptual significance. Such a
general definition of noun phrases is in accordance with the modern terminological
literature [3].
The most important linguistic distinction of the ExATOlp extraction
procedure is the use of a set of 11 heuristic rules to improve the quality of extracted
terms. These heuristics were proposed by Lopes [4], and their effectiveness was
analyzed by Lopes and Vieira [6], who report precision and recall improvements of
approximately 50%.
In addition to the linguistic processing performed by the 11 heuristic rules, the ExATOlp
procedure is considerably enhanced by a frequency-based relevance estimation
made with the index called tf-dcf (term frequency, disjoint corpora frequency),
a novel relevance index proposed by Lopes [4]. The use of this index improves
extraction precision by approximately 30% over the
standard absolute frequency. Furthermore, the precision achieved with
the tf-dcf index was observed to be superior to that of other similar relevance indexes, for
instance [7, 2].
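This paper does not spell out how tf-dcf is computed, so the following is only a rough sketch of the idea behind such an index: a term's frequency in the domain corpus is divided by a penalty that grows with its frequency in each contrasting (disjoint) corpus. The penalty form 1 + log2(1 + tf), the function name tf_dcf, and its parameters are assumptions made here for illustration; the actual definition is the one given in [4].

```python
import math


def tf_dcf(tf_domain, tf_contrast):
    """Sketch of a tf-dcf-style relevance score for a single term.

    tf_domain:   absolute frequency of the term in the domain corpus.
    tf_contrast: absolute frequencies of the same term in each
                 contrasting (disjoint) corpus.

    ASSUMPTION: the penalty is the product of 1 + log2(1 + tf) over the
    contrasting corpora. This form is not stated in the paper.
    """
    penalty = 1.0
    for tf in tf_contrast:
        penalty *= 1.0 + math.log2(1.0 + tf)
    return tf_domain / penalty


# A term absent from every contrasting corpus keeps its raw frequency;
# a term that also occurs elsewhere is penalized.
print(tf_dcf(10, [0, 0]))
print(tf_dcf(10, [1]))
```

Under this assumed form, a term that never occurs outside the domain corpus scores exactly its absolute frequency, which is what lets the index be compared directly against the absolute-frequency baseline.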
3 Language Resources
Four features were implemented in the current version of ExATOlp in order to
provide language resources based on the relevant terms extracted. The full list
of extracted terms may be reduced to consider only the most relevant terms.
Which terms count as the most relevant depends on many
factors, including the purpose of the generated resources. In this paper we do not
discuss this choice further, except to note that ExATOlp offers the user many options of
relevance criteria, including a default option originally proposed by Lopes [4].
Specifically, the language resources currently provided are:
– List of terms and their morpho-syntactic and frequency features - a machine
readable resource describing each relevant term found within the corpus of
interest;
– Concordancer - a human readable list of the sentences in which a given term was
found within the corpus of interest;
– Concept cloud - a set of the extracted terms considered most relevant (i.e.,
concepts) arranged in a “tag cloud” format, where the terms are written in
different font sizes according to their relevance;
– Concept hierarchy - a set of extracted concepts arranged hierarchically in
the form of a hyperbolic tree, structured according to their semantic classification
and the word playing the role of noun phrase head.
Figure 1(a) shows an excerpt of a list of bigrams extracted from a Geology
corpus, giving each term as written in the corpus (term), its lemmatized version
(lemma), its head word (head), its semantic tag (sem tag), its absolute frequency (tf), and its term frequency, disjoint corpora frequency value (tf-dcf). In this excerpt,
only terms with the head word “lago” (“lake” in English) were considered.
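To make the shape of this resource concrete, one entry of the term list can be modelled as a small record and filtered by head word. This is a minimal sketch: the class TermEntry, the function terms_with_head, and every sample value below are hypothetical, chosen only to mirror the column names of Figure 1(a), and do not reproduce the tool's actual file format or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermEntry:
    """One row of the term list; field names mirror the figure's columns."""
    term: str      # term as written in the corpus
    lemma: str     # lemmatized version
    head: str      # head word of the noun phrase
    sem_tag: str   # semantic tag
    tf: int        # absolute frequency
    tf_dcf: float  # relevance index value


def terms_with_head(entries, head):
    """Return the entries whose head word is `head`, most relevant first."""
    selected = [e for e in entries if e.head == head]
    return sorted(selected, key=lambda e: e.tf_dcf, reverse=True)


# Toy rows invented for illustration; they are NOT taken from Figure 1(a),
# and the semantic tags are placeholders.
SAMPLE = [
    TermEntry("lagos rasos", "lago raso", "lago", "place", 4, 3.5),
    TermEntry("lago salino", "lago salino", "lago", "place", 9, 8.0),
    TermEntry("rocha vulcânica", "rocha vulcânico", "rocha", "material", 12, 6.0),
]

for entry in terms_with_head(SAMPLE, "lago"):
    print(entry.term, entry.tf, entry.tf_dcf)
```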
Figure 1(b) exemplifies the output of the ExATOlp concordancer, applied to
locate the sentences in which the term “parentes” (“relatives” in English) appears in a Pediatrics corpus and print them to an HTML file. This HTML file allows the user to
observe additional contextual information about each term occurrence.
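The behaviour of a concordancer of this kind can be sketched in a few lines. The sketch below assumes the corpus is already split into sentences and matches the term by plain case-insensitive string search; the real tool works on a parsed corpus, so the matching strategy, the function names, and the HTML layout here are all assumptions.

```python
import html
import re


def _term_pattern(term):
    # Whole-word, case-insensitive match of the literal term.
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


def concordance(sentences, term):
    """Return the sentences in which `term` occurs."""
    pattern = _term_pattern(term)
    return [s for s in sentences if pattern.search(s)]


def highlight(sentence, pattern):
    """HTML-escape a sentence and wrap each match of `pattern` in <b>."""
    parts = []
    last = 0
    for match in pattern.finditer(sentence):
        parts.append(html.escape(sentence[last:match.start()]))
        parts.append("<b>" + html.escape(match.group(0)) + "</b>")
        last = match.end()
    parts.append(html.escape(sentence[last:]))
    return "".join(parts)


def to_html(sentences, term):
    """Render the concordance of `term` as an HTML list."""
    pattern = _term_pattern(term)
    items = ["<li>" + highlight(s, pattern) + "</li>"
             for s in sentences if pattern.search(s)]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Toy sentences invented for illustration.
corpus = [
    "Os parentes foram avisados.",
    "A criança dormiu bem.",
    "Parentes & amigos vieram.",
]
print(to_html(corpus, "parentes"))
```

Escaping each fragment separately, rather than escaping the whole sentence and then searching it, keeps the highlighting correct even when the sentence contains characters such as "&" or "<".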
Figure 1(c) presents an example of a Concept Cloud with relevant bigrams of
a Pediatrics corpus, i.e., the most relevant extracted bigrams placed randomly,
but with font sizes proportional to their relevance.
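How relevance might be turned into font sizes can be illustrated as follows. This sketch assumes a linear mapping of the relevance scores onto a fixed range of point sizes; the function name font_sizes, the default range of 10 to 36 points, and the linear scaling itself are assumptions, since the paper only states that sizes follow relevance.

```python
def font_sizes(relevance, min_pt=10, max_pt=36):
    """Map each term's relevance score linearly onto [min_pt, max_pt].

    relevance: dict from term to a numeric relevance score.
    Returns a dict from term to an integer font size in points.
    """
    if not relevance:
        return {}
    lowest = min(relevance.values())
    highest = max(relevance.values())
    span = highest - lowest
    sizes = {}
    for term, score in relevance.items():
        # If all scores are equal, give every term the middle size.
        fraction = 0.5 if span == 0 else (score - lowest) / span
        sizes[term] = round(min_pt + fraction * (max_pt - min_pt))
    return sizes


# Toy scores invented for illustration.
print(font_sizes({"term a": 1.0, "term b": 3.0, "term c": 2.0}))
```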
Figure 1(d) presents an example of a hyperbolic tree showing a partial view
(detail of the concept “lugares aquáticos” - “aquatic places” in English) of the
hierarchy of concepts extracted from a Geology corpus.
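The grouping behind such a hierarchy can be sketched as a two-level nesting, first by semantic class and then by noun phrase head. This is only an illustration of the structuring criterion described above: the function build_hierarchy, the input triples, and the placeholder semantic tag are hypothetical, and the hyperbolic tree rendering itself is not covered.

```python
from collections import defaultdict


def build_hierarchy(concepts):
    """Group (term, sem_tag, head) triples as sem_tag -> head -> [terms]."""
    tree = defaultdict(lambda: defaultdict(list))
    for term, sem_tag, head in concepts:
        tree[sem_tag][head].append(term)
    # Convert to plain dicts with sorted leaves for stable output.
    return {
        tag: {head: sorted(terms) for head, terms in heads.items()}
        for tag, heads in tree.items()
    }


# Toy triples invented for illustration; the tag is a placeholder.
triples = [
    ("lago salino", "place", "lago"),
    ("lago raso", "place", "lago"),
    ("rio seco", "place", "rio"),
]
print(build_hierarchy(triples))
```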
Fig. 1. Examples of linguistic resources: (a) list of extracted terms; (b) concordancer output; (c) concept cloud; (d) concept hierarchy
References
1. E. Bick. The parsing system PALAVRAS: automatic grammatical analysis of Portuguese in a constraint grammar framework. PhD thesis, Aarhus University, 2000.
2. S. N. Kim, T. Baldwin, and M.-Y. Kan. Extracting domain-specific words - a statistical approach. In L. Pizzato and R. Schwitter, editors, Proc. of the 2009 Australasian
Language Technology Association Workshop, pages 94–98, Sydney, Australia, December 2009. Australasian Language Technology Association.
3. H. Kuramoto. Uma abordagem alternativa para o tratamento e a recuperação de
informação textual: os sintagmas nominais. Revista Ciência da Informação, 25(2),
1996.
4. L. Lopes. Extração automática de conceitos a partir de textos em língua portuguesa.
PhD thesis, PUCRS, 2012.
5. L. Lopes, P. Fernandes, R. Vieira, and G. Fedrizzi. ExATOlp – An Automatic Tool
for Term Extraction from Portuguese Language Corpora. In Proc. of the 4th Language
& Tech. Conf. (LTC ’09), pages 427–431. Adam Mickiewicz Univ., 2009.
6. L. Lopes and R. Vieira. Improving quality of Portuguese term extraction. In PROPOR 2012 – International Conference on Computational Processing of Portuguese
Language, 2012.
7. Y. Park, S. Patwardhan, K. Visweswariah, and S. C. Gates. An empirical analysis of
word error rate and keyword error rate. In INTERSPEECH, pages 2070–2073, 2008.
8. J. Silva, A. Branco, S. Castro, and R. Reis. Out-of-the-box robust parsing of Portuguese. In PROPOR 2010, pages 75–85, 2010.