Information Extraction
Adapted from the seminar:
DIAL: A Dedicated Information Extraction Language for Text Mining
by Marcus Sampaio
Presentation Outline

Introduction: The Concept
A System: ClearForest
A Language: DIAL
A Digression: Automatically Trainable IE Systems
Bibliography and References
Introduction
Example of an Information Extraction System
Subsystem I: Online Decision Support for the Operation of the Electrical System
Subsystem II: Advanced Offline Retrieval of Information about Operation
Major requirement: concise, relevant, and immediate information in contingency situations
An Architecture Sketch for a Decision Support System (SAD)
[Architecture diagram; labels recovered from the slide:]
Instructions (In) / Norms (No) / ...
wrapper
graphical editor, Validation
Relevant parts annotated and validated: (In/No/...).xml
integrator
Materialized View
Decision Support Interface: Actions, Alarms, Contingencies
Time requirement: an active, structured database, including (many) columns in text format
The Problem
Theme: Information Extraction
Information Extraction

Information Extraction (IE) is a technology that, instead of retrieving whole documents, extracts the pieces of information that are relevant to the user.
The material on this slide and the next four was copied from
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
IE: Glossary of Terms

Attribute: a property of an entity such as its name, alias, descriptor, or type
Annotation: mark up of a text span in a specific format that indicates a feature or features of the text within the span
Benchmark: assessment of performance according to standard measures
Data: textual input for an information extraction system
Dataset: a set of texts chosen according to pre-specified conditions and meant to represent a rich text stream
IE: Glossary of Terms (2)

Entity: an object of interest such as a person or organization
Event: an activity or occurrence of interest
Fact: a relationship held between two or more entities
IE: Glossary of Terms (3)

Information Extraction: the extraction or pulling out of pertinent information from large volumes of texts
Information Extraction Systems: an automated system to extract pertinent information from large volumes of text
Information Extraction Technologies: techniques used to automatically extract specified information from text
Metrics: pre-defined measures of performance calculable by comparison of system output with human-generated answer keys
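The glossary does not name the measures, so the following is only a minimal sketch, assuming the usual precision, recall, and F-measure computed by comparing system output with a human-generated answer key. The function name and the sample items are hypothetical.

```python
def precision_recall_f1(system_output, answer_key):
    """Score extracted items against a human-generated answer key."""
    system, key = set(system_output), set(answer_key)
    correct = len(system & key)                      # items found in both sets
    precision = correct / len(system) if system else 0.0
    recall = correct / len(key) if key else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical items: (entity type, text span) pairs.
answer_key = {
    ("Organization", "Stanford University"),
    ("Product", "Python"),
    ("Person", "J. Hammer"),
}
system_output = {
    ("Organization", "Stanford University"),
    ("Product", "Python"),
    ("IndustryTerm", "configurable tool"),
}
p, r, f = precision_recall_f1(system_output, answer_key)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Two of the three system items appear in the key and two of the three key items were found, so all three measures come out to two thirds in this toy case.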
IE: Glossary of Terms (4)

Task definition: a document that defines the format and criteria for annotating text, or for extracting information from text and placing it in a database or template.
For example, a task definition gives instructions and examples for the extraction of entities, attributes, facts, and events from text.
IE: State of the Art, in Terms of Performance
Information Item | Reliability (%)
Entities | 90
Attributes | 80
Facts | 70
Events | 60
Source: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
ClearForest: An IE System

Annotation
Extracting Semistructured Information from the Web
J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo
Department of Computer Science
Stanford University
Stanford, CA 94305-9040
{hector,joachim,cho,aranha,crespo}@cs.stanford.edu
http://www-db.stanford.edu/
We describe a configurable tool for extracting semi structured data from a set of HTML pages and for
converting the extracted information into database objects. The input to the extractor is a declarative
specification that states where the data of interest is located on the HTML pages, and how the data
should be “packaged” into objects. We have implemented the Web extractor using the Python
programming language stressing efficiency and ease-of-use. We also describe various ways of
improving the functionality of our current prototype. The prototype is installed and running in the
TSIMMIS test bed as part of a DARPA (Intelligent Integration of Information) technology
demonstration where it is used for extracting weather data form various WWW sites.
ClearForest (2)
<XML>
<Title>1187099374813-90148</Title>
<Date>2007-08-14</Date>
<Content Format="ClearForest">
<Text length="1121">
<Document>
<Title>1187099374813-90148</Title>
<Date>2007-08-14</Date>
<Body>Extracting Semistructured Information from the Web J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and
A. Crespo Department of Computer Science Stanford University Stanford, CA 94305-9040
{hector,joachim,cho,aranha,crespo}@cs.stanford.edu http://www-db.stanford.edu/ We describe a configurable
tool for extracting semi structured data from a set of HTML pages and for converting the extracted information into
database objects. The input to the extractor is a declarative specification that states where the data of interest is
located on the HTML pages, and how the data should be “packaged” into objects. We have implemented the Web
extractor using the Python programming language stressing efficiency and ease-of-use. We also describe various
ways of improving the functionality of our current prototype. The prototype is installed and running in the
TSIMMIS test bed as part of a DARPA (Intelligent Integration of Information) technology demonstration where it
is used for extracting weather data form various WWW sites.</Body>
</Document>
</Text>
</Content>
<Results>
<!-- Note: the demo version does not allow the IE task to be defined -->
<Entities>
<ProvinceOrState offset="249" length="2">
<ProvinceOrState offset="249" length="2">California</ProvinceOrState>
<Detection offset="249" length="2">CA</Detection>
</ProvinceOrState>
<IndustryTerm offset="358" length="17">
<IndustryTerm offset="358" length="17">configurable tool</IndustryTerm>
<Detection offset="358" length="17">configurable tool</Detection>
</IndustryTerm>
<IndustryTerm offset="1009" length="24">
<IndustryTerm offset="1009" length="24">technology demonstration</IndustryTerm>
<Detection offset="1009" length="24">technology demonstration</Detection>
</IndustryTerm>
<Technology offset="426" length="4">
<Technology offset="426" length="4">HTML</Technology>
<Detection offset="426" length="4">HTML</Detection>
</Technology>
<Technology offset="620" length="4">
<Technology offset="620" length="4">HTML</Technology>
<Detection offset="620" length="4">HTML</Detection>
</Technology>
<Organization offset="188" length="30">
<Organization offset="188" length="30">Department of Computer Science</Organization>
<Score offset="188" length="30">3</Score>
<Detection offset="188" length="30">Department of Computer Science</Detection>
</Organization>
<Organization offset="219" length="28">
<Organization offset="219" length="28">Stanford University Stanford</Organization>
<Score offset="219" length="28">5</Score>
<Detection offset="219" length="28">Stanford University Stanford</Detection>
</Organization>
<Product offset="732" length="6">
<Product offset="732" length="6">Python</Product>
<Detection offset="732" length="6">Python</Detection>
</Product>
</Entities>
<Events_Facts />
</Results>
</XML>
Note: the quality of the annotations (markup) here is not high.
For example, the authors of the article were not tagged.
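To make the result format above concrete, here is a small sketch that reads an abridged copy of the `<Entities>` block with Python's standard `xml.etree.ElementTree`. The element layout is taken from the output shown; the function name `list_entities` and the abridged sample are assumptions made for illustration, not part of ClearForest.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the <Results> block shown above (two entities only).
RESULTS = """
<Results>
  <Entities>
    <ProvinceOrState offset="249" length="2">
      <ProvinceOrState offset="249" length="2">California</ProvinceOrState>
      <Detection offset="249" length="2">CA</Detection>
    </ProvinceOrState>
    <Product offset="732" length="6">
      <Product offset="732" length="6">Python</Product>
      <Detection offset="732" length="6">Python</Detection>
    </Product>
  </Entities>
</Results>
"""


def list_entities(xml_text):
    """Return (type, offset, length, normalized value, detected text) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for entity in root.find("Entities"):
        # The inner element with the same tag holds the normalized value,
        # while <Detection> holds the text actually found in the document.
        normalized = entity.findtext(entity.tag)
        detected = entity.findtext("Detection")
        rows.append((entity.tag, int(entity.get("offset")),
                     int(entity.get("length")), normalized, detected))
    return rows


for row in list_entities(RESULTS):
    print(row)
```

Each entity carries an offset and a length into the document body, which is what a client needs in order to highlight the detected span.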
ClearForest (5)

Visualization of annotated documents
A Solution for Defining IE Tasks: The DIAL Language
Knowledge Engineering Approach [Appelt 99]

Grammars constructed by hand
Domain patterns defined by human experts through introspection and inspection of a corpus
Laborious tuning /* the price to pay */
Characteristics of the Language

DIAL is a formalism for defining IE tasks
Who programs in DIAL?
  The knowledge engineer (Knowledge Engineering Approach)
The language has a degree of independence from any particular knowledge domain (domain independence)
DIAL programs are domain-dependent
DIAL is the IE language of ClearForest
A First, Naive DIAL Solution for the Example from the Prologue
Extracting Semistructured Information from the Web
J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo
Department of Computer Science
Stanford University
Stanford, CA 94305-9040
{hector,joachim,cho,aranha,crespo}@cs.stanford.edu
http://www-db.stanford.edu/
We describe a configurable tool for extracting semi structured data from a set of HTML pages and for
converting the extracted information into database objects. The input to the extractor is a declarative
specification that states where the data of interest is located on the HTML pages, and how the data should be
“packaged” into objects. We have implemented the Web extractor using the Python programming language
stressing efficiency and ease-of-use. We also describe various ways of improving the functionality of our
current prototype. The prototype is installed and running in the TSIMMIS test bed as part of a DARPA
(Intelligent Integration of Information) technology demonstration where it is used for extracting weather
data form various WWW sites.
A DIAL Solution ... (2)

Specifically for the example, we want to annotate:

The title
The authors
The authors' e-mail addresses
The sentence referring to the prototype
A DIAL Solution ... (3)
Concept Titulo {
  Attributes:
    string Descricao;
};
Rule Titulo {
  Pattern:
    "Extracting ... from the Web" -> temp
  Actions:
    Add(Descricao <- temp);
};
Our first DIAL program
The authors, the e-mail addresses, and the passage about the prototype are handled in a similar way
The patterns are correct, but far too restrictive
A DIAL Solution ... (4)

With the help of these concepts and their rules, the IE Server (see the next slide) can annotate the document with <Titulo> ... </Titulo>, <Autores> ... </Autores>, <E-mails> ... </E-mails>, <Prototipo> ... </Prototipo>
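The annotation step can be sketched as follows. This is not how the IE Server is implemented; it only assumes that each concept instance is known as an (offset, length) span and that spans do not overlap. The function name `annotate` is hypothetical.

```python
def annotate(text, instances):
    """Wrap each (offset, length, concept) instance in <concept>...</concept> tags.

    Assumes the instances do not overlap.
    """
    out, cursor = [], 0
    for offset, length, concept in sorted(instances):
        out.append(text[cursor:offset])                       # untouched text
        out.append(f"<{concept}>{text[offset:offset + length]}</{concept}>")
        cursor = offset + length
    out.append(text[cursor:])
    return "".join(out)


text = "Extracting Semistructured Information from the Web J. Hammer"
title = "Extracting Semistructured Information from the Web"
instances = [(text.index(title), len(title), "Titulo")]
annotated = annotate(text, instances)
print(annotated)
```

Keeping instances as offsets into the original text means the text itself is never modified until the final tagging pass.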
The DIAL Environment
[Diagram, adapted from [Feldman 2007]. Components: Client Application; IE Task (DIAL Program); Document; IE Server; Database of Annotated Documents. The arrows between them are numbered 1 to 6.]
Simplified Architecture of the IE Server
[Diagram; stages recovered from the slide:]
Input: Non-annotated Text
Tokenization
(Shallow) Parsing
Basic building blocks
DIAL Program (IE Task Definition)
Pattern instances: offset, length
Output: Annotated Text
Shallow parsing improves speed and robustness at the expense of depth of analysis
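As a rough illustration of the first and last stages, the sketch below tokenizes a text while keeping offsets, then reports pattern instances as (offset, length) pairs. The tokenizer regex and the toy "pattern" (runs of capitalized tokens) are assumptions chosen for brevity; the real server's tokenizer and parser are not described in the slides.

```python
import re


def tokenize(text):
    """Return (token, offset, length) triples for words and punctuation marks."""
    return [(m.group(), m.start(), len(m.group()))
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def capitalized_runs(tokens):
    """Toy pattern: maximal runs of tokens that start with an uppercase letter.

    Returns pattern instances as (offset, length) pairs.
    """
    runs, start, end = [], None, None
    for tok, off, length in tokens:
        if tok[0].isupper():
            if start is None:
                start = off
            end = off + length
        elif start is not None:
            runs.append((start, end - start))
            start = None
    if start is not None:
        runs.append((start, end - start))
    return runs


text = "We work at Stanford University in California."
tokens = tokenize(text)
spans = capitalized_runs(tokens)
for offset, length in spans:
    print(offset, length, text[offset:offset + length])
```

Because no full syntactic analysis is attempted, the pass is linear in the number of tokens, which is the trade-off the slide attributes to shallow parsing.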
Structure of a DIAL Program

Block structure (Algol-like language)
Block
  Concepts
  Rules
The next four slides are copied from [Feldman 2007]
Structure ... (2)
Block | Section | Description
Concept | Attributes | These are filled with values from the concept instance's matching text
Concept | Guards | These are logical conditions on the attributes' values
Concept | Actions | Code operations to perform after finding a concept
Structure ... (3)
Block | Section | Description
Concept | Internal | Concepts that can be used only within the scope of the context and any inheriting concepts
Concept | Function | Add-on Perl functions, used only within the scope of the context and any inheriting concepts
Concept | Context | Defines the text units in which to search for the concept instances
Structure ... (4)
Block | Section | Description
Concept | Dependencies | Permits definition of an explicit dependency of one concept to another
Rule | Pattern | Defines the text pattern to match when searching for a concept instance
Rule | Constraints | Defines logical conditions to apply to values extracted from the pattern match
Structure ... (5)
Block | Section | Description
Rule | Action | Code operations to perform after finding a pattern match. Among other things, this is where concept instances are added
Guards, Constraints, Function, Internal

Why such a wealth of constructs?

It is estimated that the use of Guards, Constraints, Function, and Internal is practically limited to situations in which the document used for information extraction contains errors or inconsistencies
How to Guard Against an Invalid Date?
Concept Data {
Attributes:
number Dia;
number Mes;
number Ano;
Guards:
(Dia >= 1) AND (Dia <= 31);
(Mes >= 1) AND (Mes <= 12);
(Ano > 0);
};
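For readers more familiar with general-purpose languages, the guards above behave roughly like validation predicates on a record. The sketch below mirrors the three guards in Python; the assumption that an instance failing a guard is simply not added is mine, as are the names `guards_hold` and `add_if_valid`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Data:
    dia: int
    mes: int
    ano: int

    def guards_hold(self):
        # Same three conditions as the Guards section of the DIAL concept.
        return 1 <= self.dia <= 31 and 1 <= self.mes <= 12 and self.ano > 0


def add_if_valid(instances, candidate):
    """Keep the candidate only when every guard holds (assumed behavior)."""
    if candidate.guards_hold():
        instances.append(candidate)
    return instances


instances = []
add_if_valid(instances, Data(14, 8, 2007))    # accepted
add_if_valid(instances, Data(31, 13, 2007))   # rejected: month out of range
print(instances)
```

Like the DIAL guards, these checks are independent per attribute, so a date such as 31 February would still pass.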
Context
Concept ProcedimentoDeManobra {
  ...
  Context:
    (NOT Introdução) AND (NOT Seção X) AND ... // avoids parsing the indicated sections
  ...
};
Patterns








String Constants
Wordclass Names
Thesaurus Names
Concept Names
Character-Level Regular Expressions
Character Classes
Scanner Properties
Token Elements
Example 1: Extracting Person Names

In the motivating example, we had
J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo
But a person's name may be preceded by a title, may include a middle name, and so on.
Let us define the concept Pessoa, and rules for Pessoa, so as to gain some generality

J. Hammer, Miss Alice Douglas Fitzgerald, ...
Example 1 ... (2)
Wordclass wcTitulo = "Mr." Mr Mrs "Mrs." Miss /* bold on the original slide: mandatory */
Concept Pessoa {
  Attributes:
    string Titulo;
    string Prenome;
    string NomeDoMeio;
    string UltimoNome;
};
Rule Pessoa {
  Pattern:
    (wcTitulo ?)? -> titulo // -> Titulo
    Capital -> prenome (Capital ?)? -> nomedomeio
    Capital -> ultimonome
  Actions:
    Add (Titulo <- titulo, Prenome <- prenome.Text(), NomeDoMeio <- nomedomeio.Text(), UltimoNome <- ultimonome.Text())
};
X? indicates that the instance matching X is optional
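A rough Python analogue of the Pessoa rule, using a character-level regular expression instead of DIAL tokens. This is only a sketch: the `TITLES` and `CAPITAL` definitions are assumptions (for instance, an initial followed by a period is treated as a single capitalized token), and the group names simply reuse the variable names from the rule.

```python
import re

TITLES = r"(?:Mrs?\.?|Miss)"          # stands in for the wcTitulo wordclass
CAPITAL = r"[A-Z][A-Za-z-]*\.?"       # capitalized token, or an initial such as "J."

PERSON = re.compile(
    rf"(?:(?P<titulo>{TITLES})\s+)?"          # optional title
    rf"(?P<prenome>{CAPITAL})\s+"             # first name
    rf"(?:(?P<nomedomeio>{CAPITAL})\s+)?"     # optional middle name
    rf"(?P<ultimonome>{CAPITAL})"             # last name
)


def extract_person(text):
    """Return the attributes of the first person found, or None."""
    m = PERSON.search(text)
    return m.groupdict() if m else None


print(extract_person("Miss Alice Douglas Fitzgerald"))
print(extract_person("J. Hammer"))
```

Optional groups that do not take part in a match come back as `None`, which plays the role of an unfilled attribute.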
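Example 2: A Sophisticated Pattern

Phrases to match (kept in Portuguese; engenheiros/técnicos = engineers/technicians, laureados/premiados = award-winning, e = and):
Engenheiros laureados X, Y, ... e Z
Engenheiros premiados X, Y, ... e Z
Técnicos laureados X, Y, ... e Z
Técnicos premiados X, Y, ... e Z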
Example 2 ... (2)
Concept ListaDePessoas {};
Wordclass wcSubstantivos = engenheiros técnicos
Wordclass wcAdjetivos = laureados premiados
Rule ListaDePessoas {
  Pattern: wcSubstantivos wcAdjetivos
    (Pessoa /* concept reuse */ ->> /* list-append operator */ Lista "," ?)+ "e" Pessoa ->> Lista
  Actions:
    Iterate (Lista) Begin
      PessoaCorrente = Lista.CurrentItem();
      Add (Pessoa, PessoaCorrente, Prenome <- PessoaCorrente.Prenome, UltimoNome <- PessoaCorrente.UltimoNome);
    End;
};
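The list rule can be approximated in Python as below. This is a sketch under several assumptions: names are runs of capitalized words or initials, the final conjunction is the Portuguese "e" used in the example phrases, and the leading noun and adjective are matched case-insensitively. The function name `extract_list` is hypothetical.

```python
import re

NOUNS = r"(?i:engenheiros|técnicos)"        # wcSubstantivos, case-insensitive
ADJECTIVES = r"(?i:laureados|premiados)"    # wcAdjetivos, case-insensitive
NAME = r"[A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)*"  # one or more capitalized words

LIST = re.compile(
    rf"{NOUNS}\s+{ADJECTIVES}\s+"
    rf"(?P<lista>{NAME}(?:\s*,\s*{NAME})*\s+e\s+{NAME})"
)


def extract_list(text):
    """Return the individual names of the first matching list, or []."""
    m = LIST.search(text)
    if not m:
        return []
    # Split on commas and on the final conjunction, like iterating over Lista.
    return re.split(r"\s*,\s*|\s+e\s+", m.group("lista"))


text = "Engenheiros laureados J. Hammer, H. Garcia-Molina e A. Crespo"
print(extract_list(text))
```

Splitting the matched span afterwards corresponds loosely to the Iterate block, which visits each appended item in turn.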
Example 3: A More Refined Pattern for the Sentence Referring to a Prototype
We describe a configurable tool for extracting semi structured data
from a set of HTML pages and for converting the extracted
information into database objects. The input to the extractor is
a declarative specification that states where the data of interest
is located on the HTML pages, and how the data should be
“packaged” into objects. We have implemented the Web
extractor using the Python programming language stressing
efficiency and ease-of-use. We also describe various ways of
improving the functionality of our current prototype. The
prototype is installed and running in the TSIMMIS test bed as
part of a DARPA (Intelligent Integration of Information)
technology demonstration where it is used for extracting
weather data form various WWW sites.
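Note that two sentences use the word prototype (shown in green and in red on the original slide): the one ending in "our current prototype." and the one beginning "The prototype is installed".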
Example 3 ... (2)
"." Token+ prototype Token+ "."
and
"." Prototype Token+ "."

One rule for each pattern
Example 4: A More Refined Pattern for the Sentence Referring to a Prototype
We describe a configurable tool for extracting semi structured data
from a set of HTML pages and for converting the extracted
information into database objects. The input to the extractor is
a declarative specification that states where the data of interest
is located on the HTML pages, and how the data should be
“packaged” into objects. We have implemented the Web
extractor using the Python programming language stressing
efficiency and ease-of-use. We also describe various ways of
improving the functionality of our current prototype. The
prototype is installed and running in the TSIMMIS test bed as
part of a DARPA (Intelligent Integration of Information)
technology demonstration where it is used for extracting
weather data form various WWW sites.
We want to select a sentence only if it contains both TSIMMIS and DARPA, in either order
Example 4 ... (2)
"." Token+ TSIMMIS Token+ DARPA Token+ "."
and
"." Token+ DARPA Token+ TSIMMIS Token+ "."
One rule for each pattern
Token: stands for any token whose value is not important
Token+: stands for one or more tokens
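Outside DIAL, the "both keywords, in either order" condition is often easier to state as a membership test per sentence than as one pattern per ordering. The sketch below assumes a naive sentence splitter (a period followed by whitespace); the function name is hypothetical.

```python
import re


def sentences_with_all(text, keywords):
    """Return the sentences that contain every keyword, in any order."""
    sentences = re.split(r"(?<=\.)\s+", text)   # naive sentence splitter
    return [s for s in sentences
            if all(re.search(rf"\b{re.escape(k)}\b", s) for k in keywords)]


abstract = (
    "We also describe various ways of improving the functionality of our "
    "current prototype. The prototype is installed and running in the TSIMMIS "
    "test bed as part of a DARPA (Intelligent Integration of Information) "
    "technology demonstration where it is used for extracting weather data "
    "form various WWW sites."
)
hits = sentences_with_all(abstract, ["TSIMMIS", "DARPA"])
print(hits)
```

With n keywords the pattern-per-ordering approach needs n! rules, whereas the membership test stays a single check.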
Rule Constraints
Suppose we want to define a pattern as a sequence of words of any length except 5
Pattern:
  Capital+; /* Capital: any token beginning with an uppercase letter */
Constraints:
  this_match.TokenCount() != 5;
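The same idea in Python, as a sketch: match first, then discard matches that violate the constraint. Whitespace-separated words stand in for DIAL tokens here, which is an assumption, and `capital_sequences` is a hypothetical name.

```python
import re


def capital_sequences(text, forbidden_count=5):
    """Return runs of capitalized words, except runs of exactly forbidden_count."""
    results = []
    for m in re.finditer(r"[A-Z]\w*(?:\s+[A-Z]\w*)*", text):
        # Constraint applied after the pattern match, as in the DIAL rule.
        if len(m.group().split()) != forbidden_count:
            results.append(m.group())
    return results


text = ("the Department of Computer Science at Stanford University "
        "and Alpha Beta Gamma Delta Epsilon met")
print(capital_sequences(text))
```

The five-word run is matched by the pattern but filtered out by the constraint, so it never becomes an instance.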

Types of Pattern: More Examples

Character-Level Regular Expressions
  McDonald matches "Mc[A-Z][a-z]+"
Character Classes
  Charclass cc<name> = a set of character-level regular expressions
Scanner Properties
  Number /* numeric characters only */, Alfa /* alphabetic characters only */
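The character-level expression quoted above is valid as a Python regular expression too, so it can be tried directly. The two helper predicates are only rough stand-ins for the Number and Alfa scanner properties; their exact DIAL semantics are not given in the slides.

```python
import re

MC_NAME = re.compile(r"Mc[A-Z][a-z]+")


def is_number(token):
    """Rough stand-in for the Number scanner property: digits only."""
    return token.isdigit()


def is_alpha(token):
    """Rough stand-in for the Alfa scanner property: letters only."""
    return token.isalpha()


print(bool(MC_NAME.fullmatch("McDonald")))   # uppercase letter after "Mc"
print(bool(MC_NAME.fullmatch("Mcdonald")))   # lowercase letter after "Mc"
print(is_number("2007"), is_alpha("Python"), is_alpha("CA94305"))
```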
Final Remarks

Generation of annotations
  Every instance of a concept is tagged with the name of the concept
The BNF of the language is not available
Automatically Trainable IE Systems
Automatically Trainable Systems [Appelt 99]

Use statistical methods when possible
Learn rules from annotated corpora /* Find a suitable rule-induction algorithm? Develop a new algorithm? */
Learn rules from interaction with user
(LP)2: A Rule Induction Algorithm
Training: user-defined tagged corpus (by a knowledge engineer) -> Algorithm (LP)2 -> symbolic rules for inserting SGML tags into texts
- Shallow knowledge about Natural Language Processing
- Bottom-up generalization of examples in the training corpus
Application: non-tagged free text -> Algorithm (LP)2 -> SGML-tagged text
LEARNINGPINOCCHIO: IE tool that uses (LP)2
(LP)2 ... (2)

The logic of the algorithm [Ciravegna 2001]

Tagging rule
  Pattern of conditions on words -> Action inserting SGML tag in texts
  [window of w words] [user-defined tag] [window of w words]: a 2*w word window
Helped by
  Morphological analyzer
  Dictionary or Gazetteer
  (Shallow knowledge about LP)
(LP)2 ... (3)

An Example
A passage from a text in the corpus, ... The Seminar at 4 pm will be ...
is tagged as ... The Seminar at <stime> 4 pm will be ... /* Note: </stime> is not necessary */
The first rule: The Seminar at 4 pm -> The Seminar at <stime> 4 pm </stime>
Generalized rule (much more meaningful): at digit timeid -> at <stime> digit timeid </stime>
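To show what applying such a generalized rule looks like, here is a small sketch that inserts the `<stime>` tag from the seminar example wherever the token sequence "at digit timeid" occurs. It illustrates rule application only, not the (LP)2 induction algorithm; the `TIME_IDS` set is a hypothetical stand-in for a gazetteer of time identifiers.

```python
TIME_IDS = {"am", "pm"}   # hypothetical gazetteer for the "timeid" class


def apply_time_rule(tokens):
    """Apply: at digit timeid -> at <stime> digit timeid </stime>."""
    out, i = [], 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i].lower() == "at"
                and tokens[i + 1].isdigit()
                and tokens[i + 2].lower() in TIME_IDS):
            out += [tokens[i], "<stime>", tokens[i + 1], tokens[i + 2], "</stime>"]
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


tagged = " ".join(apply_time_rule("The Seminar at 4 pm will be".split()))
print(tagged)
```

Because the rule no longer mentions "The Seminar", it also fires on sentences the training passage never contained, such as "The lecture at 10 am starts".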
(LP)2 ... (4)

Note that the induced rules are completely different from DIAL concepts and rules
What about the precision of the induced rule?
  It is directly proportional to the number of documents in the corpus that refer to time in the same way (pattern or frequency)
  Several other time rules can be induced, with greater or lesser precision
Need for large and representative corpora
  Reliable patterns
(LP)2 is at the core of an IE prototype called LearningPinocchio
Knowledge Engineering Approach versus Automatically Trainable Systems

Which approach is better?

Evidently, neither of the two is definitively the better one
  Rather, each has its advantages and its disadvantages
Also evidently, each approach has its 'passionate' advocates
Knowledge Engineering Approach versus Automatically Trainable Systems (2)

DIAL versus (LP)2

DIAL
  The knowledge engineer (KE) has to write DIAL programs, possibly non-trivial ones
  The precision of the rules depends on the skill and knowledge of the KE, and on the expressive power of DIAL
  An advanced parser will have to be built
    Supported by DIAL programs for the semantic analysis of texts, or domain analysis
Knowledge Engineering Approach versus Automatically Trainable Systems (3)

(LP)2
  The work of the KE can now be simple and trivial
    Annotating corpora
  The rest is up to (LP)2
    Automatic induction, from the corpora, of rules for annotating texts outside the corpora
  'Flies in the soup' (the catches)
    Who can guarantee a priori that (LP)2 will induce relevant and precise rules for a particular domain?
    Are corpora available?
      Volume and quality
Knowledge Engineering Approach versus Automatically Trainable Systems (4)

Fact 1
  To date, the Knowledge Engineering Approach performs better, on average, than Automatically Trainable Systems
Fact 2
  Automatically Trainable Systems is an extremely appealing approach
  Most of the researchers in the field are working on this approach
    Text Mining, Machine Learning
Bibliography and References

Definition of IE: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
IE system (ClearForest): http://sws.clearforest.com/Blog/
[Feldman 2007] Feldman, R.; Sanger, J. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2007.
Feldman, R. et al. A Domain Independent Environment for Creating Information Extraction Modules. International Conference on Information and Knowledge Management (CIKM'01), pp. 586-592.
[Ciravegna 2001] Ciravegna, F. Adaptive Extraction from Text by Rule Induction and Generalization. IJCAI, 2001.
IE system (LearningPinocchio): http://tcc.itc.it/research/textec/toolsresources/learningpinocchio.html
[Appelt 99] Appelt, D.; Israel, D. Introduction to Information Extraction Technology. Tutorial prepared for IJCAI, 1999.