Data Integration:
The Teenage Years
(Alon Halevy, Anand Rajaraman, Joann Ordille)
Emanoel Carlos
ecgfs
2
Why Data Integration?
o Consider FullServe, a company that provides Internet access to
homes, but also sells a few products that support the home
computing infrastructure.
o FullServe decided to extend its reach to Europe. To do so, FullServe
acquired a European company, EuroCard, which is mainly a credit
card provider, but has recently started leveraging its customer base to
enter the Internet market.
3
Why Data Integration?
Some of the databases of FullServe:
o Human Resources Department
o Training and Development Department
o Sales Department
o Customer Care Department
4
Why Data Integration?
Some of the databases of EuroCard.
5
Why Data Integration?
o The Human Resources Department needs to be able to query for all of its
employees.
o FullServe has a single customer support hotline, which customers can
call about any service or product they obtain from the company.
o FullServe wants to build a Web site to complement its telephone
customer service line.
o Combining data from multiple sources can offer a company opportunities
to obtain a competitive advantage and to find opportunities for
improvement.
• For example, suppose we find that in a particular area of the country
FullServe is receiving an unusual number of calls about malfunctions in
its service.
6
Why Data Integration?
7
Why Data Integration?
o First, we face the challenge of schema heterogeneity, but on a much
larger scale: millions of tables created by independent authors and in
over 100 languages.
o Second, extracting the data is quite difficult.
o Data integration is a key challenge for the advancement of science in
fields such as biology, ecosystems, and water management, where
groups of scientists are independently collecting data and trying to
collaborate with one another.
o Data integration is a challenge for governments who want their
different agencies to be better coordinated.
o And lastly, mash-ups are now a popular paradigm for visualizing
information on the Web.
8
Why Data Integration?
o Query: The focus of most data integration systems is on querying disparate
data sources. However, updating the sources is certainly of interest.
o Number of sources: Data integration is already a challenge for a small number
of sources (fewer than 10, and often even 2!), but the challenges are
exacerbated when the number of sources grows.
o Heterogeneity: A typical data integration scenario involves data sources that
were developed independently of each other. As a consequence, the data
sources run on different systems: some of them are databases, but others
may be content management systems or simply files residing in a directory.
o Autonomy: The sources do not necessarily belong to a single administrative
entity.
• We cannot assume that we have full access to the data in a source or that
we can access the data whenever we want.
• Furthermore, the sources can change their data formats and access
patterns at any time, without having to notify any central administrative
entity.
9
Data Integration Architectures
Virtual Integration
Warehouse
10
Data Integration Architectures
Data Broker
11
Data Integration Architectures:
Virtual approach to integration
Query over the mediated schema
  ↓ query reformulation (using source descriptions)
Wrapper / Extractor for each source: RDBMS1, RDBMS2, ...
(Data remains in the sources and is fetched at query time.)
12
Data Integration Architectures:
Warehousing approach
Warehouse: queries run over materialized data
  ↑ transforms (ETL)
Wrapper / Extractor for each source: RDBMS1, RDBMS2, ...
(Data is loaded into the warehouse in advance.)
13
Example
Mediated schema:
Movie(title, director, year, genre)
Actors(title, actor)
Plays(movie, location, startTime)
Review(title, rating, description)
Data sources:
S1: Movies(name, actors, director, genre)
S2: Cinemas(place, movie, start)
S3: CinemasInNYC(cinema, title, startTime)
S4: CinemasInSF(location, movie, startingTime)
S5: Reviews(title, date, grade, review)
15
Query Processing
Query over the mediated schema
  ↓ Query Reformulator
Logical query plan over the sources
  ↓ Query Optimizer
Physical query plan over the sources
  ↓ Execution Engine (one subquery or fetch request per source)
Wrapper / Extractor for each source: RDBMS1, RDBMS2, ...
16
Query Processing
Mediated schema:
Movie(title, director, year, genre)
Actors(title, actor)
Plays(movie, location, startTime)
Review(title, rating, description)
S1: Movies(name, actors, director, genre)
• Tuples for Movie can be obtained from source S1, but the attribute title
needs to be reformulated to name.
S2: Cinemas(place, movie, start)
S3: CinemasInNYC(cinema, title, startTime)
S4: CinemasInSF(location, movie, startingTime)
• Tuples for Plays can be obtained from source S2, S3, or S4. Since S3 is
complete for showings in New York City (and S4 for San Francisco), we
choose them over S2.
S5: Reviews(title, date, grade, review)
• Since source S3 requires the title of a movie as input, and such a title is
not specified in the query, the query plan must first access source S1 and
then feed the movie titles returned from S1 as inputs to S3.
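A sketch of the resulting logical plan, in the Datalog-style notation used
later in these slides; because S3 requires title as an input, the plan is a
dependent join that evaluates S1 first and feeds each returned title into S3
(variable names are assumptions):
Q'(title, cinema, st) :- S1.Movies(title, actors, director, genre),
                         S3.CinemasInNYC(cinema, title, st)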
17
Query Processing
o In our example, the optimizer will decide which join algorithm to use to
combine results from S1 and S3.
o For example, the join algorithm may stream movie titles arriving from S1
and input them into S3, or it may batch them up before sending them to S3.
19
In Proceedings of the AAAI 1995 Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments
20
Information Manifold
o The main contribution of the Information Manifold was the way it
described the contents of the data sources it knew about.
o The Information Manifold proposed the method that later became
known as the Local-as-View approach (LAV): an information source is
described as a view expression over the mediated schema.
o Previous approaches had employed the Global-as-View (GAV) approach.
GAV: Global-as-View
o GAV defines the mediated schema as a set of views over the data
sources.
22
GAV: Global-as-View
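The GAV source descriptions appeared on this slide as a figure; below is a
reconstruction in the Datalog-style notation of the Halevy et al. paper.
S1.MovieYears and the S5-S7 relation names are assumptions, not recovered
from the slide:
Movie(title, director, year, genre) :-
    S1.Movies(title, actors, director, genre), S1.MovieYears(title, year)
Movie(title, director, year, genre) :-
    S5.MovieDirectors(title, director), S6.MovieGenres(title, genre),
    S7.MovieYears(title, year)
Plays(movie, location, startTime) :- S2.Cinemas(location, movie, startTime)
Plays(movie, location, startTime) :- S3.CinemasInNYC(location, movie, startTime)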
• The first expression shows how to obtain tuples for the Movie relation by
joining relations in S1.
• The second expression obtains tuples for the Movie relation by joining
data from sources S5, S6, and S7. Hence, the tuples that would be computed
for Movie are the result of the union of the first two expressions.
• Also note that the second expression requires that we know the director,
genre, and year of a movie. If one of these is missing, we will not have a
tuple for the movie in the relation Movie.
• The third and fourth expressions generate tuples for the Plays relation
by taking the union of S2 and S3.
23
Query over the mediated schema
o Suppose we have the following query over the mediated schema, asking for
comedies starting after 8 pm:
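A sketch of the query, reconstructed in the paper's Datalog-style notation
(the original slide showed it as an image):
Q(title, location, st) :- Movie(title, director, year, "comedy"),
                          Plays(title, location, st), st ≥ 8pm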
o Reformulating Q with the source descriptions would yield the following four
logical query plans:
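A sketch of the four plans, assuming the GAV descriptions reconstructed
above; they pair each way of obtaining Movie with each way of obtaining Plays:
Q1(t, l, st) :- S1.Movies(t, a, d, "comedy"), S1.MovieYears(t, y),
                S2.Cinemas(l, t, st), st ≥ 8pm
Q2(t, l, st) :- S1.Movies(t, a, d, "comedy"), S1.MovieYears(t, y),
                S3.CinemasInNYC(l, t, st), st ≥ 8pm
Q3(t, l, st) :- S5.MovieDirectors(t, d), S6.MovieGenres(t, "comedy"),
                S2.Cinemas(l, t, st), st ≥ 8pm
Q4(t, l, st) :- S5.MovieDirectors(t, d), S6.MovieGenres(t, "comedy"),
                S3.CinemasInNYC(l, t, st), st ≥ 8pm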
24
LAV: Local-as-View
o Instead of specifying how to compute tuples of the mediated schema,
LAV focuses on describing each data source as precisely as possible
and independently of any other sources.
25
LAV: Local-as-View
o In LAV, sources S5–S7 would be described simply as projection
queries over the Movie relation in the mediated schema.
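A sketch of these LAV descriptions, using the same assumed relation names as
in the GAV reconstruction:
S5.MovieDirectors(title, director) ⊆ Movie(title, director, year, genre)
S6.MovieGenres(title, genre) ⊆ Movie(title, director, year, genre)
S7.MovieYears(title, year) ⊆ Movie(title, director, year, genre)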
o With LAV we can also model the source S8 as a join over the mediated
schema:
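The slide's exact expression for S8 is not recoverable; a plausible sketch,
assuming S8 exports movie titles together with their reviews:
S8.MovieReviews(title, description) ⊆
    Movie(title, director, year, genre), Review(title, rating, description)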
o Furthermore, we can also express constraints on the contents of data
sources. For example, we can describe the following source that
includes movies produced after 1970 that are all comedies:
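A sketch, with the source name S9 and its attributes assumed:
S9.RecentComedies(title, year) ⊆
    Movie(title, director, year, "comedy"), year ≥ 1970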
26
LAV: Local-as-View
o Consider the following query asking for comedies produced in or after
1960:
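A sketch of the query in the same notation:
Q(title) :- Movie(title, director, year, "comedy"), year ≥ 1960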
o Using the sources S5–S7, we would generate the following
reformulation from the LAV source descriptions:
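A sketch of the reformulation; only S6 and S7 are needed, since together
they supply the genre and year of each title:
Q'(title) :- S6.MovieGenres(title, "comedy"), S7.MovieYears(title, year),
             year ≥ 1960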
27
Directions
28
Generating Schema Mappings
o One of the major bottlenecks in setting up a data integration
application is the effort required to create the source descriptions,
and more specifically, writing the semantic mappings between the
sources and the mediated schema.
o Semi-automatically generating schema mappings
• Based on clues that can be obtained from the schemas
themselves, such as linguistic similarities between schema
elements and overlaps in data values or data types of columns.
• Schema mapping tasks are often repetitive. Hence, we could use
Machine Learning techniques that consider the manually created
schema mappings as training data, and generalize from them to
predict mappings between unseen schemas.
• Automatic schema mapping is an AI-Complete problem.
29
Adaptive query processing
o The context in which a data integration system operates is very
dynamic and the optimizer has much less information than the
traditional setting.
o As a result, two things happen:
• The optimizer may not have enough information to decide on a
good plan
• A plan that looks good at optimization time may be arbitrarily bad
if the sources do not respond exactly as expected.
30
Model Management
o The goal of Model Management is to provide an algebra for
manipulating schemas and mappings.
o With such an algebra, complex operations on data sources are
described as simple sequences of operators in the algebra and
optimized and processed using a general system.
o Some of the operators that have been considered include the creation
of mappings, inverting and composing mappings, merging schemas and
schema differencing.
31
XML
32
Querying XML
o Relatively recent
o Still lacks an algebra
o Languages:
• XPath
• XQuery
• Others: XSLT, XLink, XPointer
33
XPath (paths + conditions)
o An XML document is a tree. Example: Bookstore.xml, whose bookstore root
contains book elements; a book has children title ("Harry Potter", with
attribute lang="en"), author ("J K. Rowling"), year ("2005"), and price
("29.99").
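A reconstruction of Bookstore.xml as a document rather than a tree; placing
the lang attribute on title follows the usual version of this example and is
an assumption:
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <!-- the slide's tree shows a second book element; its contents are not recoverable -->
</bookstore>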
34
XPath (constructs)
nodename         selects all nodes named "nodename"
/                separator; on its own, selects the root node
//               selects any descendant
.                selects the current node
..               selects the parent node
@                selects attributes

bookstore        nodes named "bookstore"
/bookstore       selects the root element "bookstore"
bookstore/book   selects all "book" children of "bookstore"
book             selects all "book" nodes, anywhere in the document
bookstore//book  selects all "book" descendants of "bookstore", at any depth
//@lang          selects all "lang" attributes
35
XPath (constructs)
element           accesses the element
*                 accesses any subelement
@element          accesses the attribute
//                accesses any descendant
[@element > 70]   tests the condition
[3]               accesses the 3rd child
contains(s1, s2)  tests whether string s1 contains string s2
36
XPath Demonstration
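A few examples over the Bookstore.xml sketched earlier, combining the
constructs above (results assume that reconstructed document):
/bookstore/book/title               all title elements of books
//title[@lang='en']                 titles whose lang attribute is "en"
/bookstore/book[1]                  the first book child of bookstore
/bookstore/book[price > 20]/title   titles of books with price above 20
//@lang                             all lang attributes in the document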
37
XQuery
o XQuery is the language for querying XML data
o XQuery is to XML what SQL is to databases
o XQuery is built on XPath expressions
o XQuery is supported by most major databases
o XQuery is a W3C recommendation
38
XQuery (FLWOR expressions)
For $variable in expr     <- iteration
Let $variable := expr     <- assignment
Where condition           <- filtering
Order by expr             <- sorting
Return expr               <- result construction
o Only the return clause is mandatory.
39
XQuery Demonstration
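A minimal FLWOR sketch over the Bookstore.xml assumed earlier
(doc("Bookstore.xml") presumes the file is available to the XQuery processor):
for $b in doc("Bookstore.xml")/bookstore/book
let $t := $b/title
where $b/price > 20
order by $t
return <expensive>{ $t }</expensive>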
40
Peer-to-Peer Data Management
o In the approaches discussed so far, one must create a mediated schema or
centralized database for the domain of the data integration application.
o Consider data sharing in a scientific context, where data may involve
scientific findings from multiple disciplines, such as genomic data,
diagnosis and clinical trial data, bibliographic data, and drug-related
data.
o The owners of the data might not want to explicitly create a mediated
schema that defines the entire scope of the collaboration and a set of
terms to which every data source owner needs to map.
o The basic approach of peer data management systems, or PDMSs, is
to eliminate the reliance on a central, authoritative mediated schema.
41
Peer-to-Peer Data Management
42
Semantic Integration
o Grouping and combining data from different sources while taking explicit
semantics into account
o Ontologies
43
Semantic Integration
Single ontology
Multiple ontologies
Hybrid
44
Semantic Integration
o A simple ontology of movies, with subclasses for comedies and
documentaries.
o Given this ontology, the system can reason about the relevance of
individual data sources to a query.
45
Semantic Integration
o S1 is relevant to a query asking for all movies, but S2 is not relevant to
a query asking for comedies.
o Sources S3 and S4 are relevant to queries asking for all movies that
won awards, and S4 is relevant to a query asking for movies that won
an Oscar.
46
Data Warehouse
o Data from several operational sources (on-line transaction processing
systems, OLTP) are extracted, transformed, and loaded (ETL) into a
data warehouse.
o Then, analysis, such as online analytical processing (OLAP), can be
performed on cubes of integrated and aggregated data.
47
Data Warehouse: ETL
o Extract
o Transform
• Simple mapping
• Aggregation and normalization
• Calculation
o Load
• Updating extracted data is frequently done on a daily, weekly, or
monthly basis.
48
Operations: Roll Up
o Example of a Roll Up operation using the Time dimension.
49
Operations: Drill Down
o Example of a Drill Down operation using the geographic location
dimension.
50
Operations: Drill Across
o Occurs when the user skips an intermediate level within a single
dimension.
o The user performs a Drill Across when moving, for example, directly from
year to quarter or month.
51
Operations: Drill Through
o Occurs when the user moves from information in one dimension to
information in another.
o Example: the user is in the time dimension and, in the next step, starts
analyzing the information by region.
52
Operations: Slice and Dice
o Slice is the operation that cuts the cube while keeping the same
perspective for viewing the data.
53
Operations: Slice and Dice
o Table 1 presents sales of cell phones and pagers.
o Table 2 represents a slice of those data (an operation that views only
the production of one product type: cell phones).
54
Operations: Slice and Dice
o Dice is a change of the perspective of the view.
o It is the extraction of a subcube, or the intersection of several slices.
55
Operations: Slice and Dice
o We were viewing and analyzing the data by state, city, year, product
model, and product (Table 1).
o Dice changes the perspective to product model, product, year, state, and
city (Table 2).
56
The Future of Data Integration
o Crowdsourcing and “Human Computing”
• Some conditions are hard for a computer to evaluate.
• For example, the quality of extractions from the Web can be
verified by humans, and schemas and data can be matched by
crowds.
o Lightweight Integration
• We often face a task where we need to integrate data from
multiple sources to answer a question that will be asked only once
or twice. However, the integration needs to be done quickly and
by people who may not have much technical expertise.
• For example, consider a disaster response situation in which
reports are coming from multiple data sources in the field, and the
goal is to corroborate them and quickly share them with the
affected public.
57
The Future of Data Integration
o Visualizing Integrated Data
• Users do not want to view rows of data but rather visualizations
that highlight the important patterns in the data and offer flexible
exploration.
o Cluster- and Cloud-Based Parallel Processing and Caching
• The ultimate vision of the data integration field is to be able to
integrate large numbers of sources with large amounts of data, ultimately
approaching the scale of the structured part of the Web.
58
References
o HALEVY, Alon; RAJARAMAN, Anand; ORDILLE, Joann. Data integration: the
teenage years. In: Proceedings of the 32nd International Conference on Very
Large Data Bases. VLDB Endowment, 2006. p. 9-16.
o ZIEGLER, Patrick; DITTRICH, Klaus R. Three Decades of Data Integration -
all Problems Solved?. In: Building the Information Society. Springer US,
2004. p. 3-12.
o REEVE, April. Managing Data in Motion: Data Integration Best Practice
Techniques and Technologies. Newnes, 2013.
o DOAN, AnHai; HALEVY, Alon; IVES, Zachary. Principles of Data Integration.
Elsevier, 2012.
o LÓSCIO, Bernadette. Integração de Dados: Ontem, hoje e sempre. Available
at: http://pt.slideshare.net/bernafarias/integracao-dados-ontem-hoje-e-sempre.
Accessed September 6, 2014.
o LÓSCIO, Bernadette. Integração de Informações no Governo Eletrônico.
Available at: http://pt.slideshare.net/bernafarias/integrao-de-informaes-no-governo-eletrnico.
Accessed September 6, 2014.
o OLIVEIRA, Stanley R. de M. OLAP: Online Analitical Processing. Available
at: http://pt.slideshare.net/Valldo/olap-1p. Accessed September 6, 2014.
o WIDOM, Jennifer. Introduction to Databases. Available at:
https://class.stanford.edu/courses/Engineering/db/2014_1/. Accessed
September 6, 2014.