Data Integration: The Teenage Years (Alon Halevy, Anand Rajaraman, Joann Ordille)
Emanoel Carlos (ecgfs)

2 Why Data Integration?
o Consider FullServe, a company that provides Internet access to homes and also sells a few products that support the home computing infrastructure.
o FullServe decided to extend its reach to Europe. To do so, it acquired EuroCard, a European company that is mainly a credit card provider but has recently started leveraging its customer base to enter the Internet market.

3 Why Data Integration?
o Some of the databases of FullServe: Human Resources, Training and Development, Sales, and Customer Care departments.

4 Why Data Integration?
o Some of the databases of EuroCard.

5 Why Data Integration?
o The Human Resources Department needs to be able to query for all of its employees.
o FullServe has a single customer support hotline, which customers can call about any service or product they obtain from the company.
o FullServe wants to build a Web site to complement its telephone customer service line.
o Combining data from multiple sources can offer a company opportunities to gain a competitive advantage and to find room for improvement.
• For example, suppose we find that in a particular area of the country FullServe is receiving an unusual number of calls about malfunctions in its service.

7 Why Data Integration?
o First, we face the challenge of schema heterogeneity, but on a much larger scale: millions of tables created by independent authors and in over 100 languages.
o Second, extracting the data is quite difficult.
o Data integration is a key challenge for the advancement of science in fields such as biology, ecosystems, and water management, where groups of scientists are independently collecting data and trying to collaborate with one another.
o Data integration is a challenge for governments that want their different agencies to be better coordinated.
o And lastly, mash-ups are now a popular paradigm for visualizing information on the Web.

8 Why Data Integration?
o Query: the focus of most data integration systems is on querying disparate data sources. However, updating the sources is certainly of interest.
o Number of sources: data integration is already a challenge for a small number of sources (fewer than 10, and often even 2!), but the challenges are exacerbated when the number of sources grows.
o Transparency.
o Heterogeneity: a typical data integration scenario involves data sources that were developed independently of each other. As a consequence, the sources run on different systems: some of them are databases, but others may be content management systems or simply files residing in a directory.
o Autonomy: the sources do not necessarily belong to a single administrative entity.
• We cannot assume that we have full access to the data in a source or that we can access the data whenever we want.
• Furthermore, the sources can change their data formats and access patterns at any time, without having to notify any central administrative entity.
9 Data Integration Architectures
o Virtual integration
o Warehouse

10 Data Integration Architectures
o Data broker

11 Data Integration Architectures: Virtual approach to integration
o Queries posed over the mediated schema are reformulated, using source descriptions, into queries over the sources; each source (e.g., RDBMS1, RDBMS2) is accessed through a wrapper/extractor.

12 Data Integration Architectures: Warehousing approach
o Data from the sources is transformed, via the source descriptions/transforms, and materialized into a warehouse; queries are then posed directly over the materialized data.

13 Example
o Mediated schema:
Movie(title, director, year, genre)
Actors(title, actor)
Plays(movie, location, startTime)
Review(title, rating, description)
o Sources:
S1: Movies(name, actors, director, genre)
S2: Cinemas(place, movie, start)
S3: CinemasInNYC(cinema, title, startTime)
S4: CinemasInSF(location, movie, startingTime)
S5: Reviews(title, date, grade, review)

15 Query Processing
o A query over the mediated schema goes to the Query Reformulator, producing a logical query plan over the sources; the Query Optimizer turns it into a physical query plan over the sources; the Execution Engine then issues a subquery or fetch request per source through the wrappers/extractors.

16 Query Processing
o Tuples for Movie can be obtained from source S1, but the attribute title needs to be reformulated to name.
o Tuples for Plays can be obtained from source S2, S3, or S4. Since S3 is complete for showings in New York City (and likewise S4 for San Francisco), we choose it over S2.
o Since source S3 requires the title of a movie as input, and such a title is not specified in the query, the query plan must first access source S1 and then feed the movie titles returned from S1 as inputs to S3.

17 Query Processing
o In our example, the optimizer will decide which join algorithm to use to combine results from S1 and S3. For example, the join algorithm may stream movie titles arriving from S1 and input them into S3, or it may batch them up before sending them to S3 (see the sketch below).
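To make the streaming-versus-batching choice concrete, here is a minimal Python sketch of the dependent (bind) join described above: titles flow out of S1 and are fed as inputs to S3, either one at a time or in batches. The wrapper functions and their data are hypothetical stand-ins for the per-source wrappers, not part of the original example.

```python
from typing import Iterable, Iterator, List, Tuple

def fetch_comedy_titles_s1() -> Iterator[str]:
    """Hypothetical wrapper for S1 (Movies): streams titles of comedies."""
    yield from ["Airplane!", "Annie Hall"]  # stand-in data

def fetch_showtimes_s3(titles: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Hypothetical wrapper for S3 (CinemasInNYC): requires titles as input."""
    for title in titles:
        yield (title, "Some NYC Cinema", "20:30")  # stand-in data

def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group a stream into fixed-size batches to reduce round trips."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Streaming variant: each title is sent to S3 as soon as S1 produces it.
for row in fetch_showtimes_s3(fetch_comedy_titles_s1()):
    print(row)

# Batching variant: titles are grouped before each call to S3.
for group in batched(fetch_comedy_titles_s1(), size=50):
    for row in fetch_showtimes_s3(group):
        print(row)
```

Streaming minimizes latency to the first answer; batching trades that latency for fewer round trips to the remote source.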
19 In Proceedings of the AAAI 1995 Spring Symp. on Information Gathering from Heterogeneous, Distributed Environments

20 Information Manifold
o The main contribution of the Information Manifold was the way it described the contents of the data sources it knew about.
o The Information Manifold proposed the method that later became known as the Local-as-View (LAV) approach: an information source is described as a view expression over the mediated schema.
o Previous approaches employed Global-as-View (GAV).

21 GAV: Global-as-View
o GAV defines the mediated schema as a set of views over the data sources.

22 GAV: Global-as-View
o The source descriptions take roughly the following Datalog form:
Movie(title, dir, year, genre) :- S1(title, dir, year, genre)
Movie(title, dir, year, genre) :- S5(title, dir), S6(title, year), S7(title, genre)
Plays(movie, loc, start) :- S2(movie, loc, start)
Plays(movie, loc, start) :- S3(movie, loc, start)
• The first expression shows how to obtain tuples for the Movie relation by joining relations in S1.
• The second expression obtains tuples for the Movie relation by joining data from sources S5, S6, and S7. Hence, the tuples computed for Movie are the union of the results of the first two expressions.
• Also note that the second expression requires that we know the director, genre, and year of a movie. If one of these is missing, we will not have a tuple for the movie in the relation Movie.
• The third and fourth expressions generate tuples for the Plays relation by taking the union of S2 and S3.

23 Query over the mediated schema
o Suppose we have the following query over the mediated schema, asking for comedies starting after 8 pm:
Q(title, loc, start) :- Movie(title, dir, year, "comedy"), Plays(title, loc, start), start >= 8pm
o Reformulating Q with the source descriptions would yield the following four logical query plans (a small sketch after slide 26 shows this unfolding mechanically):
Q'(title, loc, start) :- S1(title, dir, year, "comedy"), S2(title, loc, start), start >= 8pm
Q'(title, loc, start) :- S1(title, dir, year, "comedy"), S3(title, loc, start), start >= 8pm
Q'(title, loc, start) :- S5(title, dir), S6(title, year), S7(title, "comedy"), S2(title, loc, start), start >= 8pm
Q'(title, loc, start) :- S5(title, dir), S6(title, year), S7(title, "comedy"), S3(title, loc, start), start >= 8pm

24 LAV: Local-as-View
o Instead of specifying how to compute tuples of the mediated schema, LAV focuses on describing each data source as precisely as possible, independently of any other sources.

25 LAV: Local-as-View
o In LAV, sources S5–S7 would be described simply as projection queries over the Movie relation in the mediated schema:
S5(title, dir) :- Movie(title, dir, year, genre)
S6(title, year) :- Movie(title, dir, year, genre)
S7(title, genre) :- Movie(title, dir, year, genre)
o With LAV we can also model the source S8 as a join over the mediated schema, along the lines of:
S8(title, review) :- Movie(title, dir, year, genre), year >= 1990, Review(title, rating, review)
o Furthermore, we can also express constraints on the contents of data sources. For example, we can describe a source (call it S9) that includes movies produced after 1970, all of which are comedies:
S9(title, year) :- Movie(title, dir, year, "comedy"), year >= 1970

26 LAV: Local-as-View
o Consider the following query asking for comedies produced in or after 1960:
Q(title) :- Movie(title, dir, year, "comedy"), year >= 1960
o Using the sources S5–S7, we would generate the following reformulation from the LAV source descriptions:
Q'(title) :- S5(title, dir), S6(title, year), S7(title, "comedy"), year >= 1960
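Returning to the GAV reformulation on slide 23: the four plans arise mechanically, because each mediated-schema atom in the query is replaced by one of its GAV alternatives. A minimal Python sketch of that unfolding (relation names only; attribute renaming and predicates omitted for brevity):

```python
from itertools import product

# GAV rules from slide 22: each mediated-schema relation is defined by
# alternative source expressions; each alternative is a list of source
# relations to be joined.
GAV = {
    "Movie": [["S1"], ["S5", "S6", "S7"]],
    "Plays": [["S2"], ["S3"]],
}

def unfold(query_atoms):
    """Replace every mediated-schema atom in the query with one of its
    GAV alternatives; the Cartesian product of choices gives the plans."""
    for combo in product(*(GAV[atom] for atom in query_atoms)):
        yield [source for alternative in combo for source in alternative]

# The example query joins Movie and Plays, so unfolding yields 2 x 2 plans.
for plan in unfold(["Movie", "Plays"]):
    print(" JOIN ".join(plan))
# S1 JOIN S2
# S1 JOIN S3
# S5 JOIN S6 JOIN S7 JOIN S2
# S5 JOIN S6 JOIN S7 JOIN S3
```

LAV reformulation, by contrast, requires answering queries using views, which is algorithmically harder than this simple rule unfolding.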
27 Directions

28 Generating Schema Mappings
o One of the major bottlenecks in setting up a data integration application is the effort required to create the source descriptions, and more specifically, writing the semantic mappings between the sources and the mediated schema.
o Semi-automatically generating schema mappings:
• Based on clues that can be obtained from the schemas themselves, such as linguistic similarities between schema elements and overlaps in data values or data types of columns.
• Schema mapping tasks are often repetitive. Hence, we could use machine learning techniques that treat manually created schema mappings as training data and generalize from them to predict mappings between unseen schemas.
• Automatic schema mapping is an AI-complete problem.

29 Adaptive Query Processing
o The context in which a data integration system operates is very dynamic, and the optimizer has much less information than in the traditional setting.
o As a result, two things happen:
• The optimizer may not have enough information to decide on a good plan.
• A plan that looks good at optimization time may be arbitrarily bad if the sources do not respond exactly as expected.

30 Model Management
o The goal of Model Management is to provide an algebra for manipulating schemas and mappings.
o With such an algebra, complex operations on data sources can be described as simple sequences of operators in the algebra and optimized and processed by a general system.
o Some of the operators that have been considered include the creation of mappings, inverting and composing mappings, merging schemas, and schema differencing.

31 XML

32 Querying XML
o Relatively recent.
o Still lacks a standard algebra.
o Languages:
• XPath
• XQuery
• Others: XSLT, XLink, XPointer

33 XPath (paths + conditions)
o XML is tree-structured. (Figure: bookstore.xml as a tree, with a bookstore root; book children carrying a lang attribute ("en"); and title ("Harry Potter"), author ("J. K. Rowling"), year ("2005"), and price ("29.99") leaves.)

34 XPath (constructors)
o nodename: selects the nodes named "nodename"
o /: separator, or selects the root node
o //: selects any descendant
o .: selects the current node
o ..: selects the parent node
o @: selects attributes
o Examples:
• bookstore: nodes named "bookstore"
• /bookstore: selects the root "bookstore"
• bookstore/book: selects all "book" children of "bookstore"
• book: selects all "book" nodes, no matter where they are
• bookstore//book: selects all "book" descendants of "bookstore", at any level
• //@lang: selects all "lang" attributes

35 XPath (constructors)
o contains(s1, s2): tests whether s1 contains s2
o element: accesses the element
o *: accesses any sub-element
o @element: accesses the attribute
o //: accesses any descendant
o [@element>70]: tests the condition
o [3]: accesses the 3rd child

36 XPath demonstration

37 XQuery
o XQuery is the language for querying XML data.
o XQuery is to XML what SQL is to databases.
o XQuery is built on XPath expressions.
o XQuery is supported by most databases.
o XQuery is a W3C recommendation.

38 XQuery (FLWOR expressions)
o for $variable in expr (iterator)
o let $variable := expr (assignment)
o where condition (filter)
o order by expr (ordering)
o return expr
o Only return is mandatory.

39 XQuery demonstration
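Since the demonstrations themselves are not reproduced in the slides, here is a small self-contained sketch of the slide 33 bookstore example using Python's standard library xml.etree.ElementTree, whose findall() supports a subset of XPath. The document below is reconstructed from the tree on slide 33.

```python
# XPath over the bookstore example, using the XPath subset supported
# by Python's standard library (xml.etree.ElementTree).
import xml.etree.ElementTree as ET

BOOKSTORE = """
<bookstore>
  <book lang="en">
    <title>Harry Potter</title>
    <author>J. K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(BOOKSTORE)  # root is the <bookstore> element

# bookstore/book: all <book> children of <bookstore>
for book in root.findall("book"):
    print(book.find("title").text)       # -> Harry Potter

# .//book: all <book> descendants, at any level (XPath //)
print(len(root.findall(".//book")))      # -> 1

# Predicate on an attribute (XPath @lang): books with lang='en'
for book in root.findall(".//book[@lang='en']"):
    print(book.get("lang"))              # -> en
```

Full XPath and XQuery (including FLWOR) are not covered by the standard library; XML databases and libraries such as lxml go further.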
40 Peer-to-Peer Data Management
o So far, one must create a mediated schema or centralized database for the domain of the data integration application.
o Consider data sharing in a scientific context, where the data may involve findings from multiple disciplines, such as genomic data, diagnosis and clinical trial data, bibliographic data, and drug-related data.
o The owners of the data might not want to explicitly create a mediated schema that defines the entire scope of the collaboration and a set of terms to which every data source owner needs to map.
o The basic approach of peer data management systems (PDMSs) is to eliminate the reliance on a central, authoritative mediated schema.

42 Semantic Integration
o Grouping and combining data from different sources while taking explicit semantics into account.
o Ontologies.

43 Semantic Integration
o Approaches: single ontology, multiple ontologies, hybrid.

44 Semantic Integration
o A simple ontology of movies, with subclasses for comedies and documentaries.
o Given this ontology, the system can reason about the relevance of individual data sources to a query.

45 Semantic Integration
o S1 is relevant to a query asking for all movies, but S2 is not relevant to a query asking for comedies.
o Sources S3 and S4 are relevant to queries asking for all movies that won awards, and S4 is relevant to a query asking for movies that won an Oscar.

46 Data Warehouse
o Data from several operational sources (online transaction processing systems, OLTP) are extracted, transformed, and loaded (ETL) into a data warehouse.
o Then analysis, such as online analytical processing (OLAP), can be performed on cubes of integrated and aggregated data.

47 Data Warehouse: ETL
o Extract
o Transform
• Simple mapping
• Aggregation and normalization
• Calculation
o Load
• Updating the warehouse with extracted data is frequently done on a daily, weekly, or monthly basis.

48 Operations: Roll Up
o Example of a Roll Up operation along the Time dimension.

49 Operations: Drill Down
o Example of a Drill Down operation along the geographic location dimension.

50 Operations: Drill Across
o Occurs when the user skips an intermediate level within the same dimension.
o The user performs a Drill Across when moving directly from year to quarter or month.

51 Operations: Drill Through
o Occurs when the user moves from information in one dimension to information in another.
o Example: the user is in the Time dimension and, in the next step, starts analyzing the information by region.

52 Operations: Slice and Dice
o Slice is the operation that cuts the cube while keeping the same perspective for viewing the data.

53 Operations: Slice and Dice
o Table 1 shows sales of cell phones and pagers.
o Table 2 shows a slice of the data (viewing only the production of one product type: cell phones).

54 Operations: Slice and Dice
o Dice is a change in the perspective of the view.
o It is the extraction of a subcube, or the intersection of several slices.

55 Operations: Slice and Dice
o We were viewing and analyzing in the order state, city, year, product model, and product (Table 1).
o Dice changes the perspective to product model, product, year, state, and city (Table 2).
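A compact way to see roll-up, drill-down, slice, and dice together is on a toy sales "cube". The sketch below uses pandas; the table, column names, and numbers are invented for illustration.

```python
# Roll-up, drill-down, slice, and dice over a toy sales cube in pandas.
# All data and column names are invented for illustration.
import pandas as pd

sales = pd.DataFrame({
    "state":   ["SP", "SP", "RJ", "RJ"],
    "city":    ["Campinas", "Santos", "Rio", "Niteroi"],
    "year":    [2013, 2014, 2013, 2014],
    "product": ["cell phone", "pager", "cell phone", "pager"],
    "units":   [120, 30, 95, 20],
})

# Roll up: aggregate city-level facts up to the state level.
rollup = sales.groupby(["state", "year"])["units"].sum()

# Drill down: the inverse, back to the finer (state, city) grain.
drilldown = sales.groupby(["state", "city", "year"])["units"].sum()

# Slice: fix one dimension value, keeping the same perspective.
cell_phones = sales[sales["product"] == "cell phone"]

# Dice: extract a subcube by constraining several dimensions at once.
subcube = sales[(sales["product"] == "cell phone") & (sales["year"] == 2013)]

print(rollup, drilldown, cell_phones, subcube, sep="\n\n")
```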
56 The Future of Data Integration
o Crowdsourcing and "human computation":
• Some conditions are hard for a computer to evaluate.
• For example, the quality of extractions from the Web can be verified by humans, and schemas and data can be matched by crowds.
o Lightweight integration:
• We often face a task where we need to integrate data from multiple sources to answer a question that will be asked only once or twice, yet the integration needs to be done quickly and by people who may not have much technical expertise.
• For example, consider a disaster response situation in which reports are coming from multiple data sources in the field, and the goal is to corroborate them and quickly share them with the affected public.

57 The Future of Data Integration
o Visualizing integrated data:
• Users do not want to view rows of data, but rather visualizations that highlight the important patterns in the data and offer flexible exploration.
o Cluster- and cloud-based parallel processing and caching:
• The ultimate vision of the data integration field is to be able to integrate large numbers of sources with large amounts of data, ultimately approaching the scale of the structured part of the Web.

58 References
o HALEVY, Alon; RAJARAMAN, Anand; ORDILLE, Joann. Data integration: the teenage years. In: Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006. p. 9-16.
o ZIEGLER, Patrick; DITTRICH, Klaus R. Three decades of data integration: all problems solved? In: Building the Information Society. Springer US, 2004. p. 3-12.
o REEVE, April. Managing Data in Motion: Data Integration Best Practice Techniques and Technologies. Newnes, 2013.
o DOAN, AnHai; HALEVY, Alon; IVES, Zachary. Principles of Data Integration. Elsevier, 2012.
o LÓSCIO, Bernadette. Integração de Dados: Ontem, hoje e sempre. Available at: http://pt.slideshare.net/bernafarias/integracao-dados-ontem-hoje-e-sempre. Accessed September 6, 2014.
o LÓSCIO, Bernadette. Integração de Informações no Governo Eletrônico. Available at: http://pt.slideshare.net/bernafarias/integrao-de-informaes-no-governo-eletrnico. Accessed September 6, 2014.
o OLIVEIRA, Stanley R. de M. OLAP: Online Analytical Processing. Available at: http://pt.slideshare.net/Valldo/olap-1p. Accessed September 6, 2014.
o WIDOM, Jennifer. Introduction to Databases. Available at: https://class.stanford.edu/courses/Engineering/db/2014_1/. Accessed September 6, 2014.