Implementação do DW O problema e as soluções Grandes quantidades de dados => Métodos de acesso e processamento de interrogações eficientes Materialização de vistas para informação agregada: Identificar, explorar e actualizar de forma eficiente Índices: Index intersection, bit map indices, join indices SAD Tagus 2004/05 H. Galhardas Escolha dos agregados Objectivo: Minimizar o tempo de resposta Pergunta: Quais os agregados a materializar? Factores: Espaço de armazenamento – custo baixo! Custo de actualização – custo alto! Compromisso: Tempo de resposta vs custo de actualização SAD Tagus 2004/05 H. Galhardas Cube Operation Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96) SELECT item, city, year, SUM (amount) FROM SALES CUBE BY item, city, year Need compute the following Group-Bys () (date, product, customer), (date,product),(date, customer), (product, customer), (date), (product), (customer) (city) (item) () 2^n cuboids (city, item) (city, year) (city, item, year) SAD Tagus 2004/05 H. Galhardas (year) (item, year) Efficient Data Cube Computation Goal: efficient computation of aggregations across many sets of dimensions Take into account amount of main memory available and computation time Data cube can be viewed as a lattice of cuboids n T (Li 1) i 1 Materialization of data cube Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization) SAD Tagus 2004/05 H. Galhardas Partial materialization of cuboids Three factors must be considered: 1. Identify the subset of cuboids to materialize 2. Exploit the materialized cuboids during query processing 3. Take into account queries in the workload, their frequencies and accessing costs; cost of incremental updates, total storage requirements Popular approaches: use heuristics or materialize cuboids based on other frequently referenced cuboids How to use indexes and how to transform OLAP operations onto the selected cuboids Efficiently update the materialized cuboids during load and refresh SAD Tagus 2004/05 Explore parallelism and incremental update H. Galhardas Classification of agg. facts wrt the agg. functions Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. Ex: count(), sum(), min(), max(). Algebraic: if it can be computed by an algebraic function with M arguments (where M is a constant), each of which is obtained by applying a distributive aggregate function. Ex: avg(), min_N(), standard_deviation(). Holistic: if there is no algebraic function with M arguments that characteriezes the computation. Ex: median(), mode(), rank(). SAD Tagus 2004/05 H. Galhardas Arquitecturas de servidor OLAP Relational OLAP (ROLAP) Usa SGBDs relacionais ou relacional extendido para armazenar e gerir os dados do datawarehouse e usa middleware OLAP para suportar funcinalidades específicas do OLAP. Inclui optimização suportada pelo SGBDR, implementa lógica de navegação de agregação e serviços/ferramentas adicionais Maior escalabilidade Multidimensional OLAP (MOLAP) Motor de armazenamento multidimensional baseado em arrays (sparse matrix techniques) Indexação rápida de dados sumarizados pré-calculados Hybrid OLAP (HOLAP) Flexibilidade: baixo nível: relacional, alto nível: array Specialized SQL servers Suporte especializado para interrogações SQL sobre esquemas em2004/05 estrela e floco de neve H. Galhardas SAD Tagus Some efficient cube computation methods ROLAP-based cubing algorithms (Agarwal et al’96) Array-based cubing algorithm (Zhao et al’97) Bottom-up computation method (Bayer & Ramarkrishnan’99) And others... SAD Tagus 2004/05 H. Galhardas ROLAP-Based Cubing Algorithm Value-based addressing: dimension values accessed by key-based addressing strategies Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples Grouping is performed on some subaggregates as a “partial grouping step” Aggregates may be computed from previously computed aggregates, rather than from the base fact table SAD Tagus 2004/05 H. Galhardas Iceberg Cube Computing only the cuboid cells whose count or other aggregates satisfying the condition like HAVING COUNT(*) >= minsup Motivation Only a small portion of cube cells may be “above the water’’ in a sparse cube Only calculate “interesting” cells—data above certain threshold Avoid explosive growth of the cube Suppose 100 dimensions, only 1 base cell. How many aggregate cells if count >= 1? What about count >= 2? SAD Tagus 2004/05 H. Galhardas Multi-way Array Aggregation for Cube Computation Direct array addressing: dimension values accessed via the index of their corresponding array locations Partition arrays into chunks (a small subcube which fits in memory). Compressed sparse array addressing: (chunk_id, offset) Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, thereby reducing memory 62 63 64 C c2 c3 61 access and storage cost. c1 c 0 29 What is the best traversing order to do multi-way aggregation? SAD Tagus 2004/05 B b3 B13 b2 9 b1 5 b0 H. Galhardas 45 30 46 31 47 32 14 15 16 1 2 3 4 a0 a1 a2 a3 A 48 60 44 28 56 40 24 52 36 20 Cuboids ABC – base cuboid – already computed AB, AC, BC – 2-D cuboids A, B, C – 1-D cuboid All – 0-D cuboid Many possible orderings w/ which chunks can be read into memory SAD Tagus 2004/05 H. Galhardas Example (1) C c3 61 62 63 64 c2 45 46 47 48 c1 29 30 31 32 c0 b3 B b2 B13 14 15 44 28 9 24 b1 5 b0 1 2 3 4 a0 a1 a2 a3 56 40 36 A SAD Tagus 2004/05 60 16 H. Galhardas 20 52 Example (2) C c3 61 62 63 64 c2 45 46 47 48 c1 29 30 31 32 c0 b3 B b2 B13 14 15 44 28 9 24 b1 5 b0 1 2 3 4 a0 a1 a2 a3 56 40 36 A SAD Tagus 2004/05 60 16 H. Galhardas 20 52 Multi-Way Array Aggregation for Cube Computation (Cont.) Method: the planes should be sorted and computed according to their size in ascending order. See the details of Example 2.12 (pp. 75-78) Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane Limitation of the method: computing well only for a small number of dimensions If there are a large number of dimensions, “bottom-up computation” and iceberg cube computation methods can be explored SAD Tagus 2004/05 H. Galhardas Indexing OLAP Data: Bitmap Index Index on a particular column Each value in the column has a bit vector: bit-op is fast The length of the bit vector: # of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column Not suitable for high cardinality domains Base table Cust C1 C2 C3 C4 C5 Index on Region Index on Type Region Type RecIDAsia Europe America RecID Retail Dealer Asia Retail 1 1 0 1 1 0 0 Europe Dealer 2 2 0 1 0 1 0 Asia Dealer 3 1 0 0 3 0 1 America Retail 4 0 0 1 4 1 0 0 1 0 5 0 1 Europe Dealer 5 SAD Tagus 2004/05 H. Galhardas Indexing OLAP Data: Join Indices Join index: JI(R-id, S-id) where R (R-id, …) S (S-id, …) Traditional indices map the values to a list of record ids It materializes relational join in JI file and speeds up relational join — a rather costly operation In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table. Ex: fact table: Sales and two dimensions city and product A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city H. Galhardas SAD Tagus 2004/05 Example of join indexes SAD Tagus 2004/05 H. Galhardas Efficient Processing of OLAP Queries 1. Determine which operations should be performed on the available cuboids: transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection 2. Determine to which materialized cuboid(s) the relevant operations should be applied. 3. Exploring indexing structures and compressed vs. dense array structures in MOLAP SAD Tagus 2004/05 H. Galhardas Bibliografia (Livro) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Capítulo 2 no livro e Capítulo 3 nos drafts) (Artigo) Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, J. Gray et al, DMKD 1(1), 1997 (Artigo) On the computation of multidimensional aggregates, Agarwal et al, VLDB, 1996 SAD Tagus 2004/05 H. Galhardas