Implementação do DW
O problema e as soluções
Grandes quantidades de dados => Métodos de acesso e
processamento de interrogações eficientes

Materialização de vistas para informação agregada:
Identificar, explorar e actualizar de forma eficiente

Índices:
Index intersection, bit map indices, join indices
SAD Tagus 2004/05
H. Galhardas
Escolha dos agregados
Objectivo: Minimizar o tempo de resposta
Pergunta: Quais os agregados a materializar?
Factores:


Espaço de armazenamento – custo baixo!
Custo de actualização – custo alto!
Compromisso:
Tempo de resposta vs custo de actualização
SAD Tagus 2004/05
H. Galhardas
Cube Operation

Transform it into a SQL-like language (with a new operator
cube by, introduced by Gray et al.’96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year

Need compute the following Group-Bys
()
(date, product, customer),
(date,product),(date, customer), (product, customer),
(date), (product), (customer)
(city)
(item)
()
2^n cuboids
(city, item)
(city, year)
(city, item, year)
SAD Tagus 2004/05
H. Galhardas
(year)
(item, year)
Efficient Data Cube
Computation

Goal: efficient computation of aggregations across
many sets of dimensions

Take into account amount of main memory
available and computation time

Data cube can be viewed as a lattice of cuboids
n
T   (Li 1)
i 1

Materialization of data cube

Materialize every (cuboid) (full materialization), none (no
materialization), or some (partial materialization)
SAD Tagus 2004/05
H. Galhardas
Partial materialization of
cuboids
Three factors must be considered:
1.
Identify the subset of cuboids to materialize


2.
Exploit the materialized cuboids during query
processing

3.
Take into account queries in the workload, their frequencies
and accessing costs; cost of incremental updates, total
storage requirements
Popular approaches: use heuristics or materialize cuboids
based on other frequently referenced cuboids
How to use indexes and how to transform OLAP operations
onto the selected cuboids
Efficiently update the materialized cuboids during
load and refresh

SAD Tagus 2004/05
Explore parallelism and incremental update
H. Galhardas
Classification of agg. facts
wrt the agg. functions
Distributive: if the result derived by applying the function to n
aggregate values is the same as that derived by applying
the function on all the data without partitioning.
Ex: count(), sum(), min(), max().
Algebraic: if it can be computed by an algebraic function with
M arguments (where M is a constant), each of which is
obtained by applying a distributive aggregate function.
Ex: avg(), min_N(), standard_deviation().
Holistic: if there is no algebraic function with M arguments
that characteriezes the computation.
Ex: median(), mode(), rank().
SAD Tagus 2004/05
H. Galhardas
Arquitecturas de servidor OLAP
Relational OLAP (ROLAP)



Usa SGBDs relacionais ou relacional extendido para armazenar e
gerir os dados do datawarehouse e usa middleware OLAP para
suportar funcinalidades específicas do OLAP.
Inclui optimização suportada pelo SGBDR, implementa lógica de
navegação de agregação e serviços/ferramentas adicionais
Maior escalabilidade
Multidimensional OLAP (MOLAP)


Motor de armazenamento multidimensional baseado em arrays
(sparse matrix techniques)
Indexação rápida de dados sumarizados pré-calculados
Hybrid OLAP (HOLAP)

Flexibilidade: baixo nível: relacional, alto nível: array
Specialized SQL servers
Suporte especializado para interrogações SQL sobre esquemas
em2004/05
estrela e floco de neve H. Galhardas
SAD Tagus

Some efficient cube
computation methods
ROLAP-based cubing algorithms (Agarwal et
al’96)
 Array-based cubing algorithm (Zhao et al’97)
 Bottom-up computation method (Bayer &
Ramarkrishnan’99)
 And others...

SAD Tagus 2004/05
H. Galhardas
ROLAP-Based Cubing
Algorithm

Value-based addressing: dimension values accessed by
key-based addressing strategies

Sorting, hashing, and grouping operations are applied to the
dimension attributes in order to reorder and cluster related
tuples

Grouping is performed on some subaggregates as a “partial
grouping step”

Aggregates may be computed from previously computed
aggregates, rather than from the base fact table
SAD Tagus 2004/05
H. Galhardas
Iceberg Cube
Computing only the cuboid cells
whose count or other aggregates
satisfying the condition like

HAVING COUNT(*) >= minsup

Motivation



Only a small portion of cube cells may be “above the
water’’ in a sparse cube
Only calculate “interesting” cells—data above certain
threshold
Avoid explosive growth of the cube

Suppose 100 dimensions, only 1 base cell. How many
aggregate cells if count >= 1? What about count >= 2?
SAD Tagus 2004/05
H. Galhardas
Multi-way Array Aggregation
for Cube Computation

Direct array addressing: dimension values accessed via the index of their
corresponding array locations

Partition arrays into chunks (a small subcube which fits in memory).

Compressed sparse array addressing: (chunk_id, offset)

Compute aggregates in “multiway” by visiting cube cells in the order which
minimizes the # of times to visit each cell, thereby reducing memory
62
63
64
C c2 c3 61
access and storage cost.
c1
c 0 29
What is the best
traversing order
to do multi-way
aggregation?
SAD Tagus 2004/05
B
b3
B13
b2
9
b1
5
b0
H. Galhardas
45
30
46
31
47
32
14
15
16
1
2
3
4
a0
a1
a2
a3
A
48
60
44
28 56
40
24 52
36
20
Cuboids





ABC – base cuboid – already computed
AB, AC, BC – 2-D cuboids
A, B, C – 1-D cuboid
All – 0-D cuboid
Many possible orderings w/ which chunks
can be read into memory
SAD Tagus 2004/05
H. Galhardas
Example (1)
C
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
b3
B
b2
B13
14
15
44
28
9
24
b1
5
b0
1
2
3
4
a0
a1
a2
a3
56
40
36
A
SAD Tagus 2004/05
60
16
H. Galhardas
20
52
Example (2)
C
c3 61
62
63
64
c2 45
46
47
48
c1 29
30
31
32
c0
b3
B
b2
B13
14
15
44
28
9
24
b1
5
b0
1
2
3
4
a0
a1
a2
a3
56
40
36
A
SAD Tagus 2004/05
60
16
H. Galhardas
20
52
Multi-Way Array Aggregation
for Cube Computation (Cont.)
Method: the planes should be sorted and computed
according to their size in ascending order.


See the details of Example 2.12 (pp. 75-78)
Idea: keep the smallest plane in the main memory, fetch and
compute only one chunk at a time for the largest plane
Limitation of the method: computing well only for a small
number of dimensions

If there are a large number of dimensions, “bottom-up
computation” and iceberg cube computation methods can be
explored
SAD Tagus 2004/05
H. Galhardas
Indexing OLAP Data:
Bitmap Index





Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for
the indexed column
Not suitable for high cardinality domains
Base table
Cust
C1
C2
C3
C4
C5
Index on Region
Index on Type
Region Type RecIDAsia Europe America RecID Retail Dealer
Asia
Retail
1
1
0
1
1
0
0
Europe Dealer 2
2
0
1
0
1
0
Asia
Dealer 3
1
0
0
3
0
1
America Retail
4
0
0
1
4
1
0
0
1
0
5
0
1
Europe Dealer 5
SAD Tagus 2004/05
H. Galhardas
Indexing OLAP Data: Join
Indices


Join index: JI(R-id, S-id) where R (R-id,
…)  S (S-id, …)
Traditional indices map the values to a
list of record ids


It materializes relational join in JI file and
speeds up relational join — a rather costly
operation
In data warehouses, join index relates
the values of the dimensions of a star
schema to rows in the fact table.
Ex: fact table: Sales and two dimensions city
and product
A join index on city maintains for each distinct
city a list of R-IDs of the tuples recording the
Sales in the city
H. Galhardas
SAD Tagus 2004/05
Example of join indexes
SAD Tagus 2004/05
H. Galhardas
Efficient Processing of
OLAP Queries
1.
Determine which operations should be performed
on the available cuboids:

transform drill, roll, etc. into corresponding SQL and/or
OLAP operations, e.g, dice = selection + projection
2.
Determine to which materialized cuboid(s) the
relevant operations should be applied.
3.
Exploring indexing structures and compressed vs.
dense array structures in MOLAP
SAD Tagus 2004/05
H. Galhardas
Bibliografia

(Livro) Data Mining: Concepts and Techniques, J.
Han & M. Kamber, Morgan Kaufmann, 2001
(Capítulo 2 no livro e Capítulo 3 nos drafts)

(Artigo) Data Cube: A relational aggregation operator
generalizing group-by, cross-tab, and sub-totals, J.
Gray et al, DMKD 1(1), 1997

(Artigo) On the computation of multidimensional
aggregates, Agarwal et al, VLDB, 1996
SAD Tagus 2004/05
H. Galhardas
Baixar
Document

Slides

poster

Document

Slides

poster

Analysis of Pel Decimation and Technology Choices to

Alistair McIntosh

TREINAMENTO PEDAGÓGIO 26 de Fevereiro de 2009

CSF III Management Commission

Apresentação do PowerPoint