Data Warehouse and Data Mining - Unit 4
DATA CUBE TECHNOLOGY
A data cube allows data to be viewed in multiple dimensions. Dimensions are the entities with respect to which an organization wants to keep records. For example, in store sales records, dimensions allow the store to keep track of things like monthly sales of items and the branches and locations. A multidimensional database helps to provide data-related answers to complex business queries quickly and accurately. A data cube is a three-dimensional (3-D) or higher-dimensional set of values; it is an abstraction of data used to analyze aggregated information from a number of points of view.
A data cube is a multidimensional architecture. The dimensions are aggregated as the measure attribute, while the remaining dimensions are known as the feature attributes. Data is viewed on a cube in a multidimensional way, and the aggregated and summarized facts can be displayed with variables or attributes. This is where OLAP plays a role: data warehouse systems provide OLAP tools for interactive analysis of multidimensional data at varied granularity levels. Data cubes are widely used for simple data analysis; they represent quantities of business interest along with dimensions. Each cube dimension reflects some characteristic of the database, such as revenue by day, month, or year.
Although the data cube concept was originally intended for OLAP, it is also useful for data mining. Multidimensional data mining is an approach to data mining that integrates OLAP-based data analysis with knowledge discovery techniques. It is also known as exploratory multidimensional data mining and online analytical mining (OLAM). It searches for interesting patterns by exploring the data in multidimensional space. This gives users the freedom to dynamically focus on any subset of interesting dimensions. Users can interactively drill down or roll up to varying abstraction levels to find classification models, clusters, predictive rules, and outliers.
Example:

Table 4.1: 3D view of sales data according to time, product and location

        Location="Pokhara"      Location="Kawasoti"     Location="Dhangadhi"    Location="Mahendranagar"
        TV    PC    AC   SSD    TV    PC    AC   SSD    TV    PC    AC   SSD    TV    PC    AC   SSD
Q1      1032  924   59   789    1034  1048  45   1002   940   795   58   728    812   1023  30   501
Q4      1129  992   63   870    1142  1091  54   984    978   864   59   784    927   1038  38   580
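As a rough illustration of this multidimensional view, the Q1 and Q4 figures above can be loaded as base cells keyed by (time, location, product) and rolled up in plain Python. The product and location spellings follow the reconstructed labels of Figure 4.1, so treat them as assumptions:

```python
# Sketch: Table 4.1 as base cells of a small cube, keyed by
# (time, location, product); values are rupees sold (thousands).

products = ["TV", "PC", "AC", "SSD"]
sales = {
    ("Q1", "Pokhara"):       [1032, 924, 59, 789],
    ("Q1", "Kawasoti"):      [1034, 1048, 45, 1002],
    ("Q1", "Dhangadhi"):     [940, 795, 58, 728],
    ("Q1", "Mahendranagar"): [812, 1023, 30, 501],
    ("Q4", "Pokhara"):       [1129, 992, 63, 870],
    ("Q4", "Kawasoti"):      [1142, 1091, 54, 984],
    ("Q4", "Dhangadhi"):     [978, 864, 59, 784],
    ("Q4", "Mahendranagar"): [927, 1038, 38, 580],
}

# Flatten into base cells: (time, location, product) -> measure.
cube = {(q, loc, p): v
        for (q, loc), row in sales.items()
        for p, v in zip(products, row)}

# Roll up location and product: total sales per quarter.
per_quarter = {}
for (q, loc, p), v in cube.items():
    per_quarter[q] = per_quarter.get(q, 0) + v
print(per_quarter)
```

The same base cells can be rolled up along any other dimension (per location, per product) by changing the grouping key, which is exactly the "multiple points of view" idea.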
Data Cube Technology · CHAPTER 4

Figure 4.1: 3D data cube of sales data (measure displayed is rupees_sold, in thousands); the axes are product type (TV, PC, AC, SSD), location, and time.
Data cubes are specifically grouped into two classifications. These are described below:
a)	Multidimensional Data Cube: Most OLAP products are built on a structure in which the cube is patterned as a multidimensional array. Compared to other methods, these multidimensional OLAP (MOLAP) products typically provide better performance, primarily because they can index directly into the data cube structure to capture subsets of data. The cube becomes sparser as the number of dimensions grows; that is, many cells that represent unique attribute combinations end up storing no aggregated data. This in turn raises the storage needs, which can at times become excessive, rendering the MOLAP solution untenable for massive, multidimensional data sets. Compression strategies may help, but their use may damage MOLAP's natural indexing.
b)	Relational OLAP: Relational OLAP (ROLAP) uses the relational database architecture. Instead of a multidimensional array, the ROLAP data cube is stored as a collection of relational tables (approximately twice as many as the number of dimensions). Each of these tables, referred to as a cuboid, denotes a particular perspective.
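A minimal sketch of the storage difference between the two classes: MOLAP indexes a dense array positionally, while ROLAP stores each dimension value explicitly in relational rows. The data and names here are invented for illustration:

```python
# Sketch: the same 2-D sales facts stored MOLAP-style (dense array,
# position encodes the dimension values) vs. ROLAP-style (relational
# rows, dimension values stored explicitly).

locations = ["Pokhara", "Kawasoti"]   # dimension 1
products  = ["TV", "PC"]              # dimension 2

# MOLAP: a dense array; cell (i, j) is indexed directly, no keys stored.
molap = [[1032, 924],
         [1034, 1048]]

# ROLAP: one relational table per cuboid; here the base cuboid only.
rolap = [(loc, prod, molap[i][j])
         for i, loc in enumerate(locations)
         for j, prod in enumerate(products)]

# Direct positional lookup (MOLAP) vs. a scan/selection (ROLAP).
i, j = locations.index("Kawasoti"), products.index("PC")
print(molap[i][j])
print([v for l, p, v in rolap if l == "Kawasoti" and p == "PC"])
```

The positional lookup is why MOLAP tends to answer point and range queries faster, while the explicit rows are what make ROLAP cubes cheap to store when the data is sparse.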
Figure 4.2: Lattice of cuboids making up a 3-D data cube with the dimensions A, B, and C, for some aggregate measure, M. (The lattice runs from the apex cuboid at the top, through the 1-D cuboids A, B, C and the 2-D cuboids AB, AC, BC, down to the 3-D base cuboid ABC.)
Figure 4.2 shows a 3-D data cube for the dimensions A, B, and C, and an aggregate measure, M. A data cube is a lattice of cuboids. Each cuboid represents a group-by. ABC is the base cuboid, containing all three of the dimensions. Here, the aggregate measure, M, is computed for each possible combination of the three dimensions. The base cuboid is the least generalized of all of the cuboids in the data cube. The most generalized cuboid is the apex cuboid, commonly represented as all. It contains one value: it aggregates measure M for all of the tuples stored in the base cuboid. To drill down in the data cube, we move from the apex cuboid, downward in the lattice. To roll up, we move from the base cuboid, upward. For the purposes of our discussion in this chapter, we will always use the term data cube to refer to a lattice of cuboids rather than an individual cuboid.
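The lattice itself is easy to enumerate: the cuboids of an n-dimensional cube are exactly the subsets of its dimension set. A short sketch:

```python
from itertools import combinations

# Sketch: enumerate the lattice of cuboids (all group-bys) for an
# n-dimensional cube. Each cuboid is the subset of dimensions it keeps;
# the empty subset is the apex cuboid ("all"), the full set the base.

def cuboid_lattice(dimensions):
    n = len(dimensions)
    return [subset
            for k in range(n + 1)
            for subset in combinations(dimensions, k)]

lattice = cuboid_lattice(("A", "B", "C"))
print(lattice)
# A cube of n dimensions has 2**n cuboids (here 2**3 = 8).
```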
DATA CUBE MATERIALIZATION (PRE-COMPUTATION)

There are three choices for data cube materialization and they are as follows:

1.	No Materialization: Do not pre-compute any of the "non-base" cuboids. This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow.

2.	Full Materialization: Pre-compute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically requires huge amounts of memory space in order to store all of the pre-computed cuboids.

3.	Partial Materialization: Selectively compute a proper subset of the whole set of possible cuboids. Alternatively, we may compute a subset of the cube, which contains only cells satisfying some user-specified criterion. It should consider three factors: (a) identify the subset of cuboids to materialize; (b) exploit the materialized cuboids during query processing; and (c) efficiently update the materialized cuboids during load and refresh.
A cell in the base cuboid is a base cell. A cell from a non-base cuboid is an aggregate cell. An aggregate cell aggregates over one or more dimensions, where each aggregated dimension is indicated by a "*" in the cell notation. Suppose we have an n-dimensional data cube. Let a = (a1, a2, ..., an, measures) be a cell from one of the cuboids making up the data cube. We say that a is an m-dimensional cell (i.e., from an m-dimensional cuboid) if exactly m (m ≤ n) values among {a1, a2, ..., an} are not "*". If m = n, then a is a base cell; otherwise, it is an aggregate cell (i.e., where m < n). For example, if we take a 3-D cube, all 1-D and 2-D cells are aggregate cells and the 3-D cells are base cells. The 1-D and 2-D cells are ancestors of a 3-D cell, whereas the 3-D cell is their descendant.
Different cube materializations include: full cube, iceberg cube, closed cube, and shell cube.
1.	Full cube: In order to ensure fast OLAP, it is sometimes desirable to pre-compute the full cube (i.e., all the cells of all of the cuboids for a given data cube). This, however, is exponential in the number of dimensions. That is, a data cube of n dimensions contains 2^n cuboids. There are even more cuboids if we consider concept hierarchies for each dimension. In addition, the size of each cuboid depends on the cardinality of its dimensions. Thus, pre-computation of the full cube can require huge and often excessive amounts of memory.
2.	Iceberg cube: It contains only those cuboid cells whose measure satisfies the aggregate or iceberg condition. The aggregate condition could be minimum support, or a lower bound on average, min, or max. It is called an iceberg cube because it contains only some of the cells of the full cube, like the tip of an iceberg. The purpose of the iceberg cube is to identify and compute only those values that will most likely be required for decision support queries. The aggregate condition specifies which cube values are more meaningful and should therefore be stored. This is one solution to the problem of computing versus storing data cubes.
3.	Closed cube: It consists of only closed cells. A cell is closed if it has no descendant cell with the same measure value as itself.
4.	Shell cube: Another strategy for partial materialization is to pre-compute only portions or fragments of the cube shell (cuboids involving a small number of dimensions, such as 3 to 5), based on the cuboids of interest. For example, the shell fragment cube ABC contains 7 cuboids: {A, B, C, AB, AC, BC, ABC}.
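A naive (non-pruned) illustration of an iceberg cube: compute every group-by of a toy relation and keep only the cells meeting a minimum count. The tuples and threshold are invented:

```python
from collections import Counter
from itertools import combinations

# Sketch: a naive iceberg cube over toy tuples (A, B, C) with the
# iceberg condition count >= min_sup. Each aggregate cell replaces the
# collapsed dimensions with "*".

rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a1", "b2", "c1"),
        ("a2", "b1", "c1")]
min_sup = 2
dims = range(3)

counts = Counter()
for row in rows:
    for k in range(len(dims) + 1):                  # every group-by
        for kept in combinations(dims, k):
            cell = tuple(v if d in kept else "*" for d, v in enumerate(row))
            counts[cell] += 1

# Keep only the "tip of the iceberg": cells meeting the condition.
iceberg = {cell: c for cell, c in counts.items() if c >= min_sup}
print(iceberg)
```

This version still computes every cell before filtering; efficient methods such as BUC (later in this chapter) avoid ever generating most of the cells that fail the condition.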
With different kinds of cubes as described above, we can expect that there are a good number of methods for their efficient computation. In general, there are two basic data structures used for storing cuboids. Relational tables are used as the basic data structure for the implementation of relational OLAP (ROLAP), while multidimensional arrays are used as the basic data structure in multidimensional OLAP (MOLAP). Although ROLAP and MOLAP may each explore different cube computation techniques, some optimization "tricks" can be shared among the different data representations. The following are general optimization techniques for the efficient computation of data cubes.
1.	Sorting, hashing, and grouping: Sorting, hashing, and grouping operations should be applied to the dimension attributes in order to reorder and cluster related tuples. In cube computation, aggregation is performed on the tuples (or cells) that share the same set of dimension values. Thus, it is important to explore sorting, hashing, and grouping operations to access and group such data together to facilitate the computation of such aggregates. For example, to compute total sales by branch, day, and product, it is more efficient to sort tuples or cells by branch, then by day, and then group them according to the product name. Such implementations can be extended to data cube computation.
This technique can also be further extended to perform:
•	Shared-sorts: sharing sorting costs across multiple cuboids when sort-based methods are used.
•	Shared-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used.
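A small sketch of the sort-based idea: one sort of the fact tuples by (branch, day) serves both the (branch, day) and the (branch) group-bys, since any list sorted by (branch, day) is also sorted by branch. Field names and data are invented:

```python
from itertools import groupby

# Sketch: sort-based aggregation with a shared sort. Sorting the fact
# tuples by (branch, day) clusters related tuples, so each group-by is
# a single sequential pass; groupby() requires presorted input.

facts = [("B2", "Mon", 5), ("B1", "Tue", 3), ("B1", "Mon", 4),
         ("B2", "Mon", 2), ("B1", "Tue", 1)]
facts.sort(key=lambda t: (t[0], t[1]))        # one sort ...

by_branch_day = {k: sum(v for *_, v in g)
                 for k, g in groupby(facts, key=lambda t: (t[0], t[1]))}
by_branch = {k: sum(v for *_, v in g)         # ... shared by both group-bys
             for k, g in groupby(facts, key=lambda t: t[0])}

print(by_branch_day)
print(by_branch)
```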
2.	Simultaneous aggregation and caching of intermediate results: In cube computation, it is efficient to compute higher-level aggregates from previously computed lower-level aggregates, rather than from the base fact table. Moreover, simultaneous aggregation from cached intermediate computation results may lead to a reduction of expensive disk I/O operations.

For example, to compute sales by branch, we can use the intermediate results derived from the computation of a lower-level cuboid, such as sales by branch and day. This technique can be further extended to perform amortized scans, i.e., computing as many cuboids as possible at the same time to reduce disk reads.
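A minimal sketch of this reuse: sales by branch computed from an already-computed (branch, day) cuboid rather than by rescanning the fact table. Numbers are invented:

```python
# Sketch: computing a higher-level aggregate (sales by branch) from a
# previously computed lower-level cuboid (sales by branch and day).
# The lower-level cuboid usually has far fewer cells than the raw
# fact table, so this pass is much cheaper than a base-table scan.

sales_by_branch_day = {("B1", "Mon"): 4, ("B1", "Tue"): 4, ("B2", "Mon"): 7}

sales_by_branch = {}
for (branch, _day), total in sales_by_branch_day.items():
    sales_by_branch[branch] = sales_by_branch.get(branch, 0) + total

print(sales_by_branch)
```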
3.	Aggregation from the smallest child, when there exist multiple child cuboids: When there exist multiple child cuboids, it is usually more efficient to compute the desired parent (i.e., more generalized) cuboid from the smallest, previously computed child cuboid.

For example, to compute a sales cuboid, C_branch, when there exist two previously computed cuboids, C_{branch, year} and C_{branch, item}, it is obviously more efficient to compute C_branch from the former than from the latter if there are many more distinct items than distinct years.
4. The Apriori pruning method can be explored to compute iceberg cubes efficiently: The
Apriori property, in the context of data cubes, states as follows: If a given cell does not satisfy
minimum support, then no descendant (i.e., more specialized or detailed version) of the cell will satisfy
minimum support either. This property can be used to substantially reduce the computation of
iceberg cubes.
Recall that the specification of iceberg cubes contains an iceberg condition, which is a constraint on the cells to be materialized. A common iceberg condition is that the cells must satisfy a minimum support threshold, such as a minimum count or sum. In this situation, the Apriori property can be used to prune away the exploration of the cell's descendants.
DATA CUBE COMPUTATION METHODS
Data cube computation is an essential task in data warehouse implementation. The pre-computation
of all or part of a data cube can greatly reduce the response time and enhance the performance of
OLAP. However, such computation is challenging because it may require substantial computation time and storage space. This section explores efficient methods for data cube computation, which are as follows:
1.	The multi-way array aggregation (MultiWay) method for computing full cubes.
2.	A method known as BUC, which computes iceberg cubes from the apex cuboid downward.
3.	The Star-Cubing method, which integrates top-down and bottom-up computation.
4.	High-dimensional OLAP.
MULTI-WAY ARRAY AGGREGATION FOR COMPUTING FULL CUBES

Limitation of the method: it computes well only for a small number of dimensions. If there are a large number of dimensions, "bottom-up computation" and iceberg cube computation methods (such as BUC) can be explored instead.

Example: Consider a 3-D data array containing the three dimensions A, B, and C. The 3-D array is partitioned into small chunks, as shown in Figure 4.3. Here, the array is organized into 64 chunks. Chunks 1, 2, ..., 64 correspond to the sub-cubes a0b0c0, a1b0c0, ..., a3b3c3, respectively.
Figure 4.3: A 3-D array for the dimensions A, B, and C, organized into 64 chunks. (Each dimension is split into four partitions: a0 to a3, b0 to b3, and c0 to c3.)
Suppose that the cardinality and array size of A, B, and C are 40, 400, and 4000, respectively. Thus, the size of each partition is 10, 100, and 1000, respectively. Full materialization of the corresponding data cube involves the computation of all of the cuboids defining this cube, i.e., {apex, A, B, C, AB, AC, BC, ABC}.

Let's look at how the multiway array aggregation technique is used in this computation. There are many possible orderings with which chunks can be read into memory for use in cube computation. Consider the ordering labelled from 1 to 64, shown in Figure 4.3. Suppose we would like to compute the b0c0 chunk of the BC cuboid. We allocate space for this chunk in chunk memory. By scanning chunks 1 to 4 of ABC, the b0c0 chunk is computed. That is, the cells for b0c0 are aggregated over a0 to a3. The chunk memory can then be assigned to the next chunk, b1c0, which completes its aggregation after the scanning of the next four chunks of ABC: 5 to 8. Continuing in this way, the entire BC cuboid can be computed. Therefore, only one chunk of BC needs to be in memory at a time, for the computation of all of the chunks of BC. In computing the BC cuboid, we will have scanned each of the 64 chunks.
To avoid rescanning all of these chunks for the computation of other cuboids, such as AC and AB, the multiway aggregation or simultaneous aggregation idea comes in. For example, when chunk 1 (i.e., a0b0c0) is being scanned (say, for the computation of the 2-D chunk b0c0 of BC, as described above), all of the other 2-D chunks relating to a0b0c0 can be simultaneously computed. That is, when a0b0c0 is being scanned, each of the three chunks, b0c0, a0c0, and a0b0, on the three 2-D aggregation planes, BC, AC, and AB, should be computed then as well. In other words, multiway computation simultaneously aggregates to each of the 2-D planes while a 3-D chunk is in memory.

Now let's look at how different orderings of chunk scanning and of cuboid computation can affect the overall data cube computation efficiency. The possible orderings are:

a)	Ordering 1, 2, 3, 4, ..., 64: By scanning chunks 1 to 4 of ABC, b0c0 is computed (aggregating over a0 to a3). Multiway aggregation is used to avoid rescanning particular chunks. To avoid bringing any 3-D chunk into memory more than once, the minimum memory requirement for holding all relevant 2-D planes is:
	whole AB plane + one row of the AC plane + one chunk of the BC plane
	= 40 x 400 + 40 x 1000 + 100 x 1000 = 156,000 memory units.

b)	Ordering 1, 17, 33, 49, 5, 21, 37, 53, ...: The minimum memory requirement for holding all relevant 2-D planes is:
	whole BC plane + one row of the AC plane + one chunk of the AB plane
	= 400 x 4000 + 40 x 1000 + 10 x 100 = 1,641,000 memory units.

Hence, the most efficient ordering is 1, 2, ..., 64.
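The memory figures in the two orderings can be verified with a few lines, using the example's dimension sizes (40, 400, 4000) and four partitions per dimension:

```python
# Sketch: minimum-memory arithmetic for the two chunk-scan orderings in
# the MultiWay example (|A| = 40, |B| = 400, |C| = 4000, four partitions
# per dimension, so chunk edges are 10, 100, 1000).

A, B, C = 40, 400, 4000
a, b, c = A // 4, B // 4, C // 4      # chunk edges: 10, 100, 1000

# Ordering 1, 2, ..., 64: whole AB plane + one row of AC + one BC chunk.
mem_row_order = A * B + A * c + b * c
# Ordering 1, 17, 33, 49, ...: whole BC plane + one row of AC + one AB chunk.
mem_col_order = B * C + A * c + a * b

print(mem_row_order, mem_col_order)
```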
Drill-down: where we can move from a highly aggregated cell to lower, more detailed cells. Roll-up: where we can move from detailed, low-level cells to higher-level, more aggregated cells.

BUC: COMPUTING ICEBERG CUBES FROM THE APEX CUBOID DOWNWARD
BUC is an algorithm for sparse and iceberg cube computation. Unlike MultiWay, BUC constructs the cube from the apex cuboid toward the base cuboid, as shown in Figure 4.5. Partitioning and aggregation are the main costs in BUC's cube computation. Since the recursive partitioning in BUC does not cut the input size, both dividing and aggregation are expensive. BUC uses the bottom-up approach, which allows it to prune unnecessary computation by applying the Apriori pruning strategy: if a given cell does not satisfy minimum support, then neither can any of its descendants. The iceberg cube problem is to compute all group-bys that satisfy an iceberg condition.
Figure 4.5: BUC's exploration for the computation of a 3-D data cube. Note that the computation starts from the apex cuboid.
Suppose that dimension A has four distinct values, a1, a2, a3, a4; B has four distinct values, b1, b2, b3, b4; and C has two distinct values, c1, c2. To construct the iceberg cube, we must compute every combination of the grouping attributes that satisfies minimum support (i.e., that covers at least 3 tuples). To do so, BUC scans the input, aggregating the tuples to obtain a count for all, corresponding to the cell (*, *, *).

•	Starting with dimension A value a1, the a1 partition is aggregated, creating one tuple for the A group-by, corresponding to the cell (a1, *, *).
•	Suppose (a1, *, *) satisfies the minimum support, in which case a recursive call is made on the partition for a1.
•	Suppose the cell count for (a1, b1, *) is 2, which does not satisfy the minimum support, so BUC prunes any further exploration of (a1, b1, *). That is, it avoids partitioning this cell on dimension C.
•	It backtracks to the a1 partition and recurses on (a1, b2, *), and so on.
Figure 4.6: BUC partitioning snapshot given an example 3-D data set.
The performance of BUC is sensitive to the order of the dimensions and to skew in the data. Ideally, the most discriminating dimensions should be processed first; that is, dimensions should be processed in order of decreasing cardinality. The higher the cardinality is, the smaller the partitions are, and thus, the more partitions there will be, thereby providing BUC with a greater opportunity for pruning. Similarly, the more uniform a dimension is (i.e., having less skew), the better it is for pruning.
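A compact sketch of BUC-style recursion with Apriori pruning, on invented tuples with min_sup = 2: a partition is expanded into further dimensions only if its count reaches the threshold.

```python
# Sketch of BUC-style bottom-up iceberg computation with Apriori
# pruning. A cell is recorded, and its partitions explored, only if
# its tuple count meets min_sup; otherwise the whole subtree of
# descendants is skipped.

def buc(rows, dims, cell, min_sup, out):
    if len(rows) < min_sup:          # Apriori pruning: no descendant
        return                       # of this cell can reach min_sup
    out[tuple(cell)] = len(rows)
    for d in dims:                   # partition on each remaining dim
        parts = {}
        for row in rows:
            parts.setdefault(row[d], []).append(row)
        for value, part in parts.items():
            child = cell.copy()
            child[d] = value
            remaining = [d2 for d2 in dims if d2 > d]   # avoid duplicates
            buc(part, remaining, child, min_sup, out)

rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a1", "b2", "c1"),
        ("a2", "b1", "c1")]
result = {}
buc(rows, [0, 1, 2], ["*", "*", "*"], 2, result)
print(result)
```

The recursion starts at the apex cell (*, *, *) and specializes downward, which is exactly the apex-to-base direction described above; the pruned branches (e.g., the a2 partition with a single tuple) are never aggregated at all.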
THE STAR-CUBING METHOD, WHICH INTEGRATES TOP-DOWN AND BOTTOM-UP COMPUTATION

Star-Cubing integrates the top-down and bottom-up methods and explores shared dimensions. For example, dimension C is the shared dimension of cuboids AC and C; the notation AC/C means that cuboid AC has shared dimension C. Star-Cubing allows for shared computations; e.g., cuboid C is computed simultaneously with AC. Star-Cubing aggregates in a top-down manner, but with a bottom-up sub-layer underneath that allows Apriori pruning. Its shared dimensions grow in bottom-up fashion, as shown in Figure 4.7.
Figure 4.7: Star-Cubing: bottom-up computation with top-down expansion of shared dimensions.
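A minimal sketch of the shared-computation idea: while cuboid AC is aggregated in a single scan, cuboid C (the shared dimension) is accumulated in the same pass instead of requiring a separate scan. Tuples and measures are invented:

```python
# Sketch: shared computation in the spirit of Star-Cubing. One pass
# over the input fills cuboid AC, and cuboid C (its shared dimension)
# is accumulated "for free" during the same pass.

rows = [("a1", "b1", "c1", 2), ("a1", "b2", "c1", 3),
        ("a2", "b1", "c2", 4), ("a1", "b1", "c2", 1)]

ac, c = {}, {}
for a_v, _b_v, c_v, m in rows:          # single pass over the input
    ac[(a_v, c_v)] = ac.get((a_v, c_v), 0) + m
    c[c_v] = c.get(c_v, 0) + m          # shared-dimension cuboid C

print(ac)
print(c)
```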
HIGH-DIMENSIONAL OLAP

Setting the iceberg threshold too low still results in a huge cube, whereas setting the threshold too high may invalidate many useful applications. Moreover, an iceberg cube cannot be incrementally updated. Once an aggregate cell falls below the iceberg threshold and is pruned, its measure value is lost. Any incremental update would require recomputing the cells from scratch. This is extremely undesirable for large real-life applications where incremental appending of new data is the norm.
One possible solution, which has been implemented in some commercial data warehouse systems, is to compute a thin cube shell. For example, we could compute all cuboids with three dimensions or less in a 60-dimensional data cube, resulting in a cube shell of size 3. The resulting set of cuboids would require much less computation and storage than the full 60-dimensional data cube. However, there are two disadvantages of this approach. First, we would still need to compute C(60,3) + C(60,2) + 60 = 36,050 cuboids, each with many cells. Second, such a cube shell does not support high-dimensional OLAP because (1) it does not support OLAP on four or more dimensions and (2) it cannot even support drilling along three dimensions.
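The cuboid count stated above can be checked directly:

```python
from math import comb

# Sketch: size of a cube shell of depth 3 over 60 dimensions, i.e.,
# all cuboids with one, two, or three dimensions (the apex excluded).

n = 60
shell_cuboids = comb(n, 3) + comb(n, 2) + comb(n, 1)
print(shell_cuboids)
```

Compare this with the 2**60 cuboids of the full cube to see why even a "thin" shell is considered the economical option.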
Instead of computing a cube shell, we can compute only portions or fragments of it. This section discusses the shell fragment approach for OLAP query processing. It is based on the following key observation about OLAP in high-dimensional space. Although a data cube may contain many dimensions, most OLAP operations are performed on only a small number of dimensions at a time. In other words, an OLAP query is likely to ignore many dimensions (i.e., treating them as irrelevant), fix some dimensions (e.g., using query constants as instantiations), and leave only a few to be manipulated (for drilling, pivoting, etc.). This is because it is neither realistic nor fruitful for anyone to comprehend the changes of thousands of cells involving tens of dimensions simultaneously in a high-dimensional space.
ATTRIBUTE-ORIENTED INDUCTION FOR DATA CHARACTERIZATION
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989, a few years before the introduction of the data cube approach. The data cube approach can be considered as a data warehouse-based, pre-computation-oriented, materialized approach. It performs off-line aggregation before an OLAP or data mining query is submitted for processing. On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a relational database query-oriented, generalization-based, on-line data analysis technique. However, there is no inherent barrier distinguishing the two approaches based on online aggregation versus offline pre-computation.
MINING CLASS COMPARISON: DISCRIMINATING BETWEEN DIFFERENT CLASSES
Exercise
1.	Explain Class Comparison Methods and Implementation.
2.	Explain the Presentation of Class Comparison Descriptions.
3.	Discuss Class Description: Presentation of Both Characterization and Comparison.
4.	Explain Mining Descriptive Statistical Measures in Large Databases.
5. Describe efficient method for data cube computation with suitable example.
6. What is Cube materialization? Define Full cube, Iceberg cube, closed cube and Shell cube.
7. Describe general strategies for cube computation. Explain.
8. Describe attribute-oriented induction for data characterization.
9.	Define mining class comparison: Discriminating between different classes with suitable example.
10.	Describe Multi-Way Array Aggregation (Multi-Way) Method for Computing Full Cubes with suitable example.
11. Describe Advantages and disadvantages of Data Cube.
12. What are the application areas of data cube? Explain.
13.	Describe different data cube computation methods with suitable example.
14.	Describe Basic Principles of Attribute Oriented Induction.
15.	Describe Star-Cubing method with suitable example.
16.	Describe BUC method for computing iceberg cubes from the apex cuboid downward.
17. Describe High Dimension OLAP.
18. Describe Multidimensional Data Cube with suitable example.
19. Define Relational OLAP with suitable example.
20.	Describe the different approaches developed for array-based cube construction.
□□□