Data Warehouse and Data Mining - Unit 4

DATA CUBE TECHNOLOGY

CHAPTER OUTLINE

• Efficient methods for data cube computation
• Cube materialization (introduction to full cube, iceberg cube, closed cube, shell cube)
• General strategies for cube computation
• Attribute-oriented induction for data characterization
• Mining class comparison
• Discriminating between different classes

INTRODUCTION TO DATA CUBE TECHNOLOGY

A data cube allows data to be viewed in multiple dimensions. Dimensions are the entities with respect to which an organization wants to keep records. For example, in a store sales record, dimensions allow the store to keep track of things like monthly sales of items, and the branches and locations. A multidimensional database helps to provide data-related answers to complex business queries quickly and accurately. A data cube is a three-dimensional (3-D) or higher-dimensional set of values that is typically used to describe a time series of data. It is an abstraction of data for analyzing aggregated information from a number of points of view.

A data cube is a multidimensional architecture. The dimensions are aggregated as the measure attribute, while the remaining dimensions are known as the feature attributes. Data is viewed on the cube in a multidimensional way, and the aggregated and summarized facts can be displayed with variables or attributes. This is where OLAP plays a role: data warehouse systems provide OLAP tools for interactive analysis of multidimensional data at varied granularity levels. Data cubes are widely used for simple data analysis, representing quantities of interest to a company along its dimensions. Each cube dimension reflects some characteristic of the database, such as revenue per day, month, or year.

Although the data cube concept was originally intended for OLAP, it is also useful for data mining. Multidimensional data mining is an approach to data mining that integrates OLAP-based data analysis with knowledge discovery techniques. It is also known as exploratory multidimensional data mining and online analytical mining (OLAM). It searches for interesting patterns by exploring the data in multidimensional space. This gives users the freedom to dynamically focus on any subset of interesting dimensions. Users can interactively drill down or roll up to varying abstraction levels to find classification models, clusters, predictive rules, and outliers.

Example:
Table 4.1: 3-D view of sales data according to time, product and location

          Location="Pokhara"       Location="Kawasoti"      Location="Dhangadhi"     Location="Mahendranagar"
          Product                  Product                  Product                  Product
Time      TV    PC    AP  SSD      TV    PC    AP  SSD      TV   PC   AP  SSD        TV   PC    AP  SSD
Q1        854   882   89  623      1087  968   38  872      818  748  43  591        605  825   14  400
Q2        943   890   64  698      1130  1024  41  925      894  769  52  682        680  952   31  512
Q3        1032  924   59  789      1034  1048  45  1002     940  795  58  728        812  1023  30  501
Q4        1129  992   63  870      1142  1091  54  984      978  864  59  784        927  1038  38  580
Figure 4.1: 3-D data cube of sales data (measure displayed is rupees_sold, in thousands)
Data cubes are broadly grouped into two classifications, described below:
a) Multidimensional Data Cube: Most OLAP products are built on a structure in which the cube is patterned as a multidimensional array. Compared to other methods, these multidimensional OLAP (MOLAP) products typically provide better performance, primarily because they can index directly into the data cube structure to retrieve data subsets. The cube gets sparser as the number of dimensions grows, because many cells that represent unique attribute combinations store no aggregated data. This in turn raises the storage needs, which can at times exceed acceptable thresholds, rendering the MOLAP solution untenable for massive, multidimensional data sets. Compression strategies may help, but their use may damage MOLAP's natural indexing.
b) Relational OLAP (ROLAP): Relational OLAP uses the relational database architecture. Instead of a multidimensional array, the ROLAP data cube is stored as a series of relational tables (approximately twice as many as the number of dimensions). Each of these tables, referred to as a cuboid, denotes a particular perspective.

Advantages of Data Cube

Following are the major advantages of a data cube:
• Data cubes ease aggregating and summarizing the data.
• Data cubes provide better visualization of data.
• Data cubes store huge amounts of data in a very simplified way.
• Data cubes increase the overall efficiency of the data warehouse.
• The aggregated data in a data cube helps in analyzing the data fast, thereby reducing the access time.

EFFICIENT METHOD FOR DATA CUBE COMPUTATION

Data warehouses contain huge volumes of data. OLAP servers demand that decision support queries be answered in the order of seconds. Therefore, it is crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques. The pre-computation of all or part of a data cube can greatly reduce the response time and enhance the performance of OLAP. However, such computation is challenging because it may require substantial computational time and storage space. In this section we present an overview of methods for the efficient implementation of data warehouse systems.

The Compute Cube Operator

One approach to cube computation extends SQL to include a compute cube operator. The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation. The total number of cuboids (or group-bys) that can be computed for a data cube with n dimensions is 2^n. For example, if the 3 dimensions product, location and time are given, then the total number of cuboids is 2^3 = 8, and the possible group-bys are {(), (product), (location), (time), (product, location), (product, time), (location, time), (product, location, time)}.
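The 2^n group-bys described above can be enumerated mechanically; a minimal sketch, using the dimension names from the running example:

```python
from itertools import combinations

def group_bys(dimensions):
    """Enumerate all 2^n cuboids (group-bys) of a data cube, from the
    apex cuboid () up to the base cuboid containing every dimension."""
    cuboids = []
    for k in range(len(dimensions) + 1):
        cuboids.extend(combinations(dimensions, k))
    return cuboids

cube = group_bys(["product", "location", "time"])
print(len(cube))  # 2^3 = 8 cuboids, from () to (product, location, time)
```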
Figure 4.2: Lattice of cuboids making up a 3-D data cube with the dimensions A, B, and C, for some aggregate measure, M. The 0-D (apex) cuboid "all" sits at the top, followed by the 1-D cuboids A, B, C, the 2-D cuboids AB, AC, BC, and the 3-D (base) cuboid ABC.
Figure 4.2 shows a 3-D data cube for the dimensions A, B, and C, and an aggregate measure, M. A data cube is a lattice of cuboids. Each cuboid represents a group-by. ABC is the base cuboid, containing all three of the dimensions. Here, the aggregate measure, M, is computed for each possible combination of the three dimensions. The base cuboid is the least generalized of all of the cuboids in the data cube. The most generalized cuboid is the apex cuboid, commonly represented as all. It contains one value: it aggregates measure M over all of the tuples stored in the base cuboid. To drill down in the data cube, we move from the apex cuboid downward in the lattice. To roll up, we move from the base cuboid upward. For the purposes of our discussion in this chapter, we will always use the term data cube to refer to a lattice of cuboids rather than an individual cuboid.
Data Cube Materialization (Pre-computation)

There are three choices for data cube materialization, and they are as follows:
1. No materialization: Do not pre-compute any of the "non-base" cuboids. This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow.
2. Full materialization: Pre-compute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically requires huge amounts of memory space in order to store all of the pre-computed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids. Alternatively, we may compute a subset of the cube, which contains only cells that satisfy some user-specified criterion. It should consider three factors: (a) identify the subset of cuboids to materialize; (b) exploit the materialized cuboids during query processing; and (c) efficiently update the materialized cuboids during load and refresh.
A cell in the base cuboid is a base cell. A cell from a non-base cuboid is an aggregate cell. An aggregate cell aggregates over one or more dimensions, where each aggregated dimension is indicated by a "*" in the cell notation. Suppose we have an n-dimensional data cube. Let a = (a1, a2, ..., an, measures) be a cell from one of the cuboids making up the data cube. We say that a is an m-dimensional cell (i.e., from an m-dimensional cuboid) if exactly m (m <= n) values among {a1, a2, ..., an} are not "*". If m = n, then a is a base cell; otherwise, it is an aggregate cell (i.e., where m < n). For example, if we take a 3-D cube, all 1-D and 2-D cells are aggregate cells and the 3-D cells are base cells. The 1-D and 2-D cells are ancestors of a 3-D cell, whereas the 3-D cell is their descendant.
Different cube materializations include: full cube, iceberg cube, closed cube and shell cube.
1. Full cube: In order to ensure fast OLAP, it is sometimes desirable to pre-compute the full cube (i.e., all the cells of all of the cuboids for a given data cube). This, however, is exponential in the number of dimensions. That is, a data cube of n dimensions contains 2^n cuboids. There are even more cuboids if we consider concept hierarchies for each dimension. In addition, the size of each cuboid depends on the cardinality of its dimensions. Thus, pre-computation of the full cube can require huge and often excessive amounts of memory.
2. Iceberg cube: It contains only those cuboid cells whose measures satisfy the aggregate or iceberg condition. The aggregate condition could be, for example, minimum support or a lower bound on average, min or max. It is called an iceberg cube because it contains only some of the cells of the full cube, like the tip of an iceberg. The purpose of the iceberg cube is to identify and compute only those values that will most likely be required for decision support queries. The aggregate condition specifies which cube values are more meaningful and should therefore be stored. This is one solution to the problem of computing versus storing data cubes.
3. Closed cube: It consists of only closed cells. A cell that has no descendant cell with the same measure value is a closed cell.
4. Shell cube: Another strategy for partial materialization is to pre-compute only portions or fragments of the cube shell (cuboids involving a small number of dimensions, such as 3 to 5), based on the cuboids of interest. For example, the shell fragment cube ABC contains 7 cuboids: {A, B, C, AB, AC, BC, ABC}.
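The iceberg condition described in item 2 can be sketched for a count measure; this is a naive full-enumeration illustration (the tuple values and the min_sup threshold are made up), not an optimized algorithm:

```python
from itertools import combinations
from collections import Counter

def iceberg_cube(tuples, n_dims, min_sup):
    """Materialize only the cells, across every cuboid, whose count satisfies
    the iceberg condition count >= min_sup. Aggregated dimensions are marked
    '*', following the cell notation used above."""
    counts = Counter()
    for t in tuples:
        # each tuple contributes to one cell in each of the 2^n cuboids
        for kept in range(n_dims + 1):
            for dims in combinations(range(n_dims), kept):
                cell = tuple(t[i] if i in dims else "*" for i in range(n_dims))
                counts[cell] += 1
    return {cell: c for cell, c in counts.items() if c >= min_sup}

data = [("a1", "b1", "c1"), ("a1", "b1", "c2"),
        ("a1", "b2", "c1"), ("a2", "b1", "c1")]
cells = iceberg_cube(data, n_dims=3, min_sup=3)
# ('*','*','*') has count 4 and ('a1','*','*') count 3; sparser cells such as
# ('a1','b1','*') with count 2 fall below the threshold and are not stored.
```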

GENERAL STRATEGIES FOR CUBE COMPUTATION

With the different kinds of cubes described above, we can expect that there are a good number of methods for efficient computation. In general, there are two basic data structures used for storing cuboids. Relational tables are used as the basic data structure for the implementation of relational OLAP (ROLAP), while multidimensional arrays are used as the basic data structure in multidimensional OLAP (MOLAP). Although ROLAP and MOLAP may each explore different cube computation techniques, some optimization "tricks" can be shared among the different data representations. The following are general optimization techniques for the efficient computation of data cubes.
1. Sorting, hashing, and grouping: Sorting, hashing, and grouping operations should be applied to the dimension attributes in order to reorder and cluster related tuples. In cube computation, aggregation is performed on the tuples (or cells) that share the same set of dimension values. Thus, it is important to explore sorting, hashing, and grouping operations to access and group such data together to facilitate the computation of such aggregates. For example, to compute total sales by branch, day, and product, it is more efficient to sort tuples or cells by branch, then by day, and then group them according to the product name. Such implementations can be extended to data cube computation.
This technique can also be further extended to perform:
• Shared-sorts: Sharing sorting costs across multiple cuboids when sort-based methods are used.
• Shared-partitions: Sharing the partitioning cost across multiple cuboids when hash-based algorithms are used.
2. Simultaneous aggregation and caching of intermediate results: In cube computation, it is efficient to compute higher-level aggregates from previously computed lower-level aggregates, rather than from the base fact table. Moreover, simultaneous aggregation from cached intermediate computation results may lead to a reduction of expensive disk I/O operations.
For example, to compute sales by branch, we can use the intermediate results derived from the computation of a lower-level cuboid, such as sales by branch and day. This technique can be further extended to perform amortized scans, i.e., computing as many cuboids as possible at the same time to reduce disk reads.
3. Aggregation from the smallest child, when there exisbnultiple child cuboids: When there
exist mulbple child cuboids, it is usually more efficient to compute the desired parent (i.e.,
more generalized) cuboid from the smallest,previously computed child cuboid.
For example, to compute a sales cuboid, Chra11ch, when there exist two previously computed
cuboids,Cf!,much ~,, and Ctb,nnc/1, ilrm/, it is obviously more efficient to compute C bnmd, from the
former than from the latter if there are many more distinct items than distinct years.
4. The Apriori pruning method can be explored to compute iceberg cubes efficiently: The
Apriori property, in the context of data cubes, states as follows: If a given cell does not satisfy
minimum support, then no descendant (i.e., more specialized or detailed version) of the cell will satisfy
minimum support either. This property can be used to substantially reduce the computation of
iceberg cubes.
Recall that the specification of iceberg cubes contains an iceberg condition, which is a
constraint on the cells to be materialized. A common iceberg condition is that the cells must
satisfy a minimum support threshold such as a minimum count or sum. In this situation, the
Apriori property can be used to prune away the exploration of the cell's descendants.
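Optimization techniques 2 and 3 above share one idea: roll a parent cuboid up from a cached, previously computed child cuboid instead of rescanning the base fact table. A minimal sketch, in which the cuboid contents and branch names are hypothetical:

```python
from collections import defaultdict

def roll_up(child_cuboid, keep):
    """Aggregate a parent cuboid from a previously computed child cuboid.
    child_cuboid maps key tuples -> measure; `keep` lists the key positions
    that survive in the parent (the dropped positions are aggregated away)."""
    parent = defaultdict(float)
    for key, measure in child_cuboid.items():
        parent[tuple(key[i] for i in keep)] += measure
    return dict(parent)

# Hypothetical child cuboid C_{branch, year}: far fewer cells than the base
# fact table, so computing C_branch from it is much cheaper.
c_branch_year = {("B1", 2023): 120.0, ("B1", 2024): 80.0, ("B2", 2024): 50.0}
c_branch = roll_up(c_branch_year, keep=[0])
# C_branch: {('B1',): 200.0, ('B2',): 50.0}
```

Per technique 3, the same parent could be computed from C_{branch, item}; the smaller child (fewer distinct years than items) is the cheaper source.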

DATA CUBE COMPUTATION METHODS

Data cube computation is an essential task in data warehouse implementation. The pre-computation of all or part of a data cube can greatly reduce the response time and enhance the performance of OLAP. However, such computation is challenging because it may require substantial computation time and storage space. This section explores efficient methods for data cube computation, which are as follows:
1. The multi-way array aggregation (Multi-Way) method for computing full cubes.
2. A method known as BUC, which computes iceberg cubes from the apex cuboid downward.
3. The Star-Cubing method, which integrates top-down and bottom-up computation.
4. High-dimensional OLAP.

The Multi-Way Array Aggregation (Multi-Way) Method for Computing Full Cubes

It is an array-based "bottom-up" algorithm. The Multi-Way Array Aggregation (or simply Multi-Way) method computes a full data cube by using a multidimensional array as its basic data structure. It is a typical MOLAP approach that uses direct array addressing, where dimension values are accessed via the position or index of their corresponding array locations. Hence, Multi-Way cannot perform any value-based reordering as an optimization technique. A different approach is developed for the array-based cube construction, as follows:
• Partition the array into chunks: A chunk is a sub-cube that is small enough to fit into
the memory available for cube computation. Chunking is a method for dividing an n-
dimensional array into small n-dimensional chunks, where each chunk is stored as an
object on disk. The chunks are compressed so as to remove wasted space resulting from
empty array cells.
• Compute aggregates by visiting (i.e., accessing the values at) cube cells: The order in
which cells are visited can be optimized so as to minimize the number of times that
each cell must be revisited, thereby reducing memory access and storage costs.
Aggregation Strategy
1. Partition the array into chunks.
   • Chunk: a small sub-cube which fits in memory.
2. Data addressing:
   • Uses chunk id and offset.
3. Multi-way aggregation:
   • Computes aggregates in a multi-way fashion.
   • Visits chunks in an order chosen
     a. to minimize memory access, and
     b. to minimize memory space.
The best traversal order for multi-way aggregation:
• Method: the planes should be sorted and computed according to their size in ascending order.
• Idea: keep the smallest planes in main memory; fetch and compute only one chunk at a time for the largest plane.


Limitation of the method: it computes well only for a small number of dimensions. If there are a large number of dimensions, "bottom-up computation" and iceberg cube computation methods can be explored instead.

Example: Consider a 3-D data array containing the three dimensions A, B, and C. The 3-D array is partitioned into small chunks; here, the array is partitioned into 64 chunks as shown in Figure 4.3. Dimension A is organized into four equal-sized partitions a0, a1, a2, a3. Dimensions B and C are similarly organized into four partitions each. Chunks 1, 2, ..., 64 correspond to the sub-cubes a0b0c0, a1b0c0, ..., a3b3c3, respectively.
Figure 4.3: A 3-D array for the dimensions A, B, C, organized into 64 chunks. (The A-B plane is shown at the front; chunk numbers 1 to 64 label the sub-cubes.)

Suppose that the cardinality of A, B, and C is 40, 400, and 4000, respectively. Thus, the size of each partition in A, B, and C is 10, 100, and 1000, respectively. Full materialization of the corresponding data cube involves the computation of all of the cuboids defining this cube, i.e., {apex, A, B, C, AB, AC, BC, ABC}.

Let's look at how the multiway array aggregation technique is used in this computation. There are many possible orderings with which chunks can be read into memory for use in cube computation. Consider the ordering labelled 1 to 64, shown in Figure 4.3. Suppose we would like to compute the b0c0 chunk of the BC cuboid. We allocate space for this chunk in chunk memory. By scanning chunks 1 to 4 of ABC, the b0c0 chunk is computed. That is, the cells for b0c0 are aggregated over a0 to a3. The chunk memory can then be assigned to the next chunk, b1c0, which completes its aggregation after the scanning of the next four chunks of ABC: 5 to 8. Continuing in this way, the entire BC cuboid can be computed. Therefore, only one chunk of BC needs to be in memory at a time, for the computation of all of the chunks of BC. In computing the BC cuboid, we will have scanned each of the 64 chunks.
To avoid rescanning all of these chunks for the computation of other cuboids, such as AC and AB, the multiway aggregation or simultaneous aggregation idea comes in. For example, when chunk 1 (i.e., a0b0c0) is being scanned (say, for the computation of the 2-D chunk b0c0 of BC, as described above), all of the other 2-D chunks relating to a0b0c0 can be simultaneously computed. That is, when a0b0c0 is being scanned, each of the three chunks, b0c0, a0c0, and a0b0, on the three 2-D aggregation planes, BC, AC and AB, should be computed then as well. In other words, multiway computation simultaneously aggregates to each of the 2-D planes while a 3-D chunk is in memory.

Now let's look at how different orderings of chunk scanning and of cuboid computation can affect the overall data cube computation efficiency. The possible orderings are:
a) Ordering 1, 2, 3, 4, ..., 64: By scanning chunks 1 to 4 of ABC, b0c0 is computed (aggregating over a0 to a3). Multiway aggregation is used to avoid rescanning particular chunks. To avoid bringing a 3-D chunk into memory more than once, the minimum memory requirement for holding all relevant 2-D planes is:
   • whole AB plane + one row of the AC plane + one chunk of the BC plane
   • 40 x 400 + 40 x 1000 + 100 x 1000 = 156,000 memory units.
b) Ordering 1, 17, 33, 49, 5, 21, 37, 53, ...: The minimum memory requirement for holding all relevant 2-D planes is:
   • whole BC plane + one row of the AC plane + one chunk of the AB plane
   • 400 x 4000 + 10 x 4000 + 10 x 100 = 1,641,000 memory units.
Hence, the most efficient ordering is 1, 2, ..., 64, which requires only 156,000 memory units.
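The two memory requirements above can be checked with a few lines of arithmetic, using the cardinalities and chunk sizes of the running example:

```python
# Dimension cardinalities and per-dimension chunk sizes from the example.
A, B, C = 40, 400, 4000
cA, cB, cC = 10, 100, 1000

# Ordering 1, 2, ..., 64: whole AB plane + one row of AC + one chunk of BC.
mem_row_order = A * B + A * cC + cB * cC
# Ordering 1, 17, 33, 49, ...: whole BC plane + one row of AC + one chunk of AB.
mem_col_order = B * C + cA * C + cA * cB

print(mem_row_order, mem_col_order)  # 156000 1641000
```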

In the cuboid lattice, drill-down means moving from a highly aggregated cell to lower, more detailed cells, while roll-up means moving from detailed, low-level cells to higher-level, more aggregated cells.

BUC is an algorithm for sparse and iceberg cube computation. Unlike Multi-Way, BUC constructs the cube from the apex cuboid toward the base cuboid, as shown in Figure 4.5. Partitioning and aggregation are the main costs in BUC's cube computation. Since the recursive analysis in BUC does not reduce the input size, both dividing and aggregation are expensive. BUC prunes unnecessary computation by recurring to the Apriori pruning strategy: if a given cell does not satisfy minimum support, then neither can any of its descendants. The iceberg cube problem is to compute all group-bys that satisfy an iceberg condition.

Figure 4.5: BUC's exploration for the computation of a 3-D data cube. Note that the computation starts from the apex cuboid.

BUC Algorithm Explanation:

• Initially, the algorithm takes a relation (set of tuples) as input, aggregates the entire input, and writes the resulting total.
• Divide dimensions into partitions: for each dimension, the input is partitioned on that dimension, and the total number of tuples is counted for each distinct value of the dimension. Each distinct value of the dimension forms its own partition.
• Test the partition for minimum support to perform iceberg pruning:
  o Apriori property: if a cell does not satisfy the minimum support, then neither can any of its descendants.
  o Iceberg pruning: if the number of tuples in the partition does not satisfy (i.e., <) the minimum support, then its descendants can be pruned.
  o If the number of tuples in the partition satisfies (i.e., >=) the minimum support, then the partition descends one level deeper into the lattice, and so on. On return, continue with the next partition of the dimension.
• After all the partitions have been processed, the entire process is repeated for each of the remaining dimensions.
Example: Let's look at how BUC constructs the iceberg cube for the dimensions A, B, and C, having 3 as the minimum support count.

Suppose that dimension A has four distinct values, a1, a2, a3, a4; B has four distinct values, b1, b2, b3, b4; and C has two distinct values, c1, c2. To construct the iceberg cube, we must compute every combination of the grouping attributes that satisfies minimum support (i.e., that has at least 3 tuples). To do so, BUC scans the input, aggregating the tuples to obtain a count for all, corresponding to the cell (*, *, *).
• Starting with dimension A value a1, the a1 partition is aggregated, creating one tuple for the A group-by, corresponding to the cell (a1, *, *).
• Suppose (a1, *, *) satisfies the minimum support, in which case a recursive call is made on the partition for a1.
• Suppose the cell count for (a1, b1, *) is 2, which does not satisfy the minimum support; so, BUC prunes any further exploration of (a1, b1, *). That is, it avoids partitioning this cell on dimension C.
• It backtracks to the a1 partition and recurses on (a1, b2, *), and so on.
Figure 4.6: BUC partitioning snapshot given an example 3-D data set. (The input is partitioned on dimension A into a1, a2, a3, a4; the a1 partition is further partitioned on B into b1, b2, b3, b4, and the (a1, b1) partition on C into c1, c2.)
The performance of BUC is sensitive to the order of the dimensions and to skew in the data. Ideally, the most discriminating dimensions should be processed first; that is, dimensions should be processed in order of decreasing cardinality. The higher the cardinality is, the smaller the partitions are, and thus the more partitions there will be, thereby providing BUC with greater opportunity for pruning. Similarly, the more uniform a dimension is (i.e., the less skew it has), the better it is for pruning.
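The partition-and-prune behaviour described above can be sketched as a recursive function. This is a simplified illustration of the BUC idea with a count measure, assuming in-memory tuples, not the full published algorithm:

```python
def buc(tuples, dims, min_sup, prefix=None, out=None):
    """Minimal BUC sketch: recursively partition on each remaining dimension,
    emitting a cell only when its partition meets min_sup. The Apriori check
    stops recursion below any cell that fails minimum support."""
    if out is None:
        out, prefix = {}, {}
    if len(tuples) < min_sup:
        return out                      # iceberg pruning: no descendant can pass
    # emit the current cell; unfixed dimensions are implicitly '*'
    out[tuple(sorted(prefix.items()))] = len(tuples)
    for i, d in enumerate(dims):
        partitions = {}
        for t in tuples:                # one partition per distinct value of d
            partitions.setdefault(t[d], []).append(t)
        for value, part in partitions.items():
            buc(part, dims[i + 1:], min_sup, {**prefix, d: value}, out)
    return out

data = [("a1", "b1"), ("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
cells = buc(data, dims=[0, 1], min_sup=2)
# Kept cells: (), (a1,*), (a1,b1), (*,b1); (a2,*) and (a1,b2) are pruned.
```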

The Star-Cubing method, which integrates top-down and bottom-up computation

Star-Cubing integrates the top-down and bottom-up methods and explores shared dimensions. For example, dimension A is the shared dimension of AC and C; the notation AC/C means that cuboid AC has shared dimension C. Star-Cubing allows for shared computations: e.g., cuboid C is computed simultaneously with AC. Star-Cubing aggregates in a top-down manner, but with a bottom-up sub-layer underneath which allows Apriori pruning. Its shared dimensions grow in bottom-up fashion, as shown in Figure 4.7.
Figure 4.7: Star-Cubing: bottom-up computation with top-down expansion of shared dimensions.

High-Dimensional OLAP

Data cubes facilitate fast OLAP in a multidimensional data space. However, a full data cube of high dimensionality needs massive storage space and unrealistic computation time. Iceberg cubes provide a more feasible alternative, as we have seen, wherein the iceberg condition is used to specify the computation of only a subset of the full cube's cells. However, although an iceberg cube is smaller and requires less computation time than its corresponding full cube, it is not an ultimate solution.

First, the computation and storage of the iceberg cube can still be costly. For example, if the base cuboid cell (a1, a2, ..., a60) passes minimum support, it will generate 2^60 iceberg cube cells. Second, it is difficult to determine an appropriate iceberg threshold. Setting the threshold too low will result in a huge cube, whereas setting the threshold too high may invalidate many useful applications. Third, an iceberg cube cannot be incrementally updated. Once an aggregate cell falls below the iceberg threshold and is pruned, its measure value is lost. Any incremental update would require recomputing the cells from scratch. This is extremely undesirable for large real-life applications where incremental appending of new data is the norm.

One possible solution, which has been implemented in some commercial data warehouse systems, is to compute a thin cube shell. For example, we could compute all cuboids with three dimensions or less in a 60-dimensional data cube, resulting in a cube shell of size 3. The resulting set of cuboids would require much less computation and storage than the full 60-dimensional data cube. However, there are two disadvantages of this approach. First, we would still need to compute C(60,3) + C(60,2) + 60 = 34,220 + 1,770 + 60 = 36,050 cuboids, each with many cells. Second, such a cube shell does not support high-dimensional OLAP because (1) it does not support OLAP on four or more dimensions and (2) it cannot even support drilling along three dimensions.

Instead of computing a cube shell, we can compute only portions or fragments of it. This section discusses the shell fragment approach for OLAP query processing. It is based on the following key observation about OLAP in high-dimensional space. Although a data cube may contain many dimensions, most OLAP operations are performed on only a small number of dimensions at a time. In other words, an OLAP query is likely to ignore many dimensions (i.e., treating them as irrelevant), fix some dimensions (e.g., using query constants as instantiations), and leave only a few to be manipulated (for drilling, pivoting, etc.). This is because it is neither realistic nor fruitful for anyone to comprehend the changes of thousands of cells involving tens of dimensions simultaneously in a high-dimensional space.
Data Cube Technology O CHAPTER 4 I 91
ATTRIBUTE-ORIENTED INDUCTION FOR DATA CHARACTERIZATION

The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989, a few years before the introduction of the data cube approach. The data cube approach can be considered as a data warehouse-based, pre-computational, materialized approach. It performs off-line aggregation before an OLAP or data mining query is submitted for processing. On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a relational database query-oriented, generalization-based, on-line data analysis technique. However, there is no inherent barrier distinguishing the two approaches based on online aggregation versus offline pre-computation.

Basic Principles of Attribute-Oriented Induction

A set of basic principles for attribute-oriented induction in relational databases is summarized as follows:
1. Data focusing: Analyze task-relevant data, including dimensions; the result is the initial relation.
2. Attribute removal: Remove attribute A if there is a large set of distinct values for A but either
   a. there is no generalization operator on A, or
   b. A's higher-level concepts are expressed in terms of other attributes.
3. Attribute generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
4. Attribute-threshold control: typically 2-8; specified or default.
5. Generalized relation threshold control: controls the final relation/rule size.
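The attribute-removal and attribute-generalization principles above can be sketched as follows. The attribute names, the toy concept hierarchy, and the threshold value are all hypothetical, and the hierarchy is climbed only one level for simplicity:

```python
from collections import Counter

def generalize(rows, hierarchies, threshold=8):
    """AOI sketch: for each attribute, keep it as-is if its number of distinct
    values is within the threshold; otherwise generalize it via its concept
    hierarchy, or remove it when no generalization operator exists. Identical
    generalized tuples are then merged with a count (the prime relation)."""
    kept = []
    for a in rows[0]:
        values = {r[a] for r in rows}
        if len(values) <= threshold:
            kept.append((a, lambda v: v))          # attribute-threshold control
        elif hierarchies.get(a):
            kept.append((a, hierarchies[a]))       # attribute generalization
        # else: attribute removal (distinct, no generalization operator)
    return Counter(tuple((a, f(r[a])) for a, f in kept) for r in rows)

rows = [{"name": "Ana", "age": 21}, {"name": "Bo", "age": 23},
        {"name": "Cy", "age": 47}]
hier = {"age": lambda v: "20-29" if v < 30 else "40-49", "name": None}
prime = generalize(rows, hier, threshold=2)
# name is removed (too distinct, no hierarchy); age is generalized to ranges:
# {(('age', '20-29'),): 2, (('age', '40-49'),): 1}
```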

MINING CLASS COMPARISON: DISCRIMINATING BETWEEN DIFFERENT CLASSES

In many applications, users may not be interested in having a single class described or characterized, but rather would prefer to mine a description that compares or distinguishes one class from other comparable classes. Class discrimination or comparison mines descriptions that distinguish a target class from its contrasting classes. Note that the target and contrasting classes must be comparable in the sense that they share similar dimensions and attributes. For example, the three classes person, address and item are not comparable. The procedure for class comparison is performed as follows:
1. Data collection: The set of associated data from the databases and data warehouses is collected by query processing and is partitioned into the target class and the contrasting class(es).
2. Dimension relevance analysis: When many dimensions are to be processed and analytical comparison is required, dimension relevance analysis should be performed on these classes, and only the highly relevant dimensions are included in the further analysis.

3. Synchronous generalization: The process of generalization is performed on the target class to the level controlled by the user- or expert-specified dimension threshold, which results in a prime target class relation/cuboid. The concepts in the contrasting class or classes are generalized to the same level as those in the prime target class relation/cuboid, forming the prime contrasting class relation/cuboid.
4. Presentation of the derived comparison: The resulting class comparison description can be visualized in the form of tables, charts, and rules.
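The procedure above can be illustrated with a toy sketch that takes an already-partitioned target and contrasting class and contrasts the distribution of one shared attribute; the class contents and the gpa attribute are hypothetical:

```python
from collections import Counter

def class_comparison(target, contrasting, attr):
    """Class comparison sketch: after both classes have been generalized to
    the same level, tabulate (target count, contrasting count) per value of a
    shared attribute, so the discriminating values stand out."""
    t = Counter(r[attr] for r in target)
    c = Counter(r[attr] for r in contrasting)
    return {v: (t.get(v, 0), c.get(v, 0)) for v in set(t) | set(c)}

grads = [{"gpa": "high"}, {"gpa": "high"}, {"gpa": "low"}]
undergrads = [{"gpa": "high"}, {"gpa": "low"}, {"gpa": "low"}]
table = class_comparison(grads, undergrads, "gpa")
# {'high': (2, 1), 'low': (1, 2)} -- a table like this could then be
# presented as counts, percentages, or discriminant rules (step 4).
```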

Exercises
1. Explain class comparison methods and implementation.
2. Explain the presentation of class comparison descriptions.
3. Discuss class description: presentation of both characterization and comparison.
4. Explain mining descriptive statistical measures in large databases.
5. Describe an efficient method for data cube computation with a suitable example.
6. What is cube materialization? Define full cube, iceberg cube, closed cube and shell cube.
7. Describe general strategies for cube computation. Explain.
8. Describe attribute-oriented induction for data characterization.
9. Define mining class comparison: discriminating between different classes, with a suitable example.
10. Describe the Multi-Way Array Aggregation (Multi-Way) method for computing full cubes with a suitable example.
11. Describe the advantages and disadvantages of a data cube.
12. What are the application areas of a data cube? Explain.
13. Describe different data cube computation methods with suitable examples.
14. Describe the basic principles of attribute-oriented induction.
15. Describe the Star-Cubing method with a suitable example.
16. Describe the BUC method for computing iceberg cubes from the apex cuboid downward.
17. Describe high-dimensional OLAP.
18. Describe the multidimensional data cube with a suitable example.
19. Define relational OLAP with a suitable example.
20. Describe the different approaches developed for array-based cube construction.
