0% found this document useful (0 votes)
3 views11 pages

MAC - Conceptual Data Modeling for OLAP

Uploaded by

ashupratster2
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views11 pages

MAC - Conceptual Data Modeling for OLAP

Uploaded by

ashupratster2
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

MAC: Conceptual Data Modeling for OLAP

Aris Tsois Nikos Karayannidis


Knowledge and Database Knowledge and Database
Systems Laboratory, NTUA Systems Laboratory, NTUA
[email protected] [email protected]

Timos Sellis
Knowledge and Database Systems Laboratory
Department of Electrical and Computer Engineering
National Technical University of Athens (NTUA)
Zografou 15773, Athens, Greece
[email protected]

The typical data flow path involves the gathering of data


from various sources into data warehouse systems and
Abstract then the usage of those data in the multidimensional
analysis process through the use of OLAP applications.
In this paper we address the issue of conceptual modeling Multidimensional analysis mainly involves the
of data used in multidimensional analysis. We view the computation of aggregated information using a large
problem from the end-user point of view and we describe volume of detailed data. The information is analysed
a set of requirements for the conceptual modeling of real- based on its detailed or derived properties (dimensions)
world OLAP scenarios. Based on those requirements we using an almost static business model (hierarchies). The
then define a new conceptual model that intends to reader is referred to [ChDa97] [Inmo96] [Kimb97]
capture the static properties of the involved information. [Olap97] for an overview of Data Warehousing and
In its definition we use a minimal set of well-understood OLAP. In the following we will assume the reader to be
OLAP concepts like dimensions, levels, hierarchies, familiar with the terminology used in those areas.
measures and cubes. The central concept of the model is
the Multidimensional Aggregation Cube (MAC), which A fundamental issue faced by vendors of OLAP
gives a broad and flexible definition to the notion of a applications as well as by researchers in the OLAP
multidimensional cube. We evaluate our model against domain is the modeling of data. The well-studied
other existing multidimensional models and show that conceptual and logical models used in other database
MAC offers a unique combination of modeling skills. Our areas, like the E/R model or the relational model, do not
main contribution is the definition of the basic concepts of seam to be sufficient for the OLAP case
our model; although the set of requirements and the ([Kimb96][TBC99][S++98][Kimb97]). Vendors have
evaluation of all related models against those adopted various models, while standardization bodies and
requirements represent an additional result. researchers have developed and studied additional
models. All those models share some common concepts
like measures or hierarchies but there is still no formally
1 Introduction defined and widely accepted (logical or conceptual) data
model. As proved by the history of the relational model a
In the last years On-Line Analytical Processing (OLAP) common data model is the key for the collaboration and
[Codd93] has become a major research area in the the rapid progress in an area.
database community [ChDa97]. The OLAP research is
tightly coupled with the research in data warehouses, In this paper we address the problem of modeling real-
which are considered to be the information sources based world OLAP scenarios at the conceptual level. The
on which On-Line Analytical Processing is performed. current common practice is to use the well-known E/R
model [BCN92] and then to annotate the schema with
The copyright of this paper belongs to the paper’s authors. Permission to copy any additional OLAP specific information. Still, various
without fee all or part of this material is granted provided that the copies are not authors argue that the E/R model is not appropriate for
made or distributed for direct commercial advantage. OLAP scenarios since concepts like dimensions,
Proceedings of the International Workshop on Design and hierarchies and cubes can only be partially represented.
Management of Data Warehouses (DMDW'2001) As a result, two publications ([TBC99] and [S++98])
Interlaken, Switzerland, June 4, 2001 already proposed extensions to the E/R model for the
(D. Theodoratos, J. Hammer, M. Jeusfeld, M. Staudt, eds.) multidimensional paradigm. Their approach is mainly
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-39/
suitable for the ODS (operational data store) part of data

A. Tsois, N. Karayannidis, T. Sellis 5-1


warehouses as they concentrate on the representation of into the data warehouse from the OLTP system. For each
the source-detailed data. sales transaction the OLTP system records the following
information:
In this paper we consider a slightly different approach, • The date of the transaction.
where the information used in multidimensional analysis • The cashier ID where the transaction took place.
is the primary target of our modeling concepts. The • The ID of the products being sold.
information used in such an analysis process is mainly • The customer ID.
aggregated data at various aggregation levels, or • The sales price for each product being sold.
combination of such levels. Furthermore, the dimensions,
the particular aggregation levels as well as the various We assume that all the above information is somehow
hierarchies defined on dimensions represent information stored in the data warehouse. We are not going to talk
used during the analysis. about the design process of the data warehouse, neither
about how data is loaded from the OLTP system since our
In order to define a useful conceptual model we first model does not address those issues. The MAC model,
investigate a set of example queries and derive a list of which we propose, is mainly suitable for the users of the
modeling requirements. Based on those requirements we data warehouse, the persons that analyze the information
then define the concepts of our model and their semantics. through the use of an OLAP application.
The central concept of our model is the Multidimensional
Aggregation Cube (MAC), which is equivalent to an n- As described by a plethora of OLAP papers [Mendel], the
way relationship relating measure values to a set of multidimensional analysis is mainly based on drill-down,
dimension values. A careful definition of dimension roll-up, slice and dice operations that are performed on a
values allows a single MAC to represent measure values multidimensional view of data. Measures values are
of arbitrary aggregation levels. This is an essential selected and aggregated using various predefined
difference with respect to the other conceptual models and dimensions, dimension levels and hierarchies. The
can be used to simplify the schema of the various OLAP dimension levels, the aggregation paths defined by
scenarios. An additional novelty of our model is the hierarchies, the dimensions and the measures are the main
explicit modeling of analysis paths, a feature quite concepts used in the analysis. For our example scenario
important for OLAP applications. assume that the analysis is performed on the sales price
(price of sold items) using the hierarchies defined in
Generally speaking, the concepts used in Figure 1.
multidimensional analysis are mapped directly to
corresponding concepts of the MAC model. As a result, $QDO\VLV 3DWKV 'LPHQVLRQV

the proposed MAC model allows OLAP scenarios to be 5HVLGHQFH 5HVLGHQFH 5HVLGHQFH
3 &XVWRPHU
modeled in a natural and straightforward way. $UHD &LW\ 5HJLRQ

Furthermore, its abilities to model complex dimensions &OLHQW

and hierarchies and the broad definition of cubes makes it 3 &XVWRPHU 3URIHVVLRQ

suitable for highly complicated OLAP applications.


3 'DWH 0RQWK <HDU 7LPH
The remainder of this paper is structured as follows:
section 2 provides a set of OLAP specific modeling 3 &DVKLHU 6WRUH 6WRUH $UHD 6WRUH &LW\ 6WRUH
5HJLRQ
requirements defined through examples. Section 3 defines /RFDWLRQ
the basic concepts of the proposed MAC model as well as 3 6WRUH :DUHKRXVH :DUHKRXVH :DUHKRXVH :DUHKRXVH
$UHD &LW\ 5HJLRQ
their semantics. Section 4 provides an overview of related
work and describes the results of evaluating 12 3URGXFW 3URGXFW
3 3URGXFW *URXS &ODVV
multidimensional data models published in research
papers. Finally, section 5 concludes the paper and
3 3URGXFW $GYHUWLVLQJ ,WHP
presents our future work intensions. PHWKRG

2 Requirements Through An Example 3 3URGXFW %UDQG

In this section we will present a set of requirements that


we believe to be of key importance for a conceptual Figure 1: The analysis paths
model used in multidimensional analysis and OLAP
applications. We will present those requirements through The paths shown in Figure 1 are grouped into four distinct
the use of an example scenario. The scenario is based on a dimensions. The most detailed level of each dimension
real Data Warehousing / OLAP project in the corresponds to a basic property of a product’s sales price
development of which we have been involved. as recorded by the transactions of the OLTP system. For
example the customer ID is used for the Customer level
Assume the following example: A chain of stores selling of the Client dimension.
electrical home appliances has built a data warehouse in
order to analyze its sales data. The sales data are loaded

A. Tsois, N. Karayannidis, T. Sellis 5-2


Each path is constructed out of two or more levels and the since a product can be simultaneously advertised by
grouping/classification relationships that link those levels. various methods (newspapers, TV, radio, etc.). Also,
The paths represent sequences of valid roll-up and drill- some members of a level may have no links to the more
down operations that can be performed during the detailed level. For example we may have a member of the
analysis. For example the products can be grouped by Store Area level that is not linked to any members of the
brand using the grouping/classification relationship Store level. This would mean that at the given time there
defined by the path P8. This relationship links each is no store in that area or that the system does not know or
product in the Product level to some brand in the Brand does not want to show which stores belong to this area.
level.
The above examples show that a natural model of
The designer or the OLAP application defines the schema grouping/classification relationships might involve n-way
of the dimensions and hierarchies mostly at design time relationships among levels. Also those relationships might
but an ad-hoc query might need to define its own analysis not reference all members of the involved levels.
path. The levels, the grouping/classification relationships
and the paths are defined based on the needs of the Let us now consider four queries that the analysts could
analysis process. ask. The first query is:

For example, one possible analysis path is to drill-down Q1: Give me the sum of sales for the year 2000 per
from Warehouse Area to Store level. This drill-down Month, Product Group, Product Class, Store City and
operation could reveal the stores supplied by warehouses Store Region.
of a particular area. The designer of the OLAP application
has defined this path (P5) but has decided to leave Note that the above question requires aggregation on
Cashier level out of this path. This decision can mean that several levels of the same path. For example it requires
when doing roll-up and drill-down operations on this path the sum of sales for each product group and also the sum
it is not meaningful to drill-down to the Cashier level. of sales for each product class. The query result
represented in a grid fashion could look like Figure 2.
From the above discussion one can realize that a
conceptual data model suitable for multidimensional -DQXDU\ )HE

analysis should provide means to define: &LW\


6WRUH 5HJLRQ

&LW\ &LW\
6WRUH 5HJLRQ

&LW\

1. dimension levels,
*URXS *URXS

2. grouping/classification relationships (that link those


3URGXFW &ODVV

levels) and

3. analysis paths.

Those first-class concepts of the multidimensional


*URXS *URXS

analysis must have an appropriate and straightforward


3URGXFW &ODVV

representation within a conceptual model. Note that the

dimension levels are in fact attributes that can


characterize the measure being analyzed and the analysis

paths are valid sequences of drill-down/roll-up operations.

According to the first path (P1) of our example scenario, Figure 2: Example result for query Q1
for each customer we have the residence address in the
form of area, city, and region. Assume now that for some Now consider the case where the analyst uses two paths
customers the city attribute is not set (perhaps because of the same dimension for the classification and grouping
their residence area does not belong to any city). In this of sales data.
case the residence area of the customer will be linked by
the grouping/classification relationship directly to a Q2: Give me the sum of sales for year 2000 per
residence region and not to a residence city as it happens customer Profession and Residence Region.
with most of the other residence areas. Generally
speaking a grouping/classification relationship may The two levels used for grouping, Profession and
involve more than two levels since a members of a level Residence Region are in fact independent although they
may drill-down to members of different levels. both belong to the same dimension. So, grouping on both
of them at the same time defines a two-dimensional space.
There are also some other aspects of the If the two levels where related, like Store City and Store
grouping/classification relationship that we found to be Region are in the previous query, we would have a one-
important. In some cases a member of a level is linked to dimensional space.
more than one members of the next level. For example a
member of the Product level will probably be linked to The next query shows the necessity of supporting multiple
more than one members of the Advertising Method level measures defined over the same set of dimensions.

A. Tsois, N. Karayannidis, T. Sellis 5-3


Q3: Give me the sum of sales, the maximum sale provide a highly expressive and intuitive modeling
value and the number of sales per Store and Month methodology for the information used in
for the year 2000. multidimensional analysis.

In most of the cases some of the measures will be The MAC model uses concepts that are close to the way
functionally dependent on other more primitive measures. OLAP users perceive the information. The model tries to
Generally speaking, a set of primitive and derived be expressive providing the means to model complicated
measures can simultaneously be required for an analysis real-world scenarios while using a minimal set of
process. Finally, we present a query with a somehow concepts that remain as simple as possible. The MAC
more complicated selection condition: model describes data as dimension levels, drilling
relationships, dimension paths, dimensions, cubes and
Q4: Give me the sum of sales per Month, Store Area attributes.
and Brand selecting only the store areas that have
increased their total sales for the year 2000 by more Dimension levels represent classes of dimension members.
than 10 percent from the previous year. Each dimension member represents some instance of a
real-world property that an OLAP measure may have.
This query requires calculation of the total sales value per Distinct dimension levels can be related by means of a
store area for the years 2000 and 1999 and then the drilling relationship. A drilling relationship indicates that
selection of the areas whose sum of sales value for the there is a semantic relationship among the involved levels
year 2000 greater than 110% of their sum of sales value and describes how the dimension members of the children
for the year 1999. For those store areas the query requires levels can be grouped into sets that correspond to
the sum of sales calculated by Month and Brand. This is a dimension members of the parent level.
typical query that performs selection based on aggregated
data at a different level than the data required for its A set of drilling relationships can form a dimension path
output. if several structural requirements are met. A dimension
path defines a meaningful composition of drilling
Our experience shows that all the above queries are relationships and is used to model a valid sequence of
common OLAP queries. We believe that a conceptual abstraction operations (drill-down/roll-up). One or more
model suitable for multidimensional analysis should dimension paths that share common levels can form a
accommodate queries like the above. This means that the dimension.
structure of the query result as well as the structure of any
other information involved in the query definition must be Finally, we define multidimensional aggregation cubes
easily represented by the concepts of the model. The (MACs) as a relationship among the domains of one or
above requirement comes as a result of our intention of more dimensions. A MAC can have one or more
having a conceptual model that can efficiently represent measures. Each one of those can be considered as a
all kinds of information handled by OLAP applications simple and atomic attribute of the relationship represented
and not only the raw (source - detailed) data. Since we are by the MAC. An instance of a MAC is called a MAC cell
only talking about a conceptual data model we do not or simply a cell. We now give the complete definition of
require the model to represent the functional aspects the above terms and provide examples on how they are
(operators and functions) of the queries but we limit our used. In the following we will use the simple term cube to
requirements to the static data involved in those refer to a multidimensional aggregation cube
computations.
3.1 Dimension Levels
Based on our example queries we derive the following
requirements: a good conceptual model should be able to A dimension level is a set of dimension members. The
define aggregations on arbitrary combination of levels (of dimension members are the most detailed modeling
different paths) even if those levels belong to the same concepts of our model and represent instances of real-
dimension as well as aggregation on a set of levels world properties that OLAP measures may have. In our
belonging to a particular path. Furthermore, the model example scenario, the sale price is one measure of our
should allow multiple measures to be defined for a given multidimensional analysis. A property of this measure is
set of dimensions and in some cases represent them in one the location where the sale took place – the cashier where
concept, reflecting the fact that those measures are the sale was recorded. For our example scenario we
semantically linked. would define the dimension level Cashier in order to
represent the cashiers where the sales transactions take
place and each particular cashier of each store would be
3 The Multidimensional Aggregation Cube modeled as a dimension member of this level.
(MAC) Data Model
In this section we present the Multidimensional Dimension levels can have one or more attributes. A
Aggregation Cube data model. MAC is a user-centric subset of those attributes always form a key for the
conceptual data model that attempts to cover the dimension level. In most of the cases a single attribute
requirements described in the previous section in order to acts as the key but multi-attribute keys can also exist. This

A. Tsois, N. Karayannidis, T. Sellis 5-4


is due to the set semantics of dimension levels. The set level(s). It can be proved that this requirement does not
semantics guaranties that each dimension member restrict the modeling power but can only affect the way
represents a uniquely identifiable, within the level, dimension members are grouped into dimension levels.
property instance.
Unlike dimension levels the drilling relationships cannot
Our example scenario requires several dimension levels to have attributes. This is due to the simple decomposition
be defined. In fact all levels shown in Figure 1 will be semantics of the drilling relationships. Drilling
modeled as dimension levels of the MAC model. The relationships should not be used to represent general
attributes of each level depend on the available data and relationships among dimension members and so attributes
the analysis requirements. For example the Residence are not allowed.
City level could have the attributes ID, Name and
Population where the attribute ID is the key of the level, In our example scenario we would have a simple drilling
Name is the real world name of the city and Population relationship linking the level Cashier to the level Store.
stores an estimation for the population of the city. The Store level would be the parent level of the
relationship and the level Cashier would be the only child
3.2 Drilling Relationships level. Each member of the Store level would be related to
one or more members of the Cashier level indicating the
A drilling relationship is a special kind of an n-way cashiers of each store.
relationship that relates one dimension level (called the
parent of the relationship) to N-1 dimension levels (called A more complicated example is the relationship among
the children of the relationship). Formally speaking, a the levels Residence Region, Residence City and
drilling relationship is an n-way relationship, which can Residence Area. The level Residence Region contains as
relate a member of the parent level to one or more members all the geographical regions of interest. The
members of the children levels. This means that an level Residence City contains all the cities of those
instance of a drilling relationship cannot relate only regions and the level Residence Area contains individual
members of the children levels without linking them to geographical areas in which cities and regions can be
exactly one member of the parent level1. Furthermore, the decomposed. Obviously the cities do not cover all the
parent dimension level of a drilling relationship cannot act areas of the regions so some areas can only be linked
as a child level of this relationship. The reasons for directly to regions. In order to represent the above
imposing the above restrictions will be explained next situation we define two drilling relationships. The first
after defining the semantics of a drilling relationship. one has the Residence Region as the parent level and the
Residence City and Residence Area as child levels. This
A drilling relationship is used to represent the way in relationship links to each region the appropriate cities and
which a member of one level can be decomposed into areas. The second drilling relationship has Residence City
members of some other level. We will explain the as parent level and Residence Area as child level and
semantics of the drilling relationship with a very simple represents the decomposition of cities into areas.
example: assume that a drilling relationship
Store_to_Cashier(Store, Cashier) relates the member One can argue that drilling relationships could always be
Store007 of the parent level Store to the members simpler and have only one child. In the above example,
Cashier00701 and Cashier00702 of the unique child level dummy cities could be used for grouping areas that are
Cashier. The semantics of this relationship is that if we not within a real city. Still, if the cities where not directly
have a measure value characterized by property value related to areas but rather several levels existed among the
Cashier00701 and another, similar measure value, Residence City and Residence Areas then we would need
characterized by the property value Cashier00702, we can dummy members on each intermediate level. We believe
then compute (by applying the proper aggregation that such a modeling solution is semantically wrong since
function) the measure value with the property Store007. we use dummy members that do not correspond to the
The word “similar” in the above definition means that real world. Furthermore, such a solution would result into
apart from the Cashier00701 and Cashier00702 unnatural and complicated drill-down operations.
properties, the two measure values are characterized by
the same set of property values.
3.3 Dimension Paths
Based on the above semantics of a drilling relationship, A dimension path is a set of drilling relationships used to
one can easily see that if the parent level is also a child model a meaningful sequence of drill-down operations. In
level then we can end up with recursive calculations for its simple form, a dimension path is a sequence of drilling
measure values. Such a situation is meaningless in OLAP relationships each having only one child level. In this
applications so we require that the parent level of a chain each child level of a drilling relationship is also the
drilling relationship is always different from the children parent level of the next drilling relationship, except for the
last one. The child level of the last drilling relationship is
1
Note that we do not require the relationship to be strict. called the detailed level [Vass98] of the dimension path.
Two parent-level members can be related to the same
child-level member.

A. Tsois, N. Karayannidis, T. Sellis 5-5


The dimension paths are defined in order to model the relationships defined by the path. The ALL level is related
paths on which the multidimensional analysis is usually to the highest available level of the path and not directly
performed. In the field of multidimensional analysis the to the detailed level.
drill-down and roll-up operations follow pre-designated
paths rather than individual drilling relationships. This For our example scenario we need to define several paths.
means that even if a level is a parent of multiple drilling In fact, each path described in Figure 1 will also be a
relationships the drill-down operation will be performed dimension path of the MAC model. To almost each path
based on only one of these relationships – the one that of Figure 1 we would add the special dimension level
belongs to the dimension path on which the analysis is ALL. Only for the path P7 we would not add the ALL level
currently performed. since it is not meaningful to aggregate the members of the
advertising method. Recall that a product may be related
In order to formally define a dimension path we will first to more than one advertising method. Figure 3 gives a
define what is the graph of a set of drilling relationships. graphical representation of the dimension levels, drilling
Given a set P of drilling relationships, the graph of P is a relationships and dimension paths for the dimensions Item
directed graph with the following properties: and Location of our example scenario.
• Each node of the graph represents a level referenced
by some member of P. Even if a level is referenced $//
by many drilling relationships, as parent or child
level, the graph will contain only one node that
represents that particular dimension level. :DUHKRXVH 6WRUH
• For each drilling relationship in P and for each child 5HJLRQ 5HJLRQ

level of this relationship, the graph contains a


directed edge from the parent of the drilling 3URGXFW :DUHKRXVH 6WRUH &LW\
&ODVV &LW\
relationship to the child level. $GYHUWLVLQJ :DUHKRXVH 6WRUH $UHD
0HWKRG $UHD

We formally define a dimension path to be a non-empty 3URGXFW


*URXS %UDQG
set P of drilling relationships that have the following
:DUHKRXVH 6WRUH
properties: 'LPHQVLRQ
/HYHO
1. In the graph of P exactly one node has no incoming
'ULOOLQJ
edges. 3URGXFW &DVKLHU
UHODWLRQVKLS

2. The graph of P has no circles – it is a DAG. 3 3 3 3


3 'LPHQVLRQ
SDWK
3. There are no two drilling relationships in P having 'LPHQVLRQ
,WHP /RFDWLRQ
the same parent level.

The first of the above properties requires that paths


Figure 3: The dimensions Item and Location
always have a unique detailed level. This is required in
order to guaranty the aggregation semantics of paths. The
detailed level of a path is considered to correspond to the 3.4 Dimensions
source of information. Based on this detailed data all A dimension is a concept used to define meaningful
other aggregation levels can be computed. In other words, groups of dimension paths. This grouping is essential in
through the composition of drilling relationships of a path order to model the semantic relationships that exist among
each member of any level (of this path) is finally linked to the various paths and allows powerful OLAP modeling,
a uniquely identifiable set of detailed level members as we will show in our examples to follow.
[CaTo98].
The dimensions are complex concepts used in our model
Note that, based on the above definition; a dimension path to represent the various properties of measures as well as
has no build-in support for aggregating all members of its assist the process of multidimensional analysis. With their
detailed level although this is a very common operation in complex structure, the dimensions can be considered to
the multidimensional analysis. For example, our first classify properties into various levels, define the
example query requires aggregation for all customers and relationship among properties of different levels and
our second example query requires aggregation for all describe meaningful drilling paths.
products. In order to achieve such an aggregation within a
path, a special dimension level (usually called ALL Formally speaking, a dimension is a non-empty set of
[G++96]) containing a unique member (called all) must be paths. If a dimension contains more than one path then
defined. Furthermore, an artificial drilling relationship each path must have at least one common level with at
that will complete the link of all detailed level members to least one other path of the dimension. The common levels
the unique member of the ALL level must also be defined. must be assigned the same meaning in the various paths
“Complete the link” means that either the members of the where they appear. For example, if we replace in our
detailed level are directly related to the all member or example both Residence City and Store City levels with a
indirectly related to it through a sequence of drilling general City level then the paths P1 and P4 end up having

A. Tsois, N. Karayannidis, T. Sellis 5-6


a common level. Those two paths, although they share a Based on the above definition one can realize that the
common level, cannot be combined into a dimension dimension values depend on both the structure of the
because they assign different meaning to their common dimension and on the particular instances of its levels and
level. The City level in P1 would represent the residence drilling relationships. The set of all possible dimension
cities of customers while in P4 it would represent the values defined for a dimension is called a dimension
cities in which stores are located. domain. Note that although the instance of the dimension
domain depends on the instances of levels and drilling
The reason for grouping dimension paths into dimensions relationships, the schema of the dimension still
is the semantic relationship that usually exists among determines the structure of the dimension domain by
various paths. For example, consider the paths P4 and P5 defining all possible combinations of non-related levels.
of Figure 3. Both the above paths characterize the sales
measures from the store point of view. For both paths, the Take as an example the dimension Client the graph of
level Store is used to represent the stores where the sales which is shown in Figure 4. Each member of the levels
are recorded so it is meaninful to group these two paths Customer, Residence Area, Residence City, Residence
and view them as one Location dimension of the sales Region and Profession is a dimension value. If
measure. Generally speaking we can say that the number Athens_North, Athens_East are two members of the
of paths involved to define a dimension depend on the Residence Area level and doctor, teacher are two
content of the dimension and on the complexity of the members of the Profession level then each of these
scenario being modeled. We define the graph of a members is a dimension value. Also, each combination of
dimension to be the union of the individual graphs of its members: one from the Profession level and one from the
dimension paths. Recall that in the graph each dimension Residence Area may be a valid dimension value. If the
level is represented by a unique node so according to the Customer level contains a member that rolls-up to both
above definitions any dimension graph is a connected the doctor member and the Athens_North member then
graph. Figure 4 gives the graph of the dimension Client. (and only then) the set {Athens_North, doctor} is a valid
dimension value. This customer member would represent
a doctor customer living in the North of Athens. Similar
&XVWRPHU 5HVLGHQFH 5HVLGHQFH 5HVLGHQFH $//
$UHD &LW\ 5HJLRQ constrains must hold in order for the sets {Athens_North,
teacher}, {Athens_East, doctor}, and {Athens_East,
3URIHVVLRQ teacher} to be dimension values. The combination of
members is possible only for non-related levels. This
means that if Patra is a member of Residence City then
Figure 4: The graph of the dimension Client the set {Athens_North, Patra} is definitely not a valid
dimension value. Either Athens_North is an area of the
The dimensions are of extreme importance in our model city Patra and it is redundant to mention Patra or
because they are used in the definition of cubes. Still, a Athens_North is not within Patra and the combination of
cube definition does not involve dimensions as sets of those members is meaningless.
paths but rather as sets of what we call dimension values2.
A dimension value can either be a simple dimension The intuition behind dimension domains is that the
member or it can be a set of dimension members. dimension values represent all possible properties that
Formally speaking, we define a dimension value of a can be used for multidimensional analysis. For example a
dimension D to be either a dimension member of a level query may ask for customers living in Athens_North or it
of D or a set of two or more compatible dimension may ask for customers with the doctor profession that live
members belonging to distinct and non-related levels. in the Athens_North area. Still, it is not meaningful to ask
for customers that live in the Athens_North area and in
Two or more dimension members are called Patra city at the same time, since the latter does not
“compatible” if it is possible to find - within any level of include the former. Furthermore if the Customer level
the dimension – at least one dimension member that can includes no members that are doctors living in the Athens
roll-up (using any combination of drilling relationships of East area then the dimension domain will not include such
the dimension) to each of these members. a combination of values making obvious the answer to
any such queries. So, the structure and content of the
The term “distinct” means that from each level, at most dimension domains can be used as a valuable source of
one dimension member can participate in a dimension information in a semantic query optimization algorithm.
value. Furthermore, two or more levels participating in a
dimension are called “non-related” if there is no path in 3.5 Multidimensional Aggregation Cubes
the directed graph of the dimension linking any two of
these levels. The Multidimensional Aggregation Cube (MAC) is the
main and most complex concept of our model. All other
concepts previously described are directly or indirectly
used in the definition of a MAC. The dimension levels
2
Note the difference among the terms dimension value model properties of measures, the drilling relationships
and dimension member define relationships among levels, the dimension paths

A. Tsois, N. Karayannidis, T. Sellis 5-7


group drilling relationships and the dimensions group their result at a completely different granularity.
paths. The MAC is the only concept that associates Furthermore, even the granularity of source data may
property values with actual measure values and stresses vary. For example, a store may change for a time period
the complex hierarchical structure defined by dimensions. the way it records its sales. The store could record only
the sum of sales per product for all its cashiers and not per
Formally speaking, a multidimensional aggregation cube product and cashier as it used to do before. Another
is an n-way relationship relating N dimension domains. example is when a cube contains predicted and actual
This relationship has one or more attributes which sales. The predicted values may not be computable at the
represent the measures of the MAC. Each instance of this lowest detail level but only at some higher levels. A final
relationship is called a cell and defines a relationship example is when for security reasons the detailed data
among one dimension value from each of the involved may not be available for stores of a particular area.
domains. The cell is annotated with the values of the cube
attributes – the measure values. The N dimension values If we had only single-granularity cubes than each time we
that a cell relates are called the coordinates of the cell. needed a new combination of levels we would have to add
Obviously, the measures of a cube are functionally a new cube to the schema. The schema would get
dependant on its coordinates. complicated and so would the queries. The schema would
not only have to include the dimensions and the cubes but
Assume the following example: The cube C1 is defined also the functional dependencies among the existing
over the domains of the dimensions Location and Item cubes. The queries would have to be aware of the
(Figure 3) having only Sum of sales as its measure. C1 available cubes as well as their functional dependencies.
contain the cells defined in Table 1 where S represents the They would have to select which ones to use and probably
Sum of sales measure: join some of them before defining one or more grouping
operations on parts of those cubes.
Cell Coordinates M
name Item Location S Our approach allows a single cube to include data of all
cell_A Product =P_A Cashier =Cashier00701 10 meaningful granularities. By doing so we simplify the
cell_B Product =P_B Cashier =Cashier00701 20 schema making it usable by the users. Also, this approach
cell_C Brand =B_1 Cashier= Cashier00701 30 allows queries to be expressed in a very elegant and
straightforward way avoiding any declaration of joins.
Table 1: The cells of the example cube C1
Consider the following example: Let C2 be a cube
The measure of cell_A represents the sum of sales done at defined over the domain of the dimensions Time,
the Cashier00701 for the product P_A. Likewise the Location, Item, and Client. The structure of those
measure of cell_B represents the sum of sales done at the dimensions is illustrated in Figure 1 with the addition of
Cashier00701 for the product P_B. Finally, cell_C the special level ALL as described earlier. Assume
represents the sum of sales done at the Cashier00701 for SumOfS is the unique measure of C2 and it represents the
all products of the brand B_1. Assuming that P_A and sum of sales. Using this schema the query Q4, described
P_B are the only products of the brand B_1, the cell_C in section 2, can be expressed in a very simple manner.
then represents the aggregation of cell_A and cell_B. This We use a QBE notation style and define the query Q4 in
means that the measure of cell_C must be equal to the Table 2. The P. notation defines which coordinates and
sum of measures of cell_A and cell_B. measures to be returned in the result set.

The above example reveals the key property of the MAC Time Location Item Client SumOfS
model: the instances of a cube (the cells of a cube) can P.Month P.Store_Area._x P.Brand ALL P.
represent measure values of different granularities even if Year.2000 Store_Area._x ALL ALL >_s*1.1
there is a functional dependency among them. In our Year.1999 Store_Area._x ALL ALL _s
example, cell_A refers to sales measured per Product and
Cashier while cell_C refers to sales measured per Brand Table 2: The query Q4
and Cashier. Those cells, although defined at different
levels of granularity, can be part of the same cube. This is An additional advantage of our approach is that the most
due to the definition of the dimension domain, which typical OLAP operations: drill-down, roll-up, slice and
states that all members of all participating levels are valid dice are translated to simple selection queries on the cube.
dimension values. For example, a drill-down on the results of Q4 to the
Product level can be expressed by simply replacing Brand
We believe that the above property of our cubes is crucial with Product at the Item coordinate of Table 2.
for the compact and intuitive representation of
multidimensional data. As explained in the section 2, the Nevertheless, there is an important argument against this
OLAP users usually handle data defined over various modeling approach. The cube can represent cells with
levels of granularity. Queries may impose selection measures that are functionally dependent but it cannot
conditions on various dimension levels and may require guaranty their consistency. The measure of cell_C could

A. Tsois, N. Karayannidis, T. Sellis 5-8


have been 50 in Table 1 without violating in any way the During the last few years a plethora of multidimensional
definition of the cube C1. Still, from a semantically point data models for data warehouses and OLAP have been
of view this value would be inconsistent with the values proposed. A comparison of some of them can be found in
of cell_A and cell_B. This inability to guaranty [VaSe99] and [SBH99]. We are currently aware of 12
consistency comes from total absence of the involved models that have been published in research papers. Most
aggregation functions. Each measure of a cube is of them are logical data models and only few ([TBC99]
semantically related to some aggregation. In the above [S++98]) can be considered as purely conceptual. Each of
examples the measure Sum of sales is obviously related those models has taken a somehow different modeling
to the SUM aggregation function. Still, the cube definition approach ranging from a simple global table to
does not include this relationship making impossible any sophisticated object classes.
consistency checking.
In order to demonstrate that our model is not ‘yet another’
We have intentionally chosen not to include the multidimensional model we evaluated all 12 published
aggregation functions in the MAC model for a number of models against the requirements described in section 2.
reasons. We decided not to define operations, not to This may not be fair for the purely logical models since
include aggregation functions and not to cover any other the requirements represent conceptual modeling needs.
functional aspects of data in the current model because Still, the evaluation is done only to demonstrate that none
those aspects are orthogonal to the defined concepts. The of the models published so far has the expressive power
model can be extended to include aggregation functions, of MAC. In fact the evaluation shows that one
consistency-checking algorithms and operations without requirement is not satisfied by any of the models and even
changing any of the existing MAC concepts. Furthermore, for the remaining requirements there is no model
a separate functional model can cover those needs. It satisfying all of them. Since our model can satisfy all the
seams that a separate functional model is more suitable requirements of the evaluation we argue that our proposal
because each application domain requires its own specific is an improvement to the existing status.
aggregation functions, operations and consistency
semantics. The requirements of our evaluation are presented in the
following list. Each requirement states what the model
For example in some applications it might be acceptable should be capable of representing within a schema.
to view and analyze data that includes inconsistent parts 1. Levels within dimension (even in the form of simple
(maybe due to missing, incomplete or wrong attributes).
information). A separate functional model can be tailored 2. Grouping/classification relationships among levels.
to cope with operations on such data. By forcing 3. Many-to-many type of grouping/classification
consistency at the data modeling level we would make relationships.
our model inappropriate for those applications. 4. N-way grouping/classification relationships that
relate n dimension levels.
4 Related Work 5. Grouping/classification relationships that do not
require total participation of the involved levels.
Modeling multidimensional data is not an OLAP specific 6. Analysis paths.
issue. In the database community, several research areas 7. Multiple measures as part of one concept.
like statistical databases, scientific databases, 8. Measures defined at any granularity level – for each
geographical databases and temporal databases deal with involved dimension.
multidimensional data. Still, each of these areas has 9. Measure values defined over various granularity
particular modeling needs and has developed specialized levels as part of one concept.
multidimensional data models. The area closest to data 10. Measure values characterized, for some of its
warehouses and OLAP is the statistical database area dimensions, by more than one dimension level
[Shos97] where several multidimensional models have members.
been proposed [OOM85], [RR91]. In fact those models
where proposed long before the appearance of the term Note that the aggregation level of a measure value is the
“OLAP” [Codd93]. lowest dimension level that can be used to characterize
this value. Also, an analysis path is a lattice of
In the data warehouse and OLAP area the first grouping/classification relationships defined on a set of
multidimensional data models where developed by levels. This lattice prevents the user from performing a
product vendors as the research in the OLAP domain has meaningless (according to the schema designer) drill-
followed the evolution of industrial products. Vendors as down or roll-up operation to an arbitrary -outside the
still using and developing their own data models. Also, lattice- level of the dimension.
various standardization bodies have defined their own
models [Meta97] [Olap97] [TPC99]. Due to space
limitations we are not going to discuss any of the
previously referenced models but refer the reader to
[VaSe99] for an overview and comparison of those
models.

A. Tsois, N. Karayannidis, T. Sellis 5-9


1 2 3 4 5 6 7 8 9 10 [AGS97]) do so by providing the appropriate
[AGS97] 9 9 transformation operations. So, this requirement does not
[CaTo98] 9 9 9 9 mean that dimensions and measures are represented in the
[DaTh97] 9 9 9 same manner by the model. We believe that our model
[GoRi98] 9 9 9 9 9 9 can easily support this requirement through the definition
[GyLa97] 9 9 9 9 9 of the proper transformation operations (initially called
[Lehn98] 9 9 9 9 Push and Pull by [AGS97]).
[LiWa96] 9 9 9 9
[PeJe99] 9 9 9 9 9 9 9 9 5 Conclusions
[S++98] 9 9 9 9 9 In this paper we addressed the problem of conceptual
[TBC99] 9 9 9 9 9 9 modeling of data used in multidimensional analysis. We
[Truj99] 9 9 9 presented a set of modeling requirements through the use
[Vass98] 9 9 9 of examples and with those requirements in mind we
MAC 9 9 9 9 9 9 9 9 9 9 defined a new conceptual data model, named MAC. The
proposed model uses concepts familiar to OLAP users,
Table 3: Evaluation of multidimensional models like dimensions, levels, paths, measures and cubes. Those
concepts are properly defined in order to allow modeling
The result of our evaluation is shown in Table 3. Note that of complicated real-world scenarios. Our evaluation and
some of the models ([GyLa97], [Truj99], [AGS97]) comparison to previously published models showed that
represent relationships among levels using user-defined MAC offers a unique combination of modeling skills. Our
functions, which are then used in operations. Also, other model is the first user-centric conceptual model to define
models ([LiWa96], [Lehn98], [DaTh97]) leave the cubes as multi-granularity relationships making both
relationships to be defined by the particular data instances schemas and queries much more simple and intuitive. The
and provide no schema definition for them. In both cases model defines dimension levels, drilling relationships,
we considered that the requirements 2,3,4,5 involving dimension paths and dimensions as first-class and
grouping/classification relationships as part of the schema standalone concepts, making it possible to share those
are not met. concepts among multiple cubes. Furthermore, the
complexity of drilling relationships and the usage of
The requirement not met by any of the models is the analysis paths in the definition of dimensions are
concept of an analysis path. We believe that this additional novelties of our model taking a step beyond the
information is an important structural part of the classical multiple hierarchies. Finally, note that the
dimension design and it should be represented at the definition of dimension domains implicitly represents a
conceptual level. straightforward method for semantic query optimization
at both the schema and the instance level.
Although our model seams to be able to model a broader
range of OLAP scenario than other proposed models, Future work includes the definition of MAC as an
there are a few requirements mentioned in several papers extension to the E/R model and the research of a suitable
([Codd93], [PeJe99], [TBC99] [GyLa97]), which are not logical model on which concepts of our model can be
satisfied by MAC. In our opinion, the most important of mapped. We also plan to define a functional model that
such requirements is the support for correct aggregation will include aggregation functions, derived measures, and
of data. As described in [LeSh97] the measures cannot operations and will define the summarizability [LeSh97]
always be consistently aggregated by an arbitrary of measures as well as other consistency rules.
aggregation function. In order to provide support for
correct aggregations the model must include additional Acknowledgements
information regarding measures and This work has been partially funded by the European’s
grouping/classification relationships. Our model does not Union Information Society Technologies Programme
include such additional information since we believe that (IST) under project EDITH (IST-1999-20722).
this kind of information, as well as information about
aggregation functions and derived measures, can be References
described by an independent functional model which will
supplement MAC. [AGS97] R. Agrawal, A. Gupta, S. Sarawagi: Modeling
Multidimensional Databases. Proc. of the ICDE 1997.
A second important requirement stated by various papers
([GyLa97], [PeJe99], [AGS97]) is the need for symmetric [BCN92] C. Batini, S. Ceri, S. Navathe: Conceptual
treatment of dimensions and measures. It is important to Database Design. Benjamin/Cummings, 1992.
note that what the authors finally mean by symmetric
treatment is the ability to transform a measure into a [CaTo98] L. Cabibbo, R. Torlone: A Logical Approach to
dimension and the other way around. All models claiming Multidimensional Databases. Proc. of the EDBT
to support this requirement ([GyLa97], [PeJe99], 1998.

A. Tsois, N. Karayannidis, T. Sellis 5-10


[ChDa97] S. Chaudhuri, U. Dayal: An overview of Data [PeJe99] T. B. Pedersen, C. S. Jensen: Multidimensional
Warehousing and OLAP technology. ACM SIGMOD Data Modeling of Complex Data. Proc. of the ICDE
Record, 26(1), March 1997. 1999.

[Codd93] E. F. Codd: Providing OLAP to user-analysts: [RR91] M. Rafanelli, F.L. Ricci: A functional model
An IT mandate. E.F. Codd and Associates, 1993. for macro-databases. SIGMOD Record, 20(1), March
1991.
[DaTh97] A. Datta, H. Thomas: A conceptual Model and
an algebra for On-Line Analytical Processing in Data [SBH99] C. Sapia, M. Blaschka, G. Höfling: An
Warehouse. Proc. of the WITS 1997. Overview of Multidimensional Data Models for
OLAP. Technical Report 1999.
[G++96] J. Gray et al.: Data Cube: A Relational http://www.forwiss.tu-muenchen.de/
Aggregation Operator Generalizing Group-By, Cross-
Tab and Sub-Totals. Proc. of the ICDE 1996. [S++98] C. Sapia, M. Blaschka, G. Höfling, B. Dinter:
Extending the E/R model for the Multidimensional
[GoRi98] M. Golfarelli, S. Rizzi: A Methodological Paradigm. Proc. of the DWDM 1998.
Framework for Data Warehouse Design. Proc. of the
DOLAP 1998 [Shos97] A. Shoshani: OLAP and statistical databases:
Similarities and differences. Proc. of the PODS 1997.
[GyLa97] M. Gyssens, L.V.S. Lakshmanan: A
Foundation for Multi-Dimensional Databases. Proc. [TBC99] N. Tryfona, F. Busborg, J. G. B. Chistiansen:
of the VLDB 1997. starER: A Conceptual Model for Data Warehouse
Design. Proc. of the DOLAP 1999.
[Inmo96] W.H. Inmon: Building the Data Warehouse.
John Wiley & Sons, 1996. [TPC99] TPC: TPC Benchmark H and TPC Benchmark
R. Transaction Processing Council. June 1999.
[Kimb96] R. Kimball: The Data Warehouse Toolkit. John http://www.tpc.org/
Wiley & Sons, 1996.
[Truj99] J. Trujillo: The GOLD model: An Object
[Kimb97] R. Kimball: A Dimensional Modeling Oriented multidimensional data model for
Manifesto. DBMS, August 1997. multidimensional databases. Proc. of the ECOOP
1999.
[Lehn98] W. Lehner: Modeling Large Scale OLAP
Scenarios. Proc. of the EDBT 1998 [VaSe99] P. Vassiliadis, T. Sellis: A Survey of Logical
Models for OLAP Databases. SIGMOD Record 28(4),
[LeSh97] H. Lenz, A. Shoshani: Summarizability in Dec. 1999.
OLAP and Statistical Databases. In Proc. of the
SSDBM 1997. [Vass98] P. Vassiliadis: Modeling Multidimensional
Databases, cube and cube operations. Proc. of the
[LiWa96] C. Li, X. S. Wang: A Data Model for SSDBM 1998.
Supporting On-Line Analytical Processing. In Proc. of
the CIKM 1996.

[Mendel] A. O. Mendelzon: Data warehousing and


OLAP: a research-oriented bibliography .
http://www.cs.toronto.edu/~mendel/dwbib.html.

[Meta91] Metadata Coalition: Meta Data Interchange


Specification. (MDIS Version 1.1), August 1997.

[Olap97] OLAP Council: OLAP and OLAP Server


Definitions. 1997
http://www.olapcouncil.org/reasearch/glossary.htm

[OOM85] G. Ozsoyoglu, M. Ozsoyoglu, F. Mata: A


Language and a Physical Organization Technique for
Summary Tables. Proc. of the SIGMOD 1985.

A. Tsois, N. Karayannidis, T. Sellis 5-11

You might also like