Modeling the Data Warehouse and Data Mart


by Paul Winsberg

There's hope yet for all you grizzled old data modelers. Just when you thought you knew everything, along comes the data warehouse. Your associates mutter strange new words like star schema, cubes, and fact tables, and you nod wisely as if you understood. Your head spins as you ponder data warehouses, marts, malls, and stores. Forget about normal form and redundancy; it doesn't matter because it's not transactional. After a dull decade, data modeling is exciting again.

What's different about modeling a data warehouse? That's the question I'll explore in this article.

Warehouse Architecture

The concept of a data warehouse first appeared in articles published in the late 1980s by Bill Inmon. Devlin and Murphy also published an article on the subject in 1988 in the IBM Systems Journal. Early announcements of products and architectures came from IBM in 1991 and Hewlett-Packard in 1993. Since then the technology and terminology have evolved — there are now three important types of data warehouse:

• An enterprise data warehouse provides a central database for decision support throughout the enterprise. This is the approach first envisioned at IBM and by Bill Inmon. Although an enterprise data warehouse may start with a limited subset of enterprise data, it is designed to expand over time. Most enterprise data warehouses are managed and controlled by the central IT organization.

• An operational data store also has a broad, enterprise-wide scope. Unlike the enterprise data warehouse, however, it is refreshed in near real-time and used for routine business activity. Often this kind of warehouse contains detailed customer data in support of customer service. In some respects, it is more like a transactional system than a data warehouse — it contains accurate data and supports business operations. The difference is that it is not updated directly by users, and it may merge and sanitize data from several systems.

• A data mart has a limited scope — it supports a particular region, business unit, or business function. A data mart may contain sales information for a specific region or product line, for example. In comparison, an enterprise data warehouse contains sales data for all regions and most products, or at least is designed with this in mind. A data mart may be built by central IT, but more often than not, it is managed independently by a department or workgroup.

As you can see, the difference between an enterprise warehouse and a data mart is essentially a matter of scope.

Many organizations have implemented a multi-tier warehouse, in which an enterprise warehouse coexists with several data marts. There are several variations on this approach:

• In the top-down approach, data flows from source to enterprise warehouse to data marts. This style of warehouse is usually controlled by the central IT group, and improves the consistency of information in the data marts. It also reduces the number of extracts from operational systems: it takes twenty extracts to load four data marts directly from five operational systems, but only nine extracts to load the data marts indirectly through an enterprise warehouse.

• Alternatively, the enterprise warehouse may evolve bottom-up as a new layer on top of existing data marts. In this case, the data marts are loaded directly from source systems, and the enterprise warehouse is loaded from the data marts. This approach is not ideal, but expedient when data marts are built before the enterprise warehouse.

• Occasionally, a multi-tier warehouse may include elements of both the top-down and bottom-up approaches.

In the long run, a multi-tier warehouse is the best architecture. It provides a single source of clean, integrated data, as well as local stores tailored to the needs of specific groups. However, it is more difficult to build and manage, and in the short term, many organizations will begin with isolated enterprise warehouses or data marts.

Warehouse Models

There are two kinds of data models for enterprise warehouses and data marts:

• A dimensional model represents data as a hypercube or array (see Figure 1).
InfoDB Volume 10 Number 3
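The hypercube of Figure 1 can be sketched as a Python dictionary keyed by dimension values. The dimension names and sales figures below are illustrative, not taken from the article:

```python
# A tiny hypercube: each cell holds a numeric fact (units sold),
# addressed by (time, product, market) coordinates.
sales = {
    ("1996-Q1", "widget", "east"): 120,
    ("1996-Q1", "widget", "west"):  80,
    ("1996-Q2", "widget", "east"): 150,
}

# Summarizing along a dimension means fixing some coordinates
# and adding up the cells that remain.
widget_q1 = sum(units for (t, p, m), units in sales.items()
                if t == "1996-Q1" and p == "widget")
# -> 200
```

The point of the sketch is only that every cell is located by one value per dimension; a multidimensional tool stores and summarizes such arrays natively.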


The values in the array are numeric facts that measure business performance, such as revenue or units sold. The dimensions of the array are parameters of the fact, such as date of sale, region, and product.

Figure 1. Hypercube design (dimensions: time, product, market)

• A tabular model looks quite different from a dimensional warehouse. In fact it looks more like a transactional database, as the name implies, with denormalized, derived, summary, and historic data. It supports queries and reports for a specific business function like finance, customer support, or shipping. Unlike a hypercube, it is not designed around a few numeric facts. It may even contain near real-time data from an operational data store.

These models have different goals. Dimensional design supports numeric analysis — questions about money, amount of sales, and number of customers. Tabular design is not optimized for numeric analysis; it answers both numeric and textual questions. Dimensional design often contains only summary data, while tabular design usually contains detailed data as well as summary data.

These models are implemented in four database technologies — multidimensional, relational, bit-map, and inverted index. Multidimensional tools are also known as OLAP (on-line analytic processing) tools, and come from vendors like Arbor Software and Business Objects. As the name implies, multidimensional tools support only dimensional design. Relational products are historically oriented to tabular design, but in recent years products have evolved. IBM has enhanced the DB2 query optimizer for dimensional design, for example, and vendors like MicroStrategy and Information Advantage market dimensional query tools, known as relational OLAP.

In addition to tabular and dimensional models, two other models are important in warehouse design:

• Source documentation captures the physical data structure of an operational system. Strictly speaking, this is not a data model, because it directly represents the physical database structure. However, we include it on our list because it is the starting point for the modeling process.

• The enterprise model is the holy grail of data administration, circa 1980. In principle, it describes all important data in the enterprise at a high level. In practice, many enterprise modeling projects failed because of high cost and an uncertain payoff. Even a failed project may have some residual value, however. It may be feasible to document subject area models — subsets of your enterprise model that describe a specific part of your business, like customers or products. At the least, the enterprise modeling project usually results in consistent naming conventions and a list of attribute names and definitions, which are useful in mapping source data to the warehouse.

We can portray these models in a diagram with four quadrants (see Figure 2). The top of the diagram depicts the enterprise or data perspective, while the bottom depicts the workgroup or processing perspective. Models on the left are reverse engineered from source systems, while models on the right are forward engineered for new warehouse systems.

Figure 2. Four models for warehouse design. Left half: reverse engineering; right half: forward engineering. Top row (enterprise/data orientation): Enterprise Model, Tabular Model. Bottom row (workgroup/process orientation): Source Documentation, Dimensional Model.

This diagram is not quite accurate. In principle, the scope and structure of a warehouse are independent — either enterprise warehouses or data marts can have a tabular or dimensional design. In practice, most enterprise warehouses are tabular, however, so the diagram is a useful framework for understanding and modeling the warehouse.

Of course you may not actually document all four models when you build a warehouse system. If you are building an isolated data mart, for example, it's enough to document the source and data mart models. However, it is important to have a framework even in simple cases, so that you can plan wisely for future warehouse development.

Note on terminology! I'll use the terms table, key, row, and column to describe enterprise and warehouse models, rather than entity, relationship, instance, and attribute. These terms are simpler and easier to understand. I'll use file, pointer, record, and field to describe all types of source data, including flat file, network, hierarchic, and relational.

From Source Documentation to Enterprise Model

The problem of mapping source documentation to an enterprise model is neither new nor particular to the warehouse. It's called reverse engineering. It's a nasty problem that has been studied at length, and there are many special tricks and techniques. Here are a few of the important transformations:

• create a common key for equivalent records
• create common columns for equivalent fields
• eliminate repeating groups
• extract columns from overloaded fields
• document functional transformations
• document composite transformations

Let's review these transformations one at a time.

Create a common key. There are actually two problems here. At the level of metadata, you must map key fields in source documentation to key columns in the enterprise model. This problem is usually manageable, because keys are often carefully documented in source systems.

The second problem is mapping key values from source to enterprise. This problem is fundamental. Which records represent the same thing? Which records represent different things? This problem is also quite difficult, because you cannot usually apply a simple algorithm — you'll need a variety of heuristics, and in some cases, manual identification.

The metadata map is documented in your dictionary, of course. It's also good practice to record the data map in a table in the warehouse, for maintenance and debugging of warehouse data. Store the map in a separate table, because it has no business value and will interfere with performance and maintenance of the rest of your data.

Create common columns. This problem is similar to creating common keys, and also has metadata and data components. Mapping field names to a common column name is a basic problem in enterprise modeling. It requires a column naming convention for the enterprise model. This usually consists of three parts: a root, a qualifier, and a class. The root is the name of the table that contains the column; it may be omitted when it is clear from the context. The class is a logical data type defined by data administration, like meters or money. The qualifier describes the meaning of the column, and is omitted when the class alone is self-explanatory. An example of this convention is EMPLOYEE-BIRTH-DATE.

The data problem usually occurs with character data like names, addresses, telephone numbers, and dates. Numeric values are usually straightforward — it is relatively easy to map integer data to float, for example. Because of the massive amount of data in many source systems, tools are helpful. Two helpful products are Migration Architect from DBStar, and Integrity from Vality Technologies. Both of these products inspect data in legacy systems and identify anomalous values using statistical techniques.

If the mapping of data values can be expressed with a simple decode table, then record it in the warehouse. This is not always possible, however; recording the mapping of names and addresses may be more trouble than it's worth.

Eliminate repeating groups. Many legacy databases employ repeating groups, or several values embedded in a single logical column. The COBOL OCCURS clause is an example. Once you have identified repeating groups, it is a simple matter to solve this problem. Repeating groups map to a separate table that contains the key of the source record, a column representing the repeating group, and nothing else. For example, an employee record with a repeating skill field maps to a table with two columns, EMPLOYEE-ID and SKILL-CODE.

The primary key of the table consists of both columns. In other words, the table is all key. The first column (EMPLOYEE-ID in our example) is a foreign key to the parent table (EMPLOYEE). The foreign key rule is delete cascade.
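The repeating-group split just described can be sketched in a few lines of Python; the records and field names below are illustrative:

```python
# Split a repeating group out of each source record into an
# all-key child table, keeping everything else in the parent.
def split_repeating_group(records, key_field, group_field):
    """Return (parent_rows, child_rows) for a repeating group."""
    parents, children = [], []
    for rec in records:
        parents.append({k: v for k, v in rec.items() if k != group_field})
        for value in rec[group_field]:
            # The child table is all key: (EMPLOYEE-ID, SKILL-CODE).
            children.append((rec[key_field], value))
    return parents, children

employees = [
    {"EMPLOYEE-ID": "e1", "NAME": "Smith", "SKILL-CODE": ["cobol", "sql"]},
    {"EMPLOYEE-ID": "e2", "NAME": "Jones", "SKILL-CODE": ["c"]},
]
parents, skills = split_repeating_group(employees, "EMPLOYEE-ID", "SKILL-CODE")
# skills -> [("e1", "cobol"), ("e1", "sql"), ("e2", "c")]
```

The delete-cascade rule mentioned above is not shown here; in a real database it would be declared on the child table's foreign key.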



Extract columns from overloaded fields. An overloaded field contains more than one kind of data. There are two kinds of overloaded fields — alternating and encoding (see Figure 3). Alternating values occur when different kinds of information appear in the same field of alternate records. Encoding occurs when several kinds of information appear in a single field. Encoding can involve string concatenation or a complex numeric algorithm. Overloading is a common problem in flat file, hierarchic, and network systems. Unpack overloaded fields into several columns of the enterprise model, and document the transformation.

Figure 3. Overloaded fields

ALTERNATING DATA (QUANTITY means sales for AGENT but lines of code for PROGRAMMER)

EMPLOYEE_ID  JOB_TYPE    QUANTITY
e1           programmer  2550
e2           programmer  1820
e3           agent       14
e4           programmer  4325
e5           agent       4

ENCODED DATA (if NUMBER is even, the employee is salaried; if not, hourly)

EMPLOYEE_ID  SS_NUMBER    FULL_NAME      NUMBER
234          123-45-6789  Ronald Reagan  1543
235          987-65-4321  Nancy Reagan   6320
238          111-11-1111  Dick Butler    1555

Document functional transformations. Functional transformations are usually easy — just apply a formula to convert a source field to an enterprise column. This includes simple unit conversions, for example, or complex calculations of insurance premiums. If the algorithm is too complex to document in the dictionary, reference a computer program or procedure that implements it.

Document composite transformations. Often an enterprise column is derived as a combination of transformations. For example, you may need to extract a dimension of a part from an overloaded descriptor and convert from feet to meters. An insurance premium may be derived from several source columns. Because an enterprise column may be derived from many source fields, record the transformation(s) in the enterprise model, not in the source documentation.

From Enterprise Model to Tabular Model

You probably won't find a complete enterprise model, but no matter. You probably won't have time to build a complete warehouse model either. It's better to take an incremental approach and convert one subject area at a time according to business need. If you can't find subject area models, now is a good time to build one — you'll do this work as you model the warehouse, one way or another. For a quick start, you can purchase generic subject area models from vendors like Prism Solutions.

Most likely your subject area models are in third normal form and contain no derived or summary data. They usually consist of a central independent entity and related dependent and subset entities. Some entities may have relationships to external entities in other subject areas. If you're lucky, you'll find this basic documentation in your dictionary:

• Entities have a name, description, primary key, and attributes.

• Attributes have a name, description, data type, and minimum (zero or one).

• Relationships have a name, description, minimum (zero or one), and maximum (one or many).

Now you're ready to convert subject models to a tabular model. Here are some of the more significant transformations:

• Filter operational data.
• Model historic data.
• Model changing metadata.
• Introduce derived data.
• Summarize transactions.
• Merge tables for performance.
• Convert column-wise data to row-wise.
• Restructure according to data volatility.
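Before working through these steps, the encoded NUMBER field of Figure 3 makes a compact example of the unpacking discussed earlier. This is only a sketch; the PAY-TYPE column name is my own, not from the article:

```python
# Unpack the encoded NUMBER field of Figure 3 into an explicit
# enterprise column. The decode rule comes from the figure:
# an even NUMBER means salaried, an odd NUMBER means hourly.
def unpack_pay_type(record):
    rec = dict(record)  # leave the source record untouched
    rec["PAY-TYPE"] = "salaried" if rec["NUMBER"] % 2 == 0 else "hourly"
    return rec

row = unpack_pay_type({"EMPLOYEE_ID": 234, "NUMBER": 1543})
# row["PAY-TYPE"] -> "hourly"
```

Whatever the actual encoding, the pattern is the same: compute the hidden value once, store it as its own column, and document the rule in the dictionary.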


Filter operational data. Subject area models contain many attributes that are not interesting to business executives. For example, an insurance policy table may contain three different dates. Date written indicates when the insurance agent closed the deal. Date of record indicates when the policy was recorded in the database. Effective date indicates when coverage begins. Date of record is probably not interesting; effective date is certainly interesting; and date written may or may not be interesting.

Relevance of each attribute to the warehouse is often a subjective matter. For example, the telephone number of a customer contact is probably useless, but the area code may be useful for demographic analysis. Special handling instructions for a package may not be relevant, but priority shipping status may be important. In the interest of rapid development, I suggest you begin with a minimal warehouse model and extend it incrementally as business needs evolve.

Model historic data. Historic data is one of the most significant differences between the subject and warehouse models. In an operational system, for example, you probably require the current zip code for each customer. In the warehouse you will need all past zip codes to analyze purchasing trends by region. The same applies to marital status and other demographic information. Of course some entities in the subject model are already historic. Our insurance policy entity already has an effective date, for example. These map directly to the warehouse model, as is.

There are three types of historic data — event, continuous, and interval (see Figure 4). If an entity has event or continuous data, add date or datetime to the primary key in the warehouse model. For interval data you must add begin date to the primary key, and end date as a nonkey column. Use a special value like December 31, 9999, to denote the end of an ongoing period.

Figure 4. Three types of historic data

VIOLATION (event): LICENSE_NUMBER + DATETIME; SECTION_NUMBER, BADGE_NUMBER, WEATHER_CODE
EMPLOYEE_HISTORY (continuous): EMPLOYEE_ID + DATE; JOB_TITLE, SALARY, GRADE_CODE
MARRIAGE_HISTORY (interval): PERSON_ID + BEGIN_DATE; END_DATE, SPOUSE_FULL_NAME, LICENSE_NUMBER, RECORDING_STATE_CODE

Model changing metadata. It's not just the operational data that changes. The metadata also changes. Your subject models evolve over time as columns are added, removed, and revised. This is usually not a problem in the subject model, which is usually current, but it is a concern in the warehouse model.

The warehouse model must include all (relevant) attributes from subject models, past and present. Document the effective begin and end dates for each attribute in your dictionary. In the warehouse database, columns which are new or expired will contain null values. You can eliminate nulls by placing these columns in subset tables, but this may cause additional joins and slower queries.

Introduce derived data. The problem of derived data is familiar to all modelers. Derived data does not usually appear in the subject model, because it introduces redundancy to the operational data, which makes the database difficult to manage. Data management is much simpler in the warehouse because it is not updated on-line, so redundancy is acceptable.

Introduce frequently computed values to the warehouse model. For example, if the model includes billing date and receipt date, you may also record elapsed time. There are two reasons to include derived data — it reduces query time at the cost of increased storage, and it makes the warehouse easier to understand and use.

Summarize transactions. Tables that record business transactions are often quite large. Consider a credit card company, for example:

10 million accounts
20 purchases per account per month
60 months of historic data
200 bytes per transaction
——
2.4 terabytes

Modern databases can process this amount of data, of course, but certain queries will run quite slowly. Now let's look at a telephone company:

30 million customers
10 calls per customer per day
1000 days of historic data
100 bytes per call
——
30 terabytes
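Both totals follow from straight multiplication; a quick sketch to verify them:

```python
# Verify the two sizing estimates above.
def size_tb(entities, txns_per_period, periods, bytes_per_txn):
    """Raw transaction volume in terabytes (1 TB = 1e12 bytes)."""
    return entities * txns_per_period * periods * bytes_per_txn / 1e12

credit_card = size_tb(10_000_000, 20, 60, 200)    # -> 2.4
telephone   = size_tb(30_000_000, 10, 1000, 100)  # -> 30.0
```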



As you can see, some transactional databases may be too large to track over an extended period at the atomic level. In this case you may want to replace the detailed transactions with summaries in the warehouse model. In the case of the credit card company, you might track the amount spent by each customer per month. This dramatically reduces the physical size of the warehouse.

It's always better to carry detailed transactions forward from the subject model to the warehouse. Sooner or later, warehouse users will need to drill down to individual transactions to ferret out anomalies. You should only replace transactional data with summary data when absolutely necessary.

You may wish to augment transactional data with summaries, either in the data warehouse or data marts. This is an entirely different matter, since no information is lost. We'll consider this when we look at the data mart model.

Merge tables. If two tables frequently participate in a join query, you can improve performance by merging them. The issues are primarily performance — a merger optimizes queries that access both tables, but slows queries that access just one. You'll need to decide which is more significant. As an alternative, you can store both the merged and separate tables. This is a data warehouse; redundant and replicated data are acceptable.

There are special considerations for tables related by a one-to-one relationship. In this case the merger will not introduce redundancy or increase storage, but may introduce null values (see Figure 5). There are several cases to consider:

• If the tables have identical primary keys, it is usually a good idea to merge. The ISA relationship between superset and subset tables is an example of this situation.

• If the tables have different primary keys and their relationship is required, it is probably a good idea to merge. By required relationship, I mean that there is an exact correspondence between rows.

• If the tables have different primary keys and their relationship is optional, it is probably a bad idea to merge. By optional relationship, I mean that the correspondence is inexact — some rows may not match up.

If the tables are related by a many-to-one relationship, then merging the tables introduces redundancy and increases physical storage. This is a classic example of denormalization — the merged table is in second normal form. Although this practice is usually unwise in an operational database, it is perfectly acceptable in the warehouse model.

Figure 5. Merging tables connected by a one-to-one relationship

EMPLOYEE (EMPLOYEE_ID, SS_NUMBER, FULL_NAME, SPOUSE_ID, JOB_TYPE, DEPT_CODE) related by EXECUTES, 1(1):1(0), to PROJECT (PROJECT_ID, DESCRIPTION, BUDGET_AMOUNT, ROLE_NAME). The merged table:

EMPLOYEE_PROJECT
EMPLOYEE_ID  JOB_TYPE    DEPT_CODE  PROJECT_ID  DESCRIPTION   BUDGET_AMOUNT  ROLE_NAME
e1           programmer  d3         p1          xyz database  19,000         supervisor
e2           agent       d2         null        null          null           null
e3           agent       d2         p2          gizmo test    48,000         scientist
e4           clerk       d1         null        null          null           null
e5           programmer  d2         null        null          null           null
e6           programmer  d3         p2          widget sales  4,000          manager

Convert column-wise data. A vector is a fixed number of related values. For example, suppose we are tracking enrollment at a university, and each year has exactly five enrollment figures. We can implement this vector either row-wise or column-wise (see Figure 6). What are the tradeoffs?

The column-wise design is usually preferred for operational systems, because it is more flexible. It does not impose any limits on the number of values in the vector, and will never have any null values. Row-wise design requires much less physical storage, which is significant for large tables with many rows. It is also more natural and understandable for users — it looks like a spreadsheet. Of course you can store data column-wise and convert it to a row-wise report, but the conversion is hard to code and quite slow for large tables.

The advantages of row-wise design are particularly important for reports and read-only queries. For this reason, it is often a good idea to convert column-wise designs to row-wise in the warehouse model.

Figure 6. Row-wise and column-wise design

ROW_WISE
YEAR  FRESH  SOPH  JUNIOR  SENIOR  GRAD
1981  143    102   205     257     50
1982  187    99    163     120     38

COLUMN_WISE
YEAR  CLASS   ENROLLMENT
1981  frosh   143
1981  soph    102
1981  junior  205
1981  senior  257
1981  grad    50
1982  frosh   187
1982  soph    99
1982  junior  163
1982  senior  120
1982  grad    38

Restructure according to data volatility. Some warehouse data changes frequently, some changes infrequently, and some not at all. Try to organize columns in tables according to volatility. This will speed up the batch load process and reduce the amount of historic data.

Reorganizing data in this way leads to smaller tables, which contradicts the merge tables guideline. Usually merging tables is more important because of the high cost of joining tables, but you'll have to use your judgment.

From Tabular Model to Dimensional Model

As you recall, we assume the enterprise warehouse is tabular and the data mart is dimensional. We'll ignore the remaining two cases:

• Dimensional enterprise warehouses are not common.

• Tabular data marts are easy to derive from a tabular enterprise warehouse — just apply more tabular transformations.

How do you map a tabular enterprise warehouse to a dimensional data mart? We'll consider this problem in terms of a star schema, which implements a dimensional design in a relational database. First, determine the basic parameters of your star schema — business process, facts, dimensions, and levels — as outlined in the sidebar, What Is a Star Schema? Next, apply these transformations to the warehouse model:

• Denormalize dimension tables.
• Introduce dimension keys directly into a fact table.
• Create artificial keys for performance.
• Eliminate detailed data from the fact table.
• Build the time dimension table.

Denormalize dimension tables. Collapse all information about each dimension into one table; this makes the data mart easier to understand and use. This is a classic denormalization, and an extension of the merge tables transformation from enterprise to warehouse model. However, it is a distinct step since it applies specifically to the dimension tables.

Introduce dimension keys. In the enterprise warehouse, the relationship between dimension and fact may be rather circuitous. Often a three- or four-way join is necessary to derive this information. In a star schema, foreign keys to all dimensions appear in the fact table, so that all join paths are direct. In this step, derive dimensional foreign keys from the enterprise warehouse, and insert this information in the fact table.

Create artificial keys. A natural key is a primary key that is known to the business user, such as social security number or department code. An artificial key is designed by the database administrator and hidden from the business user. Artificial keys are usually integers. They are usually much smaller than natural keys. A composite natural key like (POLICY-NUMBER, DATETIME) may be twenty bytes, for example. The artificial key for the same table may be four bytes.

Artificial keys are important in a star schema because the size of the fact table is critical to performance. Large data marts may contain gigabytes of rows. Replacing the natural key of a dimension table with an artificial key can reduce row size by ten or twenty bytes per key. Introducing artificial keys for all dimension tables may trim the fact table by hundreds of gigabytes.

Introducing artificial keys to an operational system is a delicate decision, because it results in more joins. It's easy in the star schema, however, because every query joins all dimension tables to the fact table regardless of the key.

Eliminate detailed data. The enterprise warehouse usually includes both detailed data and summary data. Detailed data is critical in the warehouse because, sooner or later, business users may want to drill down to atomic transactions to understand anomalies and trends. If the data is not in the enterprise warehouse, it is not available.

The requirements are different for a dimensional data mart. Detailed data is usually undesirable for three reasons:

• A data mart is designed for a particular group interested in a specific business process.


8

What Is a Star Schema?


CHANNEL-ID CHANNEL-ID

A star schema is a relational


design with one fact table and
many dimension tables. The fact table
TYPE
Dimension Table
PROD#
DATE
PROD#
NAME
DATE SALES CATEGORY
contains one or several numeric col-
DAY-IN-WEEK UNITS COST
umns that measure business per-
formance. Examples are revenue, Fact Table SUPPLIER
MONTH-NAME
costs, number of products sold, PACKAGING
Dimension Table
number of shipments, and number of
passengers. Facts must be numeric, Star schema
since the purpose of the data mart is
to understand business performance
by summarizing data. It is not possible to summarize textual information. It is possible to summarize numeric information only, with sums, averages, and counts.

Some fact tables represent events or assignments, such as a student attending a course or a product placed on special promotion. There is no numeric data in this case — we simply want to know that an event or assignment took place. Events are usually represented as a fact with value one. The fact table is called factless.

The dimension tables describe the facts. If your fact is number of products sold, the dimensions might be time, location, and product. Time describes when the product sold; location describes what store sold the product; and product describes the product that was sold. Each dimension table contains many descriptive columns. The location dimension contains store name, city, state, and store manager, for example. These columns are usually textual, since they describe the data in the fact table.

The dimension tables are usually highly denormalized, to simplify the design. You can normalize the dimension tables to reduce their physical size and eliminate redundancy. This is called a snowflake schema. However redundancy is unimportant since there are no on-line updates in a data mart. Physical size of dimension tables is also insignificant, because overall size of the data mart is dominated by the fact table. Simplicity is far more important than either of these factors.

Most dimensions are hierarchic. For example, time can be summarized by hour, day, week, month, quarter, and year. Some dimensions contain several implicit hierarchies. Location can be summarized by postal code, state, and country, or by sales office, sales district, and region.

When you design the fact table, you must determine the level of summary to store. At the lowest level, you can store transactions like individual sales of specific products. However the size of the fact table may be excessive, and you may not need this detail anyway. You may want to summarize total revenue by day by product line. As Colin White points out in his data warehouse seminar [2], you can visualize this process by drawing a vertical line for each dimension, determining the level of summary on each line, and connecting the dots.

To optimize performance of the star schema, you may want to store facts at several levels simultaneously. For example, you might store revenue by day by product line by store, and by quarter by product by region. You can store each level in a separate fact table, or in the same fact table. Although either design is feasible, the multi-table design is probably easier to maintain and use [6].

Here is a procedure for designing a star schema:

• Select a business process for analysis.
• Identify numeric measures of performance, or facts, about this process.
• Determine important dimensions for the facts.
• List the columns that describe each dimension.
• Determine the lowest level of summary stored in the fact table.

A star schema is essentially the same as a hypercube design for a multidimensional tool. The fact table is logically equivalent to a hypercube, except that the dimensions appear explicitly as values in columns. One difference is that a star schema may contain more dimensional information.

There are significant differences between tools that implement these models. Multidimensional (OLAP) and relational (ROLAP) tools have different performance characteristics and user interfaces, for example. However these differences have little to do with the logical structure of the star schema and hypercube models.
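The star schema described above can be sketched in SQL. The following is a minimal, hypothetical example — the table names, columns, and sample rows are illustrative, not taken from the article — using SQLite through Python so the star-join query can be run end to end:

```python
import sqlite3

# A minimal star schema: one fact table whose compound key is made of
# foreign keys into three denormalized dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key     INTEGER PRIMARY KEY,
    day          TEXT,      -- e.g. '1996-04-01'
    week_number  INTEGER,
    quarter      INTEGER,
    holiday_flag INTEGER    -- descriptive attributes, derivable from the date
);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    store_name   TEXT, city TEXT, state TEXT, store_manager TEXT
);
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT, product_line TEXT
);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim,
    location_key INTEGER REFERENCES location_dim,
    product_key  INTEGER REFERENCES product_dim,
    units_sold   INTEGER,   -- numeric, additive measures (the facts)
    revenue      REAL,
    PRIMARY KEY (time_key, location_key, product_key)
);
""")
conn.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?)",
                 [(1, "1996-04-01", 14, 2, 0), (2, "1996-04-02", 14, 2, 0)])
conn.executemany("INSERT INTO location_dim VALUES (?,?,?,?,?)",
                 [(1, "Downtown", "Oakland", "CA", "Kim")])
conn.executemany("INSERT INTO product_dim VALUES (?,?,?)",
                 [(1, "Jeans", "Apparel"), (2, "Belt", "Accessories")])
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)",
                 [(1, 1, 1, 10, 500.0), (1, 1, 2, 4, 80.0), (2, 1, 1, 6, 300.0)])

# Summarize total revenue by day by product line: the typical star join,
# grouping the fact table by descriptive attributes of the dimensions.
rows = conn.execute("""
    SELECT t.day, p.product_line, SUM(f.revenue)
    FROM sales_fact f
    JOIN time_dim    t ON f.time_key    = t.time_key
    JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY t.day, p.product_line
    ORDER BY t.day, p.product_line
""").fetchall()
```

Note how the query constrains and groups only on dimension columns and aggregates only fact columns; this is the query shape that star schemas are designed to make simple.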

InfoDB Volume 10 Number 3
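The descriptive time attributes mentioned in this article — week number, quarter, holiday flag — can be derived mechanically from a datetime value when the time dimension table is built. A short sketch follows; the holiday list and row layout are illustrative assumptions, not part of the article:

```python
from datetime import date, timedelta

# Hypothetical fixed-date holidays; a real warehouse would load a
# proper corporate holiday calendar instead.
HOLIDAYS = {(1, 1), (7, 4), (12, 25)}

def time_dim_row(key: int, d: date) -> dict:
    """Derive the descriptive attributes of one time-dimension row
    from a plain date value."""
    return {
        "time_key": key,
        "day": d.isoformat(),
        "week_number": d.isocalendar()[1],    # ISO week of year
        "quarter": (d.month - 1) // 3 + 1,
        "holiday_flag": int((d.month, d.day) in HOLIDAYS),
    }

def build_time_dim(start: date, days: int) -> list:
    """Generate one row per calendar day, a common grain for the
    time dimension table."""
    return [time_dim_row(i + 1, start + timedelta(days=i))
            for i in range(days)]

rows = build_time_dim(date(1996, 7, 3), 3)
```

Precomputing these columns once, at load time, is what lets users group by quarter or filter out holidays without writing date arithmetic in every query.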
advance, as to the best summary level.

• If you guess wrong, you can derive new summaries from the enterprise warehouse.

• Detailed data increases the physical size of the star schema and degrades query performance. Response time is critical, since the data mart is a management tool.

When you eliminate detailed data from a dimensional data mart, you must allow users to drill down on demand. This requires a multi-tier warehouse and query tools that work with the enterprise warehouse as well as data marts.

Build the time dimension table. In the warehouse, time is represented with a datetime column in the key of historic tables. This isn't good enough in a star schema. The time dimension table includes descriptive columns like number of week, number of quarter, and holiday flag. These descriptive attributes are derivable from datetime, but included in the dimension table to simplify queries.

Technique

These transformations show the difference between source documentation, enterprise model, warehouse model, and dimensional model. They do not indicate a methodology for building the models. What is the process?

In a recent article [1], Nagraj Alur presents several ways to build data warehouses and marts:

• In the top-down approach, first develop the enterprise data warehouse from the enterprise model or subject areas. Data marts follow after the enterprise warehouse is complete, and a multi-tier architecture evolves.

• In a bottom-up approach, the data marts spring up without control, like weeds in your lawn. The enterprise warehouse may evolve on top of the data marts, but this is problematic, because the data marts do not conform to a common model.

• In a hybrid approach, develop the enterprise model first. Document it at a high level, so you can complete the first pass in four to six weeks. You may model certain subject areas in more detail as warehouse development proceeds. When the outline of the enterprise model is in place, begin building the enterprise warehouse and data marts in parallel. They dovetail nicely over time, since they are based on the same enterprise model. Allow roughly three months to build a data mart.

The top-down approach has an obvious advantage — it leads to a planned, integrated multi-tier solution. Unfortunately, it usually takes over a year and is relatively costly. This is unacceptable in a competitive business climate. Conversely, the bottom-up approach gives quick results and high return on investment, but eventually yields a disintegrated warehouse. The hybrid solution enjoys the advantages of both, as long as the enterprise model is sufficiently developed in advance of the data warehouse and data mart databases.

Alur's article addresses warehouse systems as a whole, but his ideas apply to warehouse models as well:

• In a top-down approach, map models incrementally from source to enterprise to warehouse to data mart, following the sequence outlined in this article. This matches the flow of data from source to data mart.

• In a bottom-up approach, each data mart has a separate mapping. The mappings go directly from source documentation to the dimensional or tabular models. Each mapping implicitly includes incremental transformations from source to enterprise to warehouse to dimensional model. However the increments are not explicitly documented, because enterprise and warehouse models do not exist, and because the primary goal is quick results.

• The hybrid approach is similar to bottom-up, in that you document separate maps for each data mart and for the enterprise warehouse. There is a critical difference, however — you explicitly document the incremental steps of each map, from source to enterprise to warehouse to mart. This enables you to perform the integration at a later time.

In both the top-down and hybrid approaches, you might transform the source documentation directly to the warehouse model, without reference to the intermediate enterprise model. This reduces the amount of documentation. Data integration does not suffer, as long as the warehouse model is derived from the enterprise model.

Summary

We have looked at the problem of modeling the warehouse in three parts — model, mapping, and methodology. From the modeling perspective, the problem has four components:

• Source documentation. This is the starting point for the modeling process. It is not really a logical model, since it directly captures the physical structure of source systems.

• Enterprise model. In practice, this consists of many subject areas at various levels of completion. Don't expect to build a complete, detailed enterprise model.

• Warehouse model. We emphasized enterprise data warehouses and ig-

nored operational data stores, since operational models are similar to enterprise models.

• Dimensional model. We emphasized star schema design and ignored hypercube design, since hypercubes are equivalent to star schemas and less common. We also ignored tabular data marts, since these are simply subsets of the warehouse model.

The key to warehouse design is mapping each model to the next. Each map consists of characteristic transformations:

• Source to enterprise. This is the problem of reverse engineering, which predates data warehousing and is well understood.

• Enterprise to warehouse. Key transformations are filtering, modeling historic data and metadata, merging tables, deriving, summarizing, and column- to row-wise conversions.

• Warehouse to data mart. Key transformations are denormalizing dimension tables, introducing dimension keys to the fact table, creating artificial keys, eliminating derived data, and building the time dimension. Some of these are specific applications of generic enterprise-to-warehouse transformations.

From the methodology perspective, you can develop these models and transformations in three ways: top-down, bottom-up, and hybrid. The top-down approach is slow and costly, and the bottom-up approach is disintegrated. I recommend the hybrid approach; it is relatively quick and practical, yet also integrated and planned. The key is to develop a high-level enterprise model first, and document all data marts and the data warehouse in terms of this model.

Acknowledgments

I would like to thank Colin White and Nagraj Alur of DataBase Associates International for their assistance with this article. A seminar by Colin White [2] and an article by Nagraj Alur [1] were particularly helpful. I would also like to thank Dan Richards of Levi Strauss for his help in reviewing drafts and working out conceptual problems.

The material in this article is extracted from a two-day seminar entitled Modeling the Data Warehouse and Data Mart by Paul Winsberg, which will be available from DataBase Associates in the fall of this year.

References

1. Nagraj Alur, "The Enterprise Data Warehouse and Data Mart Debate." InfoDB, April 1996.

2. Colin White, Building a Data Warehousing System, a two-day seminar available from DataBase Associates International.

3. Djenan Ridjanovic and Robert Mantha, "Metadata Architecture for Legacy Data, Data Warehouse, and User Views." White paper from CSA.

4. Bill Inmon, "Creating the Data Warehouse Data Model from the Corporate Data Model." Prism Solutions Tech Topics, Volume 1, Number 2.

5. Bill Inmon, "Defining the System of Record for the Data Warehouse." Prism Solutions Tech Topics, Volume 1, Number 3.

6. Ralph Kimball, The Data Warehouse Toolkit. John Wiley & Sons, 1996.
