Unit 2
• Schema, in general terms, means "logical structure", so a data warehouse schema describes the logical structure of a data warehouse and the records it contains.
• The concept behind the schema of a data warehouse is the same as that in databases: databases use relational data models for their logical structure, while data warehouses use schemas for the same purpose.
• Schemas are used to understand the complexity of a data warehouse's structure.
• They are essentially a representation of the outer model, that is, the way results are logically derived from the figure, and these figures are built from combinations of fact tables and dimension tables.
• The conceptual modelling of a warehouse comprises three models: the star schema, the snowflake schema, and the fact constellation schema.
Dimensional Modeling
Dimensional modeling represents data with a cube operation, making the logical data representation more suitable for OLAP data management. The concept of dimensional modeling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.
In dimensional modeling, the transaction record is divided into either "facts", which are frequently numerical transaction data, or "dimensions", which are the reference information that gives context to the facts. For example, a sale transaction can be broken down into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order.
Fact Table
Fact tables are used to store the facts or measures of the business. Facts are the numeric data elements that are of interest to the company.
Characteristics of the Fact table
The fact table includes the numerical values of what we measure. For example, a fact value of 20 might mean that 20 widgets have been sold. Each fact table includes the keys to associated dimension tables; these are known as foreign keys in the fact table.
Fact tables typically include a small number of columns but, compared to dimension tables, a large number of rows.
Dimension Table
Dimension tables establish the context of the facts. Dimensional tables store fields that
describe the facts.
Characteristics of the Dimension table
Dimension tables contain the details about the facts. That, as an example, enables the
business analysts to understand the data and their reports better.
The dimension tables include descriptive data about the numerical values in the fact table.
That is, they contain the attributes of the facts. For example, the dimension tables for a marketing
analysis function might include attributes such as time, marketing region, and product type. Since
the record in a dimension table is denormalized, it usually has a large number of columns. The
dimension tables include significantly fewer rows of information than the fact table. The attributes
in a dimension table are used as row and column headings in a document or query results display.
Example: A store summary in a fact table can be viewed by city and state, an item summary by brand, color, etc., and customer information by name and address.
Fact Table
Time ID | Product ID | Customer ID | Units Sold
4 | 17 | 2 | 1
8 | 21 | 3 | 2
8 | 4 | 1 | 1
In this example, the Customer ID column in the fact table is the foreign key that joins to the dimension table. By following the link, we can see that row 2 of the fact table records the fact that customer 3, Gaurav, bought two items on day 8.
Dimension Tables
Customer ID | Name | Gender | Income | Education | Region
1 | Rohan | Male | 2 | 3 | 4
2 | Sandeep | Male | 3 | 5 | 1
3 | Gaurav | Male | 1 | 7 | 3
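To make the foreign-key relationship concrete, here is a minimal Python sketch that follows the Customer ID from the fact table into the dimension table. The field names (time_id, product_id, customer_id, units_sold) are assumptions based on the example tables above; the values are the same rows.

# Minimal sketch: resolving a fact-table row against its customer dimension.
fact_table = [
    {"time_id": 4, "product_id": 17, "customer_id": 2, "units_sold": 1},
    {"time_id": 8, "product_id": 21, "customer_id": 3, "units_sold": 2},
    {"time_id": 8, "product_id": 4,  "customer_id": 1, "units_sold": 1},
]

customer_dim = {
    1: {"name": "Rohan",   "gender": "Male"},
    2: {"name": "Sandeep", "gender": "Male"},
    3: {"name": "Gaurav",  "gender": "Male"},
}

# Follow the foreign key from row 2 of the fact table into the dimension table.
row = fact_table[1]
customer = customer_dim[row["customer_id"]]
print(f"Customer {row['customer_id']} ({customer['name']}) bought "
      f"{row['units_sold']} items on day {row['time_id']}")
# -> Customer 3 (Gaurav) bought 2 items on day 8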
Hierarchy
A hierarchy is a directed tree whose nodes are dimensional attributes and whose arcs model many-to-one associations between dimensional attributes. It contains a dimension, positioned at the tree's root, and all of the dimensional attributes that define it.
Star Schema
A star schema is the elementary form of a dimensional model, in which data are organized into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a log-in. A dimension contains reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a multidimensional data model. The star schema is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of the schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.
Fact Tables
A fact table in a star schema contains the facts and is connected to the dimensions. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of the fact table is generally a composite key made up of all of its foreign keys.
A fact table may contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). A fact table generally contains facts at the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary key of each dimension table is part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value; they are generally descriptive, textual values. Dimension tables are usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities), clients, products, times, and channels.
Snowflake Schema
The snowflake schema is a variant of the star schema in which some dimension tables are normalized, splitting the data into additional tables.
For example, the item dimension table in a star schema is normalized and split into two dimension tables, namely the item and supplier tables. Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
The supplier_key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
Note: Due to the normalization in the snowflake schema, redundancy is reduced; therefore, it becomes easier to maintain and saves storage space.
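As an illustration only, the following SQLite sketch (via Python) shows the snowflaked item dimension described above. The table and column names follow the text; the data types and sample rows are assumptions.

# Minimal sketch of the snowflaked item dimension, using SQLite in memory.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE supplier (
    supplier_key  INTEGER PRIMARY KEY,
    supplier_type TEXT
);

-- In a plain star schema, supplier_type would sit directly in the item table.
-- Snowflaking moves it into the supplier table and keeps only the key here.
CREATE TABLE item (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key)
);
""")
con.execute("INSERT INTO supplier VALUES (1, 'wholesale')")
con.execute("INSERT INTO item VALUES (10, 'Chianti', 'wine', 'BrandA', 1)")
row = con.execute("""
    SELECT i.item_name, s.supplier_type
    FROM item i JOIN supplier s ON s.supplier_key = i.supplier_key
""").fetchone()
print(row)   # -> ('Chianti', 'wholesale')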
Aggregate tables
Aggregate tables, in general, are simply database tables that contain aggregated values.
Imagine that you have a fact table like this in which the granularity is date, product and customer:
Every value that you see for "unit sales" is the number of units of a particular product that
were sold to a particular customer on a particular day. (Here, to help make the point of aggregation
slightly clearer, I've shown the day of sale as an actual date rather than as a pointer. And yes, I
realize that these European-style dates are still in the future – but we'll treat them as the past for
purposes of this demonstration.)
We could run queries against this fact table, and it would return data. For example, we could
ask for the total unit sales of Item No. 150 to Customer 1 on May 4, 2012, and the answer we'd get
is three.
We could also run a query that returns the total sales of Item No. 150 to the same customer
not just for the 4th of May but for that entire month. In order to do this, the system would scan the
fact table looking for 31 separate date entries listing sales of that product to Customer 1 and then
aggregate the unit-sale values from the returned rows. That would be a relatively slow process.
An alternative is to create a fact table that already contains one or more levels of
aggregation. For example, we could aggregate this fact table by month. That would involve finding
all the sales of Item No. 150 to Customer 1 in, say, January 2012, aggregating the data, and putting
the results in a single row. We then, of course, would have to do the same for the other months, the
other products and the other customers.
The new table would look like this:
Now when we had a query that looked for a monthly sales total, we could run it against the
aggregate fact table and find the answer much more rapidly. And of course, this isn't the only
aggregate fact table that we could generate. For example, if we knew the counties that different
customers were located in, we could aggregate their data on that basis.
We could also aggregate by both date and customer up to the level of the month and county,
as in this table excerpt (with Herefordshire being a county in England, for the uninitiated):
In other words, we could create a series of aggregate fact tables, and when a query comes in,
we could run it against the appropriate aggregated table.
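The idea of pre-aggregation can be sketched in a few lines of Python. The rows below are invented for illustration (the article's original table is not reproduced here); only the technique of rolling daily rows up to month, product and customer level comes from the text.

# Minimal sketch: building a month-level aggregate fact table from a day-level one.
from collections import defaultdict
from datetime import date

daily_fact = [
    {"day": date(2012, 5, 4),  "product": 150, "customer": 1, "unit_sales": 3},
    {"day": date(2012, 5, 18), "product": 150, "customer": 1, "unit_sales": 2},
    {"day": date(2012, 5, 4),  "product": 151, "customer": 2, "unit_sales": 7},
]

monthly_fact = defaultdict(int)
for row in daily_fact:
    key = (row["day"].year, row["day"].month, row["product"], row["customer"])
    monthly_fact[key] += row["unit_sales"]

# A monthly query now reads the small aggregate table instead of scanning
# up to 31 daily rows per product/customer combination.
print(monthly_fact[(2012, 5, 150, 1)])   # -> 5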
Data Extraction
Data extraction is the process of collecting or retrieving disparate types of data from a
variety of sources, many of which may be poorly organized or completely unstructured. Data
extraction makes it possible to consolidate, process, and refine data so that it can be stored in a
centralized location in order to be transformed. These locations may be on-site, cloud-based, or a
hybrid of the two.
Data extraction is the first step in both ETL (extract, transform, load) and ELT (extract, load,
transform) processes. ETL/ELT are themselves part of a complete data integration strategy.
Data Extraction and ETL
To put the importance of data extraction in context, it’s helpful to briefly consider the ETL
process as a whole. In essence, ETL allows companies and organizations to
1) consolidate data from different sources into a centralized location and
2) assimilate different types of data into a common format.
There are three steps in the ETL process:
1. Extraction: Data is taken from one or more sources or systems. The extraction locates and
identifies relevant data, then prepares it for processing or transformation. Extraction allows
many different kinds of data to be combined and ultimately mined for business intelligence.
2. Transformation: Once the data has been successfully extracted, it is ready to be refined.
During the transformation phase, data is sorted, organized, and cleansed. For example, duplicate
entries will be deleted, missing values removed or enriched, and audits will be performed to
produce data that is reliable, consistent, and usable.
3. Loading: The transformed, high quality data is then delivered to a single, unified target
location for storage and analysis.
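A minimal, illustrative sketch of the three steps in Python is shown below. The source file name, field names and cleansing rules are assumptions, not part of any particular tool.

# Illustrative ETL sketch: extract from a CSV source, cleanse, load to a target.
import csv

def extract(path):
    """Extraction: pull raw rows out of a source system (here, a CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transformation: cleanse and standardize the extracted rows."""
    seen, clean = set(), []
    for row in rows:
        key = row["order_id"]
        if key in seen:                              # drop duplicate entries
            continue
        seen.add(key)
        row["amount"] = float(row["amount"] or 0)    # fill missing values
        clean.append(row)
    return clean

def load(rows, target):
    """Loading: deliver the refined rows to a single target store."""
    target.extend(rows)

warehouse = []
# load(transform(extract("orders.csv")), warehouse)   # assumed source file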
Data Transformation
Data transformation is the process in which data gets converted from one format to another.
The most common data transformation process involves collecting raw data and converting it into
clean, usable data.
Data transformation increases the efficiency of business and analytic processes, and it
enables businesses to make better data-driven decisions. During the data transformation process, an
analyst will determine the structure of the data. This could mean that data transformation may be:
• Constructive: The data transformation process adds, copies, or replicates data.
• Destructive: The system deletes fields or records.
• Aesthetic: The transformation standardizes the data to meet requirements or parameters.
• Structural: The database is reorganized by renaming, moving, or combining columns.
They’ll also perform data mapping and extract the data from its original source before
executing the transformation. Finally, they’ll store the transformed data within the appropriate
database technology.
Basic Tasks
The major tasks performed during this phase vary depending on the application; however, the basic tasks are discussed here.
Selection: This takes place at the beginning of the whole process of data transformation. You select
either whole records or parts of several records from the source systems. The task of selection
usually forms part of the extraction function itself. However, in some cases, the composition of the source structure may not support selection of the necessary parts during data extraction. In these cases, it is advisable to extract the whole record and then do the selection as part of the transformation function.
Splitting/joining: This task includes the types of data manipulation you need to perform on the
selected parts of source records. Sometimes (uncommonly), you will be splitting the selected parts
even further during data transformation. Joining of parts selected from many source systems is more
widespread in the data warehouse environment.
Conversion: This is an all-inclusive task. It includes a large variety of rudimentary conversions of
single fields for two primary reasons (i) to standardize among the data extractions from disparate
source systems, and (ii) to make the fields usable and understandable to the users.
Summarization: Sometimes you may find that it is not feasible to keep data at the lowest level of
detail in your data warehouse. It may be that none of your users ever need data at the lowest
granularity for analysis or querying. For example, for a grocery chain, sales data at the lowest level
of detail for every transaction at the checkout may not be needed. Storing sales by product by store
by day in the data warehouse may be quite adequate. So, in this case, the data transformation
function includes summarization of daily sales by product and by store.
Enrichment: This task is the rearrangement and simplification of individual fields to make them
more useful for the data warehouse environment. You may use one or more fields from the same
input record to create a better view of the data for the data warehouse. This principle is extended
when one or more fields originate from multiple records, resulting in a single field for the data
warehouse.
To understand these better, let us discuss conversion and enrichment with examples.
Data Transformation Basic Tasks: Conversion (example)
1. Format Revision
Format revisions fix problems that stem from fields having different data types. Some fields
might be numeric, and others might be text. One data system could treat text versus numeric
information differently, so you might have to standardize the formats to integrate source data with
the target data schema. This could involve the conversion of male/female, date/time, measurements,
and other information into a consistent format.
Field lengths can also be an issue—especially if the target schema has smaller character
limits. In these cases, it may be necessary to standardize the length of fields by breaking up long
serial numbers into smaller parts and putting them into separate columns.
Additionally, format revision could involve splitting up a comma-separated list of words or
numbers into multiple columns.
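A small illustrative sketch of format revision, assuming a gender code to standardize and a comma-separated field to split into separate columns:

# Illustrative format revision; field names and codes are assumptions.
raw = {"customer": "C-102", "gender": "M", "interests": "wine,cheese,travel"}

gender_map = {"M": "Male", "F": "Female"}
revised = {
    "customer": raw["customer"],
    "gender": gender_map.get(raw["gender"], "Unknown"),
}
# Split the comma-separated list into numbered columns.
for i, value in enumerate(raw["interests"].split(","), start=1):
    revised[f"interest_{i}"] = value

print(revised)
# -> {'customer': 'C-102', 'gender': 'Male', 'interest_1': 'wine', ...}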
2. Data Derivation
Data derivation involves the creation of special rules to “derive” the specific information
you want from the data source. For example, you might have a database that includes total revenue
data from sales, but you’re only interested in loading the profit figures after subtracting costs and
tax liabilities. Data derivation allows you to create a set of transformation rules that subtract costs
and taxes from the total revenue information.
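A minimal sketch of such a derivation rule, assuming fields for revenue, costs and tax:

# Illustrative derivation rule: profit = revenue - costs - tax.
def derive_profit(row):
    return row["total_revenue"] - row["costs"] - row["tax"]

sale = {"total_revenue": 1000.0, "costs": 620.0, "tax": 80.0}
print(derive_profit(sale))   # -> 300.0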
3. Data Splitting
Data splitting refers to dividing a single column into multiple columns. This is critical for
analyzing the available data; splitting the single column into multiple columns can be useful to
develop "training" and "testing" sets, for example. The "training" gets used for experimental
analysis and making models, while the "testing" set is the untouched, "control" element. Data
splitting can be helpful with a large amount of data gathered over a significant amount of time.
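A minimal sketch of a training/testing split; the 80/20 ratio is only an assumption:

# Illustrative data splitting into training and testing sets.
import random

rows = list(range(100))          # stand-in for extracted records
random.shuffle(rows)
cut = int(len(rows) * 0.8)
training, testing = rows[:cut], rows[cut:]
print(len(training), len(testing))   # -> 80 20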
4. Data Joining
Joining data is one of the most important functions of data transformation. A “join” is an
operation in the SQL database language that allows you to connect two or more database tables by
their matching columns. This allows you to establish a relationship between multiple tables, which
merges table data together so you can query correlating data on the tables.
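A small SQLite sketch of a join on a matching column (table and column names are assumptions):

# Illustrative join on a matching column, mirroring the SQL JOIN described above.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
con.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 120.0), (2, 75.5)])
con.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Rohan"), (2, "Gaurav")])

rows = con.execute("""
    SELECT c.name, s.amount
    FROM sales s
    JOIN customers c ON c.customer_id = s.customer_id
""").fetchall()
print(rows)   # -> [('Rohan', 120.0), ('Gaurav', 75.5)]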
5. Data Summarization
It refers to the creation of different business metrics through the calculation of value totals.
You could sum up the total revenue of all the sales made by the individual salespeople on your staff,
then create sales metrics that reveal total sales for individual time periods.
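A minimal sketch of summarization, totalling revenue per salesperson and overall (the sample rows are assumptions):

# Illustrative summarization: total revenue per salesperson, then an overall total.
from collections import defaultdict

sales = [
    {"salesperson": "Asha", "revenue": 500},
    {"salesperson": "Ravi", "revenue": 300},
    {"salesperson": "Asha", "revenue": 200},
]

totals = defaultdict(int)
for s in sales:
    totals[s["salesperson"]] += s["revenue"]

print(dict(totals))          # -> {'Asha': 700, 'Ravi': 300}
print(sum(totals.values()))  # -> 1000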
6. Key Restructuring
When the tables in a data warehouse have keys with built-in meanings, serious problems can
develop. For example, if a client phone number serves as a primary key, changing the phone
number in the original data source means that the number would have to change everywhere it
appears in the data system. That would cause a cascade of updates that over-burden or slow down
the system.
Through key restructuring, you can transform any keys with built-in meanings to generic
keys—i.e., random numbers that reference back to the source database with the actual information.
By drawing key connections from one table to another, key restructuring optimizes the data
warehouse for speed and efficiency.
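A minimal sketch of key restructuring, replacing a meaningful key (a phone number) with a generated generic surrogate key:

# Illustrative key restructuring: map a natural key to a generic surrogate key.
import itertools

surrogate_of = {}                  # phone number -> generic key
next_key = itertools.count(1)

def surrogate(phone_number):
    if phone_number not in surrogate_of:
        surrogate_of[phone_number] = next(next_key)
    return surrogate_of[phone_number]

print(surrogate("555-0101"))   # -> 1
print(surrogate("555-0102"))   # -> 2
print(surrogate("555-0101"))   # -> 1 (same customer, same generic key)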
7. Data Deduplication
Data deduplication is a data compression process where you identify and remove duplicate
or repeated copies of information. Also referred to as single-instance storage, intelligent
compression, commonality factoring, or data reduction, deduplication allows you to store one
unique copy of data in your data warehouse or database.
The deduplication process analyzes incoming data and compares it to data that's already
stored in the system. If the data is already there, deduplication algorithms delete the duplicate
information while creating a reference to it. If you upload a changed version of a previous file, the system will back up said file while adding the changes to the data segment. Deduplication
algorithms also keep track of outgoing data to delete duplicates, which speeds up the information
transfer process.
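A minimal sketch of deduplication using a content hash; the record layout is an assumption:

# Illustrative deduplication: store one copy of each identical record plus a reference count.
import hashlib, json

store, refs = {}, {}

def deduplicate(record):
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    if digest not in store:
        store[digest] = record                    # first unique copy is stored
    refs[digest] = refs.get(digest, 0) + 1        # later copies become references
    return digest

deduplicate({"customer": 3, "item": 21})
deduplicate({"customer": 3, "item": 21})          # duplicate, not stored again
print(len(store), refs)                           # -> 1 and a reference count of 2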
OLAP
DIMENSIONAL ANALYSIS
One approach to data warehouse design is to develop and implement a dimensional model.
This has given rise to dimensional analysis (sometimes generalized as multi-dimensional analysis ).
It was noticed quite early on when data warehouses started to be developed that, whenever decision
makers were asked to describe the kinds of questions they would like to get answers to regarding
their organizations, they almost always wanted the following:
◦ Summarized information with the ability to break the summaries into more detail
◦ Analysis of the summarized information across their own organizational components such
as departments or regions
◦ Ability to slice and dice the information in any way they chose
◦ Display of the information in both graphical and tabular form
◦ Capability to view their information over time
So, as an example, they might wish to see a report showing Wine Sales by Product, a report showing Sales by Customer, or even Sales by Product by Customer.
Table shows a typical report of sales by product.
Table: The Wine Club - Analysis of Sales by Product for July.
Product Name | Quantity Sold (Cases) | Cost Price | Selling Price | Total Revenue | Total Cost | Gross Profit
Chianti | 321 | 26.63 | 42.95 | 13,787 | 8,548 | 5,239
Bardolino | 1,775 | 15.10 | 31.35 | 55,646 | 26,802 | 28,844
Barolo | 275 | 46.72 | 70.95 | 19,511 | 12,848 | 6,663
Lambrusco | 1,105 | 23.25 | 41.45 | 45,802 | 25,691 | 20,111
Valpolicella | 2,475 | 12.88 | 32.45 | 80,313 | 31,878 | 48,435
This dimensional approach led Ted Codd to make the following observation:
There are typically a number of different dimensions from which a given pool of data can be analyzed. This plural perspective, or multidimensional conceptual view, appears to be the way most business persons naturally view their enterprise.
(E. F. Codd, 1993)
So the concept of dimensional analysis became a method for defining data warehouses.
The approach is to determine, by interviewing the appropriate decision makers in an
organization, which is the subject area that they are most interested in, and which are the most
important dimensions of analysis.
Recall that one of the characteristics of a data warehouse is that it is subject oriented. The
subject area reflects the subject-oriented nature of the warehouse.
In the example above, the subject area would be Sales. The dimensions of analysis would be
Customers and Products. The requirement is to analyze sales by customer and sales by product.
This requirement is depicted in the following three-dimensional cube. Figure shows Sales (the
shaded area) having axes of:
1. Customer
2. Product
3. Time
The diagram in Figure is often referred to as a Star Schema because the diagram loosely
resembles a star shape. The subject area is the center of the star and the dimensions of analysis form
the points of the star. The subject area is often drawn long and thin because the table itself is usually
long and thin in that it contains a small number of columns but a very large number of rows.
The Star Schema is the most commonly used diagram for dimensional models.
Hypercube
At the core of the OLAP concept, is an OLAP Cube. The OLAP cube is a data structure
optimized for very quick data analysis.
The OLAP Cube consists of numeric facts called measures which are categorized by
dimensions. OLAP Cube is also called the hypercube.
Usually, data operations and analysis are performed using the simple spreadsheet, where
data values are arranged in row and column format. This is ideal for two-dimensional data.
However, OLAP contains multidimensional data, with data usually obtained from a different and
unrelated source. Using a spreadsheet is not an optimal option. The cube can store and analyze
multidimensional data in a logical and orderly manner.
OLAP helps in analyzing different extracts and in viewing business data from different points of view. It is often required to group, aggregate, and join data; the structure that supports this is the OLAP cube. The OLAP cube is a data structure optimized for fast data analysis. It mainly consists of numeric facts, called measures, which are categorized by dimensions. OLAP cubes are also termed "hypercubes", and they allow the user to perform multidimensional analytical querying of the required data using the basic OLAP operations such as drill-down, roll-up, slice, dice, and pivot.
There are four types of OLAP operations that can be performed. These are listed below:
1. Roll up
2. Drill down
3. Slice and dice
4. Pivot
Roll-Up
Temperature 64 65 68 69 70 71 72 75 80 81 83 85
Week1 1 0 1 0 1 0 0 0 0 0 1 0
Week2 0 0 0 1 0 0 1 2 0 1 0 0
Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature from the above cube. To do this, we have to group the columns and add up the values according to the concept hierarchy. This operation is known as a roll-up.
By doing this, we obtain the following cube:
Temperature cool mild hot
Week1 2 1 1
Week2 2 1 1
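The same roll-up can be sketched in Python by binning the Week1 temperatures into the cool/mild/hot bands defined above:

# Minimal sketch of the roll-up: group temperature values into bands and sum the counts.
week1 = {64: 1, 65: 0, 68: 1, 69: 0, 70: 1, 71: 0, 72: 0, 75: 0,
         80: 0, 81: 0, 83: 1, 85: 0}

def band(t):
    if t <= 69:
        return "cool"
    return "mild" if t <= 75 else "hot"

rolled_up = {"cool": 0, "mild": 0, "hot": 0}
for temperature, count in week1.items():
    rolled_up[band(temperature)] += count

print(rolled_up)   # -> {'cool': 2, 'mild': 1, 'hot': 1}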
Drill-Down
The drill-down operation (also called roll-down) is the reverse of the roll-up operation. Drill-down is like zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-down can be performed by either stepping down a concept hierarchy for a dimension or adding additional dimensions.
Figure shows a drill-down operation performed on the dimension time by stepping down a
concept hierarchy which is defined as day, month, quarter, and year. Drill-down appears by
descending the time hierarchy from the level of the quarter to a more detailed level of the month.
Because a drill-down adds more details to the given data, it can also be performed by adding a new
dimension to a cube. For example, a drill-down on the central cubes of the figure can occur by
introducing an additional dimension, such as a customer group.
Example
Drill-down adds more details to the given data:
Temperature cool mild hot
Day 1 0 0 0
Day 2 0 0 0
Day 3 0 0 1
Day 4 0 1 0
Day 5 1 0 0
Day 6 0 0 0
Day 7 1 0 0
Day 8 0 0 0
Day 9 1 0 0
Day 10 0 1 0
Day 11 0 1 0
Day 12 0 1 0
Day 13 0 0 1
Day 14 0 0 0
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. For example, a slice operation is executed when the user wants a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice operation performs a selection on one dimension of the given cube, thus resulting in a subcube.
For example, if we make the selection, temperature=cool we will obtain the following cube:
Temperature cool
Day 1 0
Day 2 0
Day 3 0
Day 4 0
Day 5 1
Day 6 1
Day 7 1
Day 8 1
Day 9 1
Day 11 0
Day 12 0
Day 13 0
Day 14 0
Here, slice is performed on the dimension "time" using the criterion time = "Q1". It forms a new subcube by selecting one or more dimensions.
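A minimal sketch of a slice, fixing one dimension (temperature = "cool") of a small day-by-temperature cube; the counts are illustrative:

# Minimal sketch of a slice: fix one dimension of the cube.
cube = {
    ("Day 3", "cool"): 0, ("Day 3", "mild"): 0, ("Day 3", "hot"): 1,
    ("Day 4", "cool"): 0, ("Day 4", "mild"): 1, ("Day 4", "hot"): 0,
    ("Day 5", "cool"): 1, ("Day 5", "mild"): 0, ("Day 5", "hot"): 0,
}

cool_slice = {day: value for (day, temp), value in cube.items() if temp == "cool"}
print(cool_slice)   # -> {'Day 3': 0, 'Day 4': 0, 'Day 5': 1}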
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions. For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR temperature = hot) to the original cube, we get the following subcube (still two-dimensional):
Temperature cool hot
Day 3 0 1
Day 4 0 0
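A minimal sketch of the dice, selecting on two dimensions at once over the same small cube used in the slice sketch:

# Minimal sketch of a dice: select on two dimensions at the same time.
cube = {
    ("Day 3", "cool"): 0, ("Day 3", "mild"): 0, ("Day 3", "hot"): 1,
    ("Day 4", "cool"): 0, ("Day 4", "mild"): 1, ("Day 4", "hot"): 0,
    ("Day 5", "cool"): 1, ("Day 5", "mild"): 0, ("Day 5", "hot"): 0,
}

diced = {key: value for key, value in cube.items()
         if key[0] in ("Day 3", "Day 4") and key[1] in ("cool", "hot")}
print(diced)
# -> {('Day 3', 'cool'): 0, ('Day 3', 'hot'): 1, ('Day 4', 'cool'): 0, ('Day 4', 'hot'): 0}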
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. It may involve swapping the rows and columns or moving one of the row dimensions into the column dimensions.
Consider the following diagram, which shows the pivot operation.
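A minimal sketch of a pivot, swapping the row and column dimensions of the small week-by-temperature cube from the roll-up example:

# Minimal sketch of a pivot: swap the row and column dimensions.
cube = {
    ("Week1", "cool"): 2, ("Week1", "mild"): 1, ("Week1", "hot"): 1,
    ("Week2", "cool"): 2, ("Week2", "mild"): 1, ("Week2", "hot"): 1,
}

pivoted = {(temp, week): value for (week, temp), value in cube.items()}
print(pivoted[("cool", "Week1")])   # -> 2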
Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end tools. To
store and manage warehouse data, ROLAP uses relational or extended-relational DBMS.
ROLAP includes the following −
• Implementation of aggregation navigation logic.
• Optimization for each DBMS back end.
• Additional tools and services.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of
data. With multidimensional data stores, the storage utilization may be low if the data set is sparse.
Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed information; the aggregations are stored separately in a MOLAP store.
Other OLAP models are:
Desktop OLAP (DOLAP): In Desktop OLAP, a user downloads a part of the data from the database locally, onto their desktop, and analyzes it. DOLAP is relatively cheaper to deploy as it offers very few functionalities compared to other OLAP systems.
Web OLAP (WOLAP): Web OLAP is an OLAP system accessible via the web browser. WOLAP has a three-tiered architecture consisting of three components: client, middleware, and a database server.
Mobile OLAP: Mobile OLAP helps users to access and analyze OLAP data using their mobile devices.
A MOLAP system is based on a native logical model that directly supports multidimensional
data and operations. Data are stored physically into multidimensional arrays, and positional
techniques are used to access them.
One of the significant distinctions of MOLAP against ROLAP is that data are summarized and stored in an optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model, data are structured into proprietary formats according to the client's reporting requirements, with the calculations pre-generated on the cubes.
MOLAP Architecture
MOLAP Architecture includes the following components
• Database server.
• MOLAP server.
• Front-end tool.
MOLAP structure primarily reads the precompiled data. MOLAP structure has limited capabilities
to dynamically create aggregations or to evaluate results which have not been pre-calculated and
stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well
suited for MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server, Sniper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.
Some of the problems faced by clients are related to maintaining support for multiple subject areas in an RDBMS. Some vendors can solve these problems by continuing access from MOLAP tools to detailed data in an RDBMS.
This can be very useful for organizations with performance-sensitive multidimensional
analysis requirements and that have built or are in the process of building a data warehouse
architecture that contains multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g.,
product and sales region) to be stored and maintained in a persistent structure. This structure would
be provided to reduce the application overhead of performing calculations and building aggregation
during initialization. These structures can be automatically refreshed at predetermined intervals
established by an administrator.
Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for
slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only possible, but they also return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary and does not already
exist in the organization. Therefore, to adopt MOLAP technology, chances are other investments in
human and capital resources are needed.