Data Warehousing and Data Mining - Unit2
Data Warehousing
• Data
– A raw piece of information that can be moved and stored.
• Database
– An organized collection of such data, managed in tabular form
with relationships between the tables.
• Data Warehouse
– A system that organizes all the data available in an organization,
makes it accessible and usable for all kinds of data analysis,
and allows many reports to be created using mining tools.
Data Warehouse
– “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management’s
decision-making process.”
• Data warehousing:
– The process of constructing and using data warehouses.
– The process of extracting operational data, transforming it into
informational data, and loading it into a central data store
(the warehouse).
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data sources
– relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
[Figure labels: Sales system, Payroll system, Purchasing system, Customer data]
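The hotel price example can be sketched in Python. This is only an illustration under stated assumptions: the two sources, their field names, the exchange rate and the target convention (USD, tax included, breakfast excluded) are all hypothetical and are not taken from any real system.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to the common currency (USD); illustrative values only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}

@dataclass
class HotelPrice:
    """Integrated record: price in USD, tax included, breakfast excluded."""
    hotel: str
    price_usd: float

def from_source_a(rec: dict) -> HotelPrice:
    # Source A (hypothetical): EUR, tax included, breakfast included in the price.
    net = rec["price"] - rec["breakfast_fee"]
    return HotelPrice(rec["hotel_name"], round(net * RATES_TO_USD["EUR"], 2))

def from_source_b(rec: dict) -> HotelPrice:
    # Source B (hypothetical): USD, tax excluded, no breakfast.
    gross = rec["rate"] * (1 + rec["tax_rate"])
    return HotelPrice(rec["name"], round(gross, 2))

a = from_source_a({"hotel_name": "Alpha", "price": 110.0, "breakfast_fee": 10.0})
b = from_source_b({"name": "Beta", "rate": 100.0, "tax_rate": 0.10})
print(a, b)
```

After conversion both records use the same attribute names, currency and tax convention, so their prices can be compared directly.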
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product, sales.
• Focusing on the modeling and analysis of data for decision makers,
not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject issues
by excluding data that are not useful in the decision support
process.
[Figure: Operational data (Sales system, Payroll system, Purchasing system) vs. DW (Employee data, Customer data, Vendor data)]
Data Warehouse—Time Variant
Data Warehouse—Non-Volatile
• A physically separate store of data transformed from the
operational environment.
• Operational update of data does not occur in the data warehouse
environment.
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data.
[Figure: DBMS (Sales system: create, update, delete, insert) vs. DW (Customer data: load, access)]
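The two-operation idea can be illustrated with a small Python sketch. The class name `WarehouseTable` and its methods are hypothetical, chosen only for this illustration; a real warehouse enforces this through its loading process rather than through a class like this.

```python
class WarehouseTable:
    """Append-only table: rows can be loaded and read, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading: append immutable copies of the incoming rows.
        self._rows.extend(tuple(r) for r in rows)

    def access(self, predicate=lambda r: True):
        # Access: read-only query; returns a new list, the stored rows stay unchanged.
        return [r for r in self._rows if predicate(r)]

sales = WarehouseTable()
sales.load([("2023-01", "London", 120), ("2023-01", "Newcastle", 80)])
sales.load([("2023-02", "London", 150)])  # later loads add history, they do not overwrite
london = sales.access(lambda r: r[1] == "London")
print(london)
```

Because no update or delete method exists, earlier rows remain available as history after each new load.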
Data Warehouse Usage
• Three kinds of data warehouse applications
– Information processing
• supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
– Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
– Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
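The basic OLAP operations named above can be sketched on a tiny in-memory fact table. This is a minimal illustration, not how an OLAP server is implemented; the fact rows and function names are made up for the example, and roll-up is shown as one direction of drilling.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, city, product, units_sold)
facts = [
    (2022, "London", "TV", 10),
    (2022, "London", "Radio", 5),
    (2022, "Newcastle", "TV", 7),
    (2023, "London", "TV", 12),
    (2023, "Newcastle", "Radio", 4),
]

def slice_(rows, year):
    """Slice: fix one dimension (year) to a single value."""
    return [r for r in rows if r[0] == year]

def dice(rows, years, cities):
    """Dice: restrict several dimensions to subsets of values."""
    return [r for r in rows if r[0] in years and r[1] in cities]

def roll_up(rows, dim_index):
    """Roll-up: total the units per value of one dimension."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[dim_index]] += r[3]
    return dict(totals)

def pivot(rows):
    """Pivot: cross-tabulate city (rows) by year (columns)."""
    table = defaultdict(dict)
    for year, city, _product, units in rows:
        table[city][year] = table[city].get(year, 0) + units
    return dict(table)

print(roll_up(facts, 0))  # units per year
print(pivot(facts))       # city by year crosstab
```

Drill-down is the reverse of `roll_up`: going back from the totals to the more detailed rows they were computed from.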
General Architecture
[Figure: general architecture. External and internal sources → Data Integration Component (acquisition) → Data Warehouse (with Metadata; Monitoring, Administration; construction & maintenance) → Query and Data Analysis Component (extraction) → OLAP server, OLAP, queries/reports, data mining]
3 main phases
• Data acquisition
– relevant data collection
– Recovering: transformation into the data warehouse model from
existing models
– Loading: cleaning and loading in the DWH
• Storage
• Data extraction
– Tool examples: query reports, SQL, multidimensional analysis (OLAP
tools), data mining
• Maintenance
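The acquisition, storage and extraction phases can be sketched end to end with Python's built-in `sqlite3` module. This is a toy illustration under stated assumptions: the two source formats, the `sales` table and the cleaning rule (drop rows with a missing quantity) are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical operational records from two sources with different schemas.
london_sales = [
    {"item": "TV", "qty": "2", "date": "2023-01-05"},
    {"item": "radio", "qty": "", "date": "2023-01-06"},  # missing quantity
]
newcastle_sales = [("TV", 1, "2023-01-05")]

def transform(london, newcastle):
    """Map both source formats onto one warehouse row model: (city, item, qty, date)."""
    rows = [("London", r["item"].upper(), r["qty"], r["date"]) for r in london]
    rows += [("Newcastle", item.upper(), qty, date) for item, qty, date in newcastle]
    return rows

def clean(rows):
    """Drop rows with a missing quantity and cast the rest to int."""
    return [(c, i, int(q), d) for c, i, q, d in rows if str(q).strip()]

def load(conn, rows):
    """Storage: create the warehouse table if needed and append the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (city TEXT, item TEXT, qty INTEGER, date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, clean(transform(london_sales, newcastle_sales)))

# Extraction: a plain SQL report over the stored data.
totals = conn.execute(
    "SELECT item, SUM(qty) FROM sales GROUP BY item ORDER BY item"
).fetchall()
print(totals)
```

Keeping transformation, cleaning and loading as separate functions mirrors the sub-steps listed under data acquisition.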
DATA WAREHOUSING
THE USE OF A DATA WAREHOUSE
STEP 1: Load the Data Warehouse
[Figure: Inventory database, Newcastle sales DB and London sales DB loaded into the data warehouse (DW), with data marts]
Why Separate Data Warehouse?
• High performance for both systems
– DBMS is tuned for OLTP: access methods, indexing,
concurrency control, recovery
– Warehouse is tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation (aggregation).
• Different functions and different data:
– missing data: Decision support requires historical data
which operational DBs do not typically maintain
– data consolidation: Decision Support requires consolidation
(aggregation, summarization) of data from heterogeneous
sources
– data quality: different sources typically use inconsistent
data representations, codes and formats