Data Warehousing Interview Questions and Answers
Data Warehousing Interview Questions and Answers
What's A Data warehouse
Answer1:
A Data warehouse is a repository of integrated information, available for queries and analysis. Data
and information are extracted from heterogeneous sources as they are generated. This makes it much
easier and more efficient to run queries over data that originally came from different sources".
Another definition for data warehouse is: " A data warehouse is a logical collection of information
gathered from many different operational databases used to create business intelligence that supports
business analysis activities and decision-making tasks, primarily, a record of an enterprise's past
transactional and operational information, stored in a database designed to favour efficient data
analysis and reporting (especially OLAP)". Generally, data warehousing is not meant for current
"live" data, although 'virtual' or 'point-to-point' data warehouses can access operational data. A 'real'
data warehouse is generally preferred to a virtual DW because stored data has been validated and is
set up to provide reliable results to common types of queries used in a business.
Answer2:
Data Warehouse is a repository of integrated information, available for queries and analysis. Data and
information are extracted from heterogeneous sources as they are generated....This makes it much
easier and more efficient to run queries over data that originally came from different sources.
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet
the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are
designed differently than traditional relational databases.
What is ODS?
A dimensional table is a collection of hierarchies and categories along which the user can drill down
and drill up. it contains only the textual attributes.
A lookUp table is the one which is used when updating a warehouse. When the lookup is placed on
the target table (fact table / warehouse) based upon the primary key of the target, it just updates the
table by allowing only new records or updated records based on the lookup condition.
Why should you put your data warehouse on a different system than your OLTP system?
Answer1:
A OLTP system is basically " data oriented " (ER model) and not " Subject oriented "(Dimensional
Model) .That is why we design a separate system that will have a subject oriented OLAP system...
Moreover if a complex querry is fired on a OLTP system will cause a heavy overhead on the OLTP
server that will affect the daytoday business directly.
Answer2:
The loading of a warehouse will likely consume a lot of machine resources. Additionally, users may
create querries or reports that are very resource intensive because of the potentially large amount of
data available. Such loads and resource needs will conflict with the needs of the OLTP systems for
resources and will negatively impact those production systems.
Aggregate table contains the summary of existing warehouse data which is grouped to certain levels
of dimensions.Retrieving the required data from the actual table, which have millions of records will
take more time and also affects the server performance.To avoid this we can aggregate the table to
certain required level and can use it.This tables reduces the load in the database server and increases
the performance of the query and can retrieve the result very fastly.
Dimensional Modelling is a design concept used by many data warehouse desginers to build thier
datawarehouse. In this design model all the data is stored in two types of tables - Facts table and
Dimension table. Fact table contains the facts/measurements of the business and the dimension table
contains the context of measuremnets ie, the dimensions on which the facts are calculated.
Data modeling is probably the most labor intensive and time consuming part of the development
process. Why bother especially if you are pressed for time? A common response by practitioners who
write on the subject is that you should no more build a database without a model than you should
build a house without blueprints.
The goal of the data model is to make sure that the all data objects required by the database are
completely and accurately represented. Because the data model uses easily understood notations and
natural language , it can be reviewed and verified as correct by the end-users.
The data model is also detailed enough to be used by the database developers to use as a "blueprint"
for building the physical database. The information contained in the data model will be used to define
the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed
database will require more time in the long-term. Without careful planning you may create a database
that omits data required to create critical reports, produces results that are incorrect or inconsistent,
and is unable to accommodate changes in the user's requirements.
Data mining is a process of extracting hidden trends within a datawarehouse. For example an
insurance dataware house can be used to mine data for the most high risk people to insure in a certain
geographial area.
What is ETL?
ETL provide developers with an interface for designing source-to-target mappings, ransformation
and job control parameter.
· Extraction
Take data from an external source and move it to the warehouse pre-processor database.
· Transformation
Transform data task allows point-to-point generating, modifying and transforming data.
· Loading
Load data task adds records to a database table in a warehouse.
What does level of Granularity of a fact table signify?
Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By granularity,
we mean the lowest level of information that will be stored in the fact table. This constitutes two
steps:
OLTP: customer-oriented, used for data analysis and querying by clerks, clients and IT professionals.
OLAP: market-oriented, used for data analysis by knowledge workers( managers, executives,
analysis).
2. Data Contents
OLAP: manages large amounts of historical data, provides facilities for summarization and
aggregation, stores information at different levels of granularity to support decision making process.
3. Database Design
OLAP: adopts star, snowflake or fact constellation model and a subject-oriented database design.
4. View
OLAP: spans multiple versions of a database schema due to the evolutionary process of an
organization; integrates information from many organizational locations and data stores
Ex: a customer address modified we update existing record with new address.
A) Effective Date
B) Versions
C) Flags
or combination of these
SCD3: by adding new columns to target table we maintain historical information and current
information.
Why are OLTP database designs not generally a good idea for a Data Warehouse?
Since in OLTP,tables are normalised and hence query response will be slow for end user and OLTP
doesnot contain years of data and hence cannot be analysed.
BUS Schema is composed of a master suite of confirmed dimension and standardized definition if
facts.
1. MS-Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, Power Play)
4. Microstrategy
5. MS reporting services
6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity
What is Normalization, First Normal Form, Second Normal Form , Third Normal Form?
Fact Table contains the measurements or metrics or facts of business process. If your business process
is "Sales" , then a measurement of this business process such as "monthly sales number" is captured in
the Fact table. Fact table also contains the foriegn keys for the dimension tables.
Answer1:
Conformed dimensions mean the exact same thing with every possible fact table to which they are
joined Ex:Date Dimensions is connected all facts like Sales facts,Inventory facts..etc
Answer2:
Conformed dimentions are dimensions which are common to the cubes.(cubes are the schemas
contains facts and dimension tables)
Consider Cube-1 contains F1,D1,D2,D3 and Cube-2 contains F2,D1,D2,D4 are the Facts and
Dimensions here D1,D2 are the Conformed Dimensions
Conventional Load:
Before loading the data, all the Table constraints will be checked against the data.
Conformed dimensions are the dimensions which can be used across multiple Data Marts in
combination with multiple facts tables accordingly
Data Marts are designed to help manager make strategic decisions about their business.
Data Marts are subset of the corporate-wide data that is of value to a specific group of users.
1.Independent data marts – sources from data captured form OLTP system, external providers or
from data generated locally within a particular department or geographic area.
Level of granularity means level of detail that you put into the fact table in a data warehouse. For
example: Based on design you can decide to put the sales data in each transaction. Now, level of
granularity would mean what detail are you willing to put for each transactional fact. Product sales
with respect to each minute or you want to aggregate it upto minute and put that data.
Most dimension tables are designed using Normalization principles upto 2NF. In some instances they
are further normalized to 3NF.
Determine how to maintain changes to this dimension (see more on this in the next section).
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the other
types of clustered/non-clustered, unique/non-unique indexes.
Snowflake Schema, each dimension has a primary dimension table, to which one or more additional
dimensions can join. The primary dimension table is the only table that can join to the fact table