Unit 2 DW&DM Notes Mr. Rohit Pratap Singh

Unit 2

DATA WAREHOUSING SCHEMAS

A schema is a structure that represents how entities and attributes
are interconnected in a database. Like other databases, data
warehouses also use schemas. An operational database typically uses
the relational model, whereas a data warehouse uses star, snowflake,
and fact constellation schemas.
Schemas in Data Warehousing:

a. Star
b. Snowflake
c. Fact Constellation

[Star and snowflake schemas were already studied in Chapter 1.]

FACT CONSTELLATIONS

1. It also helps in representing the multidimensional model.
2. It is a collection of multiple fact tables that share some
common dimension tables.
3. It can be viewed as a collection of several star schemas and
hence is also known as a galaxy schema.
4. It is more complex than the star and snowflake schemas.
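
The points above can be sketched with a small in-memory SQLite database. This is only an illustration under assumed names: the tables dim_time, dim_item, fact_sales and fact_shipping and all their rows are hypothetical. It shows two fact tables sharing the same dimension tables, which is what makes the schema a fact constellation.

```python
import sqlite3

# Hypothetical galaxy schema: two fact tables (sales and shipping)
# share the dimension tables dim_time and dim_item.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    units_sold INTEGER
);
CREATE TABLE fact_shipping (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    units_shipped INTEGER
);
INSERT INTO dim_time VALUES (1, 2023), (2, 2024);
INSERT INTO dim_item VALUES (10, 'pen'), (20, 'book');
INSERT INTO fact_sales VALUES (1, 10, 5), (2, 10, 7), (2, 20, 3);
INSERT INTO fact_shipping VALUES (1, 10, 4), (2, 10, 7), (2, 20, 2);
""")


def totals_by_item(connection):
    """Aggregate each fact table separately, then line them up by item."""
    query = """
    SELECT i.item_name,
           (SELECT SUM(s.units_sold) FROM fact_sales s
             WHERE s.item_key = i.item_key),
           (SELECT SUM(h.units_shipped) FROM fact_shipping h
             WHERE h.item_key = i.item_key)
    FROM dim_item i
    ORDER BY i.item_name
    """
    return connection.execute(query).fetchall()


result = totals_by_item(conn)
print(result)  # [('book', 3, 2), ('pen', 12, 11)]
```

Each fact table is aggregated on its own before the results are combined, because joining two fact tables directly would multiply rows.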

Client/Server Computing Model and Data Warehousing

In client/server computing, the client requests a resource and the
server provides that resource. A server may serve multiple clients
at the same time, while a client is in contact with only one server.
The client and server usually communicate over a computer
network, but sometimes they may reside on the same system.

• Client/server computing works on a system of request and
response. The client sends a request to the server and the
server responds with the desired information.
• The client and server should follow a common
communication protocol so they can easily interact with
each other. These protocols operate at the application layer.
• A server can only accommodate a limited number of client
requests at a time, so it uses a priority-based system to
respond to the requests.
• Denial-of-service attacks hinder a server's ability to respond
to authentic client requests by inundating it with false
requests.
• An example of a client/server computing system is a web
server. It returns web pages to the clients that requested
them.
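
The request/response cycle and the priority-based handling described above can be sketched in a few lines. This is a toy, in-process model and not a real network server: the class PriorityServer, the page names and the priority numbers are all assumptions made for illustration.

```python
import heapq
import itertools


class PriorityServer:
    """Toy server that answers queued requests in priority order."""

    def __init__(self, pages):
        self.pages = pages                 # resource name -> content
        self._queue = []                   # heap of (priority, arrival, client, resource)
        self._arrival = itertools.count()  # tie-breaker: first come, first served

    def request(self, client, resource, priority):
        # A lower number means a more urgent request.
        heapq.heappush(self._queue,
                       (priority, next(self._arrival), client, resource))

    def serve_all(self):
        responses = []
        while self._queue:
            _, _, client, resource = heapq.heappop(self._queue)
            body = self.pages.get(resource, "404 Not Found")
            responses.append((client, body))
        return responses


server = PriorityServer({"/index": "Home page", "/about": "About page"})
server.request("client-A", "/about", priority=2)
server.request("client-B", "/index", priority=1)
server.request("client-C", "/missing", priority=2)
responses = server.serve_all()
print(responses)
# [('client-B', 'Home page'), ('client-A', 'About page'), ('client-C', '404 Not Found')]
```

The arrival counter keeps requests of equal priority in the order in which they were received.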

Hardware and Operating Systems for Data Warehousing


Data warehouses are normally very concerned with I/O performance. This is in contrast
to OLTP systems.

Hardware Considerations
1. Processing Power:
• Multi-core processors: Data warehouses benefit from parallel processing capabilities
offered by multi-core CPUs.
• High clock speeds: Faster processors can handle complex queries and data
transformations more efficiently.
2. Memory (RAM):
• Sufficient RAM allows for faster data access and query processing.
• In-memory processing: Consider systems with large amounts of RAM or in-memory
databases for improved performance.
3. Storage:
• High-performance storage: Use solid-state drives (SSDs) or high-speed storage arrays
for fast data access.
• Scalable storage: Ensure the storage system can scale with growing data volumes.
4. Network:
• High-speed network connections: Fast network infrastructure minimizes data transfer
latency between components of the data warehouse architecture.
• Redundancy: Implement redundant network connections to ensure high availability
and fault tolerance.
5. Scalability:
• Scalable architecture: Choose hardware that supports horizontal scaling to
accommodate growing data and user loads.
• Distributed processing: Consider distributed computing frameworks like Apache
Hadoop or Spark for scalable processing.
6. Data Redundancy and Fault Tolerance:
• RAID configurations: Use RAID (Redundant Array of Independent Disks) for data
redundancy and fault tolerance.
• Backup systems: Implement regular backups and disaster recovery solutions to protect
against data loss.
7. Hardware Acceleration:
• GPU acceleration: Graphics processing units (GPUs) can accelerate certain data
processing tasks, such as machine learning algorithms and complex analytics.
Operating System Considerations
1. Compatibility:
• Ensure compatibility with the chosen database management system (DBMS) and other
software components of the data warehouse stack.
2. Performance:
• Choose operating systems known for stability, performance, and reliability.
• Linux distributions like CentOS, Red Hat Enterprise Linux (RHEL), or Ubuntu Server are
popular choices for data warehousing due to their stability and performance.
3. Security:
• Select an operating system with robust security features and regular updates to protect
against vulnerabilities.
• Implement access controls, firewalls, and encryption to secure data and infrastructure.
4. Manageability:
• Choose an operating system with robust management tools and support for
automation.
• Consider systems with centralized management capabilities for easier administration of
multiple servers.
5. Compatibility with Tools and Software:
• Ensure compatibility with data warehousing software, ETL tools, monitoring tools, and
other components of the data warehouse ecosystem.
6. Scalability and Resource Management:
• Operating systems should support resource management features like process
scheduling, memory management, and disk I/O optimization to ensure efficient
resource utilization.
7. Virtualization and Containerization:
• Consider virtualization or containerization technologies like VMware, Docker, or
Kubernetes for flexible deployment and resource allocation.
Warehousing Strategy
WAREHOUSE MANAGEMENT AND SUPPORT PROCESS

Warehouse management involves planning, designing, developing,
implementing, and maintaining a data warehouse project in order to
manage the warehouse and subsequent activities efficiently.
Warehouse management involves six processes:

1. Receiving: Receiving is the process of transferring the
goods to the warehouse. The warehouse must be able to
verify that it has received the right product in the right quantity, in the
right condition, and at the right time. This process must be
performed correctly to ensure the correctness of subsequent actions.

2. Put Away: Put-away is the second warehouse process and
is the movement of goods from the receiving area
to the most optimal warehouse storage location.
Failing to place goods in their ideal location can
impair the productivity of the warehouse operation. When
goods are put away properly, there are several benefits:
cargo is stored faster and more efficiently, travel time is
minimised, safety of goods is ensured, and warehouse
space utilisation is maximised.

3. Storage: Storage is the warehouse process in which goods
are placed into their most appropriate storage space. When
done properly, the storage process maximises the
available space in your warehouse and increases labor
efficiency.

4. Picking: Picking is the warehouse process that collects
products in a warehouse to fulfil customer orders.
Correctness of this process is required to achieve higher
accuracy, as errors can have a direct impact on
customer satisfaction.

5. Packing: Packing is the warehouse process that
consolidates picked items in a sales order and prepares them
for shipment to the customer. One of the primary tasks of
packing is to ensure that damage is minimised from the
time items leave the warehouse.

6. Shipment: Shipping is the final warehouse process and
the start of the journey of goods from the warehouse to the
customer. Shipping is considered successful only if the
right order is sorted and loaded, is dispatched to the right
customer, travels through the right transit mode, and is
delivered safely and on time.
WAREHOUSE PLANNING AND IMPLEMENTATION

Planning a warehouse involves the following steps:

a. Define your inventory: Define the products, types of
products, and quantity of products you will store in the warehouse.
b. Determine your storage needs: Decide the types of storage
required, such as racks and shelves.
c. Assess your space: Measure the physical space
available and allocate it to storage, receiving, and
shipping areas.
d. Evaluate your equipment needs: Identify the equipment required
to handle products more efficiently, such as conveyors.
e. Plan your layout: Plan a layout that maximises storage space.
f. Establish processes and procedures: Decide on the
processes and procedures for receiving, storing, and shipping
products, and train your staff on these procedures.

Implementing a warehouse requires the following steps:

b. Requirement analysis and capacity planning: Defining
enterprise needs, defining architectures, carrying out capacity
planning, and selecting hardware and software tools.
c. Hardware Integration: Once the hardware and software have
been selected, they need to be put in place by integrating the servers,
the storage methods, and the user software tools.
d. Modelling: Modelling is a significant stage that involves
designing the warehouse schema and views.
e. Sources: The information for the data warehouse is likely to
come from several data sources. So, defining all the sources
falls here.
f. ETL: The data from the source systems will need to go
through an ETL phase. The process of designing and
implementing the ETL phase may include identifying
suitable ETL tool vendors and purchasing and implementing
the tools.
g. Populating the data warehouses: Once the ETL tools have
been agreed upon, testing the tools will be needed, perhaps
using a staging area. Once everything is working adequately,
the ETL tools may be used to populate the warehouses
(adding products in a warehouse) given the schema and view
definitions.

PARALLEL PROCESSORS AND CLUSTER SYSTEM

Parallel Processors:
a. Two or more processors work together to achieve a single task.
b. One task is divided into multiple subtasks and each subtask is handled by a
different processor. In this way multiple processors work on different
parts of a single task to complete it.
c. Each processor operates normally and performs operations in
parallel as instructed.
d. At the end, the results from all the processors are combined to produce the
end result.
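
The divide, process, and combine idea above can be sketched as follows. This is a minimal sketch and not a tuned implementation: the helper names split and parallel_sum_of_squares are hypothetical, and worker threads stand in for separate processors (in CPython, threads do not usually give true CPU parallelism, so a process pool would be the closer analogue).

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide one task (a list) into roughly equal subtasks."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """Work done by one worker on its own part of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    # Combine the partial results into the end result.
    return sum(partial_results)


total = parallel_sum_of_squares(list(range(1, 101)))
print(total)  # 338350
```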

Cluster System:
a. Two or more computers work together to provide high speed. Each
computer is known as a node.
b. All the computers work together so that they appear to be a single entity.
The connected computers execute operations together,
thus creating the idea of a single system.
c. Each computer is connected to the other nodes using a LAN, and all perform
their own separate tasks.
d. All the tasks get executed at a fast pace with multiple nodes.
DISTRIBUTED DATABASE SYSTEMS

a. A distributed database is a database that is distributed
across numerous sites, i.e., on various computers or over a network
of computers, and is not restricted to a single system.
b. A distributed database system is spread across several locations with
no common physical components.
c. This can be necessary when different people from all over the world
need to access a certain database.
d. It is of two types:
   i. Homogeneous: A homogeneous database stores data uniformly
   across all locations. All sites use the same operating system,
   database management system, and data structures. They are therefore
   simple to handle.
   ii. Heterogeneous: In a heterogeneous distributed database, different
   locations may employ different software and schemas, which may
   cause issues with queries and transactions. Moreover, one site might
   not even be aware of the existence of the other sites. Different
   operating systems and database applications may be used by different
   machines. Translations are therefore necessary for communication
   across sites.

Warehousing Software

1. Odoo – The best retail inventory management software
2. NetSuite – The best warehouse management system for e-commerce
3. Infoplus – The best warehouse management software for small business
4. HighJump – The best warehouse management software with Agile solution
5. Blue Link ERP – The best warehouse management software for medium-size
businesses
Problem Solving In AI
Problem solving in AI means solving a problem by applying logical algorithms, using polynomial and
differential equations, and executing them using modeling paradigms. Typical problems include:

1. Chess
2. Tower of Hanoi Problem
3. N-Queen Problem
4. Travelling Salesman Problem
5. Water-Jug Problem

The process of solving a problem consists of five steps:

1. Defining the problem
2. Analyzing the problem
3. Identification of solutions
4. Choosing a solution
5. Implementation
Importance of Artificial Intelligence
1. Making our lives easier
2. Speed up your tasks and processes of work
3. Accuracy
4. Fully-utilized Data

Problem Solving In AI Examples


1. Tower of Hanoi Problem
The Tower of Hanoi is also called the problem of Benares Temple, the Tower of
Brahma, or Lucas' Tower.

The objective of the puzzle is to move the entire stack to the last rod, obeying
the following rules:

1. Only one disk may be moved at a time.
2. Each move consists of taking the upper disk from one of the stacks
and placing it on top of another stack or on an empty rod.
3. No disk may be placed on top of a disk that is smaller than it.
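
The rules above lead to the classic recursive solution, sketched here. The rod names "A", "B" and "C" are arbitrary labels chosen for illustration. To move n disks, the top n - 1 disks are first moved to the spare rod, the largest disk is moved to the target rod, and the n - 1 disks are then moved on top of it.

```python
def hanoi(n, source="A", spare="B", target="C"):
    """Return the list of (from_rod, to_rod) moves for n disks."""
    if n == 0:
        return []
    return (hanoi(n - 1, source, target, spare)    # clear the way
            + [(source, target)]                   # move the largest disk
            + hanoi(n - 1, spare, source, target)) # restack on top of it


moves = hanoi(3)
print(len(moves))  # 7
print(moves[0])    # ('A', 'C')
```

For n disks this produces 2^n - 1 moves, so 3 disks need 7 moves.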
2. N-Queen Problem
The board is an n x n chessboard, for example a 4 x 4, 8 x 8, or 16 x 16 chessboard.

N Queen is the problem of placing N chess queens on an N×N chessboard so
that no two queens attack each other:

1. No two queens share the same row.
2. No two queens share the same column.
3. No two queens share the same diagonal.

Examples: 1. the 4-Queen problem on a 4 x 4 chessboard; 2. the 8-Queen problem on an 8 x 8 chessboard.
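
A common way to solve the N-Queen problem is backtracking, sketched below as one possible approach. One queen is placed per row, so the row rule holds automatically; the sets cols, diag1 and diag2 (names chosen here for illustration) track the columns and diagonals that are already under attack.

```python
def solve_n_queens(n):
    """Return every solution as a tuple of column indexes, one per row."""
    solutions = []
    cols, diag1, diag2 = set(), set(), set()
    placement = []

    def place(row):
        if row == n:
            solutions.append(tuple(placement))
            return
        for col in range(n):
            # Skip squares attacked along a column or a diagonal.
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue
            cols.add(col)
            diag1.add(row - col)
            diag2.add(row + col)
            placement.append(col)
            place(row + 1)
            # Backtrack: undo the placement and try the next column.
            placement.pop()
            cols.remove(col)
            diag1.remove(row - col)
            diag2.remove(row + col)

    place(0)
    return solutions


four = solve_n_queens(4)
print(four)                    # [(1, 3, 0, 2), (2, 0, 3, 1)]
print(len(solve_n_queens(8)))  # 92
```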

3. Travelling Salesman Problem

The travelling salesman problem, also called the travelling salesperson problem
or TSP, asks the following question: "Given a list of cities and the distances
between each pair of cities, what is the shortest possible route that visits each
city exactly once and returns to the origin city?"
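
For a handful of cities the question can be answered by simply trying every route. The sketch below does exactly that; the distance matrix is made-up example data, and brute force is practical only for very small inputs because the number of routes grows factorially.

```python
from itertools import permutations


def shortest_tour(dist):
    """Try every tour that starts and ends at city 0; return the best."""
    n = len(dist)
    best_length, best_tour = None, None
    for order in permutations(range(1, n)):
        tour = (0,) + order + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        if best_length is None or length < best_length:
            best_length, best_tour = length, tour
    return best_length, best_tour


# Hypothetical symmetric distances between four cities.
distances = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
best_length, best_tour = shortest_tour(distances)
print(best_length, best_tour)  # 80 (0, 1, 3, 2, 0)
```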
4. Water-Jug Problem

Condition of the Water Jug Problem: the combined capacity of the two smaller jugs
must be greater than or equal to the capacity of the large jug.

Example 1
You have an 8-litre jug full of water and two smaller jugs, one that holds 5
litres and the other 3 litres. None of the jugs has markings on it, nor do you
have any additional measuring device.

Among the 3 jugs, divide the 8 litres into 2 equal parts, i.e. 4 litres in jug A and 4
litres in jug B. How?

Solution

Check the condition: the two smaller jugs together must hold at least as much as the large jug.

5 (litres) + 3 (litres) >= 8 (litres)


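Example 2

There are two jugs, one with the capacity to hold 3 gallons of water and the other with the
capacity to hold 4 gallons of water. There is no other measuring equipment
available and the jugs do not have any kind of marking on them.

How can you get exactly 2 gallons of water in the 4-gallon jug?

Solution

Each state is written as (x, y), where x is the amount in the 4-gallon jug and y
is the amount in the 3-gallon jug.

Step 1: (0,0)  Both jugs are empty.
Step 2: (1,3)  Fill the 4-gallon jug, then pour from it into the 3-gallon jug until that jug is full.
Step 3: (1,0)  Empty the 3-gallon jug.
Step 4: (0,1)  Pour the 1 gallon from the 4-gallon jug into the 3-gallon jug.
Step 5: (4,1)  Fill the 4-gallon jug.
Step 6: (2,3)  Pour from the 4-gallon jug into the 3-gallon jug until it is full; 2 gallons remain in the 4-gallon jug.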
Example 3

Solution
Check the condition: the two smaller jugs together must hold at least as much as the large jug.

8 (litres) + 5 (litres) >= 12 (litres)

Solved in 7 moves. Each state lists the litres in the 12-litre, 8-litre, and 5-litre jugs.

Step 1: 12,0,0
Step 2: 4,8,0
Step 3: 4,3,5
Step 4: 9,3,0
Step 5: 9,0,3
Step 6: 1,8,3
Step 7: 1,6,5
Step 8: 6,6,0
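
Three-jug puzzles like Examples 1 and 3 can also be solved mechanically by a breadth-first search over jug states. The sketch below is one possible approach, not the method used in the notes; the function name solve_jugs is hypothetical. It assumes the only allowed move is pouring from one jug into another until the source is empty or the destination is full.

```python
from collections import deque


def solve_jugs(capacities, start, goal):
    """Breadth-first search; returns the shortest list of states, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:   # walk back to the start state
                path.append(state)
                state = parent[state]
            return path[::-1]
        for src in range(len(state)):
            for dst in range(len(state)):
                if src == dst:
                    continue
                # Pour until the source is empty or the destination is full.
                amount = min(state[src], capacities[dst] - state[dst])
                if amount == 0:
                    continue
                nxt = list(state)
                nxt[src] -= amount
                nxt[dst] += amount
                nxt = tuple(nxt)
                if nxt not in parent:
                    parent[nxt] = state
                    queue.append(nxt)
    return None


# Example 1: 8-, 5- and 3-litre jugs, split 8 litres into 4 + 4.
path = solve_jugs((8, 5, 3), (8, 0, 0), (4, 4, 0))
for step, state in enumerate(path, start=1):
    print("Step", step, state)
```

For Example 1 the search finds a solution with 7 pours. Calling solve_jugs((12, 8, 5), (12, 0, 0), (6, 6, 0)) handles the jugs of Example 3 in the same way.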

Decision Tree
A decision tree is a flowchart-like structure used to make decisions or predictions.
It consists of decision nodes representing tests on attributes, branches
representing the outcomes of these tests, and leaf nodes representing final
outcomes or predictions.
Decision nodes are used to make a decision and have multiple branches,
whereas leaf nodes are the output of those decisions. A decision tree can handle
categorical data (YES/NO) as well as numeric data.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
A decision tree is a supervised learning technique.
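
The structure described above can be sketched as a small hand-built tree. The attributes (outlook, temperature), the 30-degree threshold and the YES/NO answers are all hypothetical; a real decision tree would be learned from training data rather than written by hand.

```python
# Decision nodes are dicts with a test and branches; leaf nodes are plain strings.
tree = {
    "test": lambda sample: sample["outlook"],               # categorical test
    "branches": {
        "sunny": {
            "test": lambda sample: "hot" if sample["temperature"] > 30 else "mild",  # numeric test
            "branches": {"hot": "NO", "mild": "YES"},
        },
        "rainy": "NO",
        "overcast": "YES",
    },
}


def predict(node, sample):
    """Follow the branches from the root until a leaf is reached."""
    while isinstance(node, dict):
        outcome = node["test"](sample)
        node = node["branches"][outcome]
    return node


print(predict(tree, {"outlook": "sunny", "temperature": 25}))  # YES
print(predict(tree, {"outlook": "sunny", "temperature": 35}))  # NO
print(predict(tree, {"outlook": "rainy", "temperature": 20}))  # NO
```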
Machine Learning
Machine learning (ML) is a subdomain of artificial intelligence (AI) that focuses on
developing systems that learn, or improve performance, based on the data they ingest.

Machine learning is the field of study that gives computers the capability to
learn without being explicitly programmed.
Issues in Machine Learning

1. Process Complexity of Machine Learning

2. Monitoring and maintenance

3. Inadequate Training Data


4. Poor quality of data
5. Customer Segmentation
History of Machine Learning
Some 40-50 years ago, machine learning was science fiction, but today
it is part of our daily life. Machine learning makes our day-to-day life easier,
from self-driving cars to Amazon's virtual assistant "Alexa". However, the idea behind
machine learning is old and has a long history. Below, some milestones are given
which have occurred in the history of machine learning:

Data Science Vs Machine Learning

Data Science                                  Machine Learning
--------------------------------------------  --------------------------------------------
A branch that deals with data.                Machines use data science techniques to
                                              learn about the data.

Involves many operations: data gathering,     It is of three types: 1. Supervised
data cleaning, data manipulation, etc.        learning, 2. Unsupervised learning,
                                              3. Reinforcement learning.

Needs the entire analytics universe.          Combination of machine and data science.

It is a broad term for multiple               It fits within data science.
disciplines.

Data scientists spend a lot of time           ML engineers spend a lot of time managing
handling the data, cleansing the data,        the complexities that occur during the
and understanding its patterns.               implementation of algorithms and the
                                              mathematical concepts behind them.

Example: Netflix uses data science            Example: Facebook uses machine learning
technology.                                   technology.
Framework for building ML Systems: KDD process model
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of
useful, previously unknown, and potentially valuable information from large datasets.
The focus is on the discovery of useful knowledge, rather than simply finding patterns in
data.
Techniques
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge representation and visualization

Advantages of KDD
1. Improves decision-making.
2. Increased efficiency
3. Better customer service
4. Fraud detection

Disadvantages of KDD
1. Privacy concerns
2. Complexity
3. Data Quality
4. High cost.

Supervised learning

Supervised learning is a learning mechanism that infers the underlying relationship
between the observed data (also called input data) and a target variable.

Classification: A classification problem is when the output variable is a category, such
as "red" or "blue", "disease" or "no disease".

Regression: A regression problem is when the output variable is a real value, such as
"dollars" or "weight".
Types:
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
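
As a small illustration of supervised classification, here is a sketch of K-NN from the list above. The training points and their "red"/"blue" labels are invented for the example. A new point is given the label that is most common among its k closest labelled neighbours.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query point by majority vote of its k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Labelled (input, target) pairs: this is what makes the learning supervised.
training_data = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.5), "red"),
    ((6.0, 6.0), "blue"), ((6.5, 7.0), "blue"), ((7.0, 6.5), "blue"),
]

print(knn_predict(training_data, (1.2, 1.4)))  # red
print(knn_predict(training_data, (6.4, 6.6)))  # blue
```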

Unsupervised Machine Learning


Unsupervised learning is a type of machine learning in which models are
trained using an unlabeled dataset and are allowed to act on that data without any
supervision.

We have the input data but no corresponding output data, so the desired
output is unknown.

Working of Unsupervised Learning


Working of unsupervised learning can be understood by the below diagram:

Here, we have taken unlabeled input data, which means it is not
categorized and corresponding outputs are also not given. Now, this unlabeled
input data is fed to the machine learning model in order to train it. First, the model
interprets the raw data to find hidden patterns in the data and then
applies a suitable algorithm such as k-means clustering or hierarchical clustering.
The algorithm then divides the data objects
into groups according to the similarities and differences between the objects.
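
The grouping step can be sketched with a bare-bones k-means. This is a simplified illustration: the data points are made up, and taking the first k points as the initial centroids is an assumption made to keep the example deterministic (practical implementations usually choose starting centroids at random).

```python
import math


def k_means(points, k, iterations=10):
    """Group unlabeled points into k clusters around moving centroids."""
    centroids = list(points[:k])          # assumption: first k points as seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


data = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, clusters = k_means(data, k=2)
print(clusters[0])  # [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5)]
print(clusters[1])  # [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
```

No labels are given to the algorithm; the two groups emerge only from the distances between the points.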

Unsupervised Learning algorithms:

Below is the list of some popular unsupervised learning algorithms:


1. K-means clustering
2. KNN (k-nearest neighbors), although it is normally treated as a supervised method
3. Hierarchical clustering
4. Anomaly detection
5. Neural Networks
6. Principal Component Analysis (PCA)
7. Independent Component Analysis
8. Apriori algorithm
9. Singular value decomposition

Disadvantages of Unsupervised Learning


1. It is more difficult than supervised learning.
2. The results may be less accurate.

Types of Unsupervised Learning Algorithm


Unsupervised learning algorithms can be further categorized into two
types of problems.
Clustering: Clustering is a method of grouping objects into clusters
such that the objects with the most similarities remain in one group and have few
or no similarities with the objects of another group.
Association: Association rule learning is an unsupervised learning method
which is used for finding relationships between variables in a large
database. It determines the sets of items that occur together in the dataset.
Association rules make marketing strategy more effective. For example, people
who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical example of association rules is Market Basket
Analysis.
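
The bread and butter example can be made concrete with the two standard measures of an association rule, support and confidence. The five shopping baskets below are invented for illustration. Support is the fraction of baskets that contain an itemset; confidence of X -> Y is the fraction of baskets containing X that also contain Y.

```python
# Hypothetical market baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """How often the consequent appears in baskets that hold the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


print(support({"bread", "butter"}, transactions))       # 3 of 5 baskets
print(confidence({"bread"}, {"butter"}, transactions))  # about 0.75
```

Here 3 of the 4 baskets that contain bread also contain butter, so the rule bread -> butter has a confidence of about 0.75.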

Differences between Classification and Clustering

Classification                                Clustering
--------------------------------------------  --------------------------------------------
Classification is used for supervised         Clustering is used for unsupervised
learning.                                     learning.

Classification is more complex.               Less complex; only grouping is done.

Input instances are classified based on       Instances are grouped based on their
their corresponding class labels.             similarity.

Two-step process (train + predict).           Single-step process.

Number of categories is known.                Number of groups is unknown.

Examples are:                                 Examples are:
1. Logistic regression                        1. k-means clustering
2. Naive Bayes classifier                     2. Fuzzy c-means
3. Support vector machines                    3. Gaussian (EM) clustering
Support Vector Machines (SVM)
Support Vector Machine, or SVM, is one of the most popular supervised
learning algorithms. It is used for classification as well as regression problems,
but primarily for classification problems in machine learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put a new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine.
The SVM algorithm can be used for face detection, image classification, and text
categorization.

Support Vectors – Data points that are closest to the hyperplane are called
support vectors. The separating line is defined with the help of these data points.
Hyperplane − As we can see in the above diagram, it is a decision plane or
space that separates sets of objects belonging to different classes.
Margin − It may be defined as the gap between the two lines through the closest data
points of different classes. It can be calculated as the perpendicular distance from the line to the
support vectors. A large margin is considered a good margin and a small margin is considered
a bad margin.
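
The margin definition above can be checked numerically. The sketch below does not train an SVM; it only measures the margin of a given separating line. The line x + y - 5 = 0 and the two small classes of points are assumptions chosen for illustration. The perpendicular distance of a point x from the hyperplane w.x + b = 0 is |w.x + b| / ||w||.

```python
import math


def distance_to_hyperplane(point, w, b):
    """Perpendicular distance from a point to the hyperplane w.x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, point)) + b) / math.hypot(*w)


def margin(class_a, class_b, w, b):
    """Gap between the closest points (support vectors) of the two classes."""
    nearest_a = min(distance_to_hyperplane(p, w, b) for p in class_a)
    nearest_b = min(distance_to_hyperplane(p, w, b) for p in class_b)
    return nearest_a + nearest_b


# Hypothetical separating line: x + y - 5 = 0.
w, b = (1.0, 1.0), -5.0
class_a = [(1.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
class_b = [(4.0, 4.0), (3.0, 3.0), (5.0, 3.0)]

gap = margin(class_a, class_b, w, b)
print(round(gap, 3))  # 1.414
```

In this data the points (2, 2), (1, 3) and (3, 3) lie closest to the line, so they play the role of the support vectors.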

Example: SVM can be understood with the example that we used in the
KNN classifier. Suppose we see a strange cat that also has some features of
dogs, and we want a model that can accurately identify whether it is a cat or a
dog. Such a model can be created by using the SVM algorithm. We
first train our model with lots of images of cats and dogs so that it can learn
about the different features of cats and dogs, and then we test it with this strange
creature. Because the SVM creates a decision boundary between these two
classes (cat and dog) and chooses the extreme cases (support vectors), it looks at the
extreme cases of cat and dog. On the basis of the support vectors, it will classify
the creature as a cat. Consider the below diagram.
Bayesian Network

1. It is used to solve problems that involve uncertainty.
2. It is also called a Bayes network, belief network, decision network,
or Bayesian model.
3. It represents conditional dependencies among variables using a directed acyclic graph.
4. It can be used with supervised learning algorithms.
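
A tiny network makes the idea of reasoning under uncertainty concrete. In this sketch the graph is Rain -> WetGrass <- Sprinkler, and every probability in the tables is a made-up number used only for illustration. The joint probability factorises along the directed acyclic graph, and the posterior P(Rain | WetGrass) is obtained by summing over the unobserved variable.

```python
from itertools import product

# Hypothetical probability tables for the graph Rain -> WetGrass <- Sprinkler.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.1, False: 0.9}
p_wet_given = {            # P(WetGrass = True | Rain, Sprinkler)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """P(Rain, Sprinkler, WetGrass) factorised along the graph."""
    p_wet = p_wet_given[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (p_wet if wet else 1 - p_wet)


def posterior_rain_given_wet():
    """P(Rain = True | WetGrass = True) by enumeration."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / evidence


posterior = posterior_rain_given_wet()
print(round(posterior, 3))  # 0.74
```

Seeing wet grass raises the belief in rain from the prior of 0.2 to roughly 0.74 under these assumed numbers.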

ANN (Artificial Neural Network)


An artificial neural network is a computational network modeled on the
biological neural networks that make up the structure of the human brain.

It can be used with unsupervised as well as supervised learning algorithms.

An artificial neural network primarily consists of three layers: the input layer, the hidden layer, and the output layer.
Genetic Algorithm

1. A genetic algorithm (GA) is a general heuristic search method designed for finding the
optimal (or a near-optimal) solution to a problem.
2. It is used together with supervised learning algorithms, for example to tune them.
3. It uses operators such as selection, crossover, and mutation.

GA applications
Some examples of GA applications include:
1. Optimizing decision trees for better performance
2. Solving sudoku puzzles
3. Hyperparameter optimization
4. Causal inference

The algorithm starts with a set of trial structures, or parents, and uses their fitness to
create a new generation, or offspring.
GAs are well-suited for problems with large search spaces, or when the fitness function
is noisy. They can be competitive with other methods, and can be implemented in
parallel.
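
The loop of selection, crossover and mutation can be sketched on a toy problem. The task here, often called OneMax, is to evolve a bit string with as many 1s as possible; the problem, the parameter values and the function name genetic_one_max are all assumptions chosen for illustration. The best individual is copied unchanged into each new generation (elitism), so the best fitness never decreases.

```python
import random


def genetic_one_max(length=20, population_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    fitness = sum                      # fitness = number of 1 bits
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]   # selection: fitter half
        children = [population[0][:]]                   # elitism: keep the best
        while len(children) < population_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)
            child = mum[:cut] + dad[cut:]               # crossover
            if rng.random() < 0.1:                      # mutation: flip one bit
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, initial_best


best, initial_best = genetic_one_max()
print(sum(best), "ones, starting from a best of", initial_best)
```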
Reinforcement Learning

1. Reinforcement learning is a feedback-based machine learning technique in
which an agent learns to behave in an environment by performing actions and
seeing the results of those actions. For each good action, the agent gets positive
feedback, and for each bad action, the agent gets negative feedback or a penalty.
2. It is used for complex problems that are hard to solve with supervised or unsupervised learning.
3. There is no labeled data, so the agent is bound to learn from its experience only.
4. Applications include game playing and robotics.
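
The feedback loop above can be sketched with tabular Q-learning, one common reinforcement learning method, on a toy environment. Everything about the environment is an assumption made for illustration: a corridor of 5 cells, a reward of +1 for reaching the last cell, and a penalty of -1 for walking into the wall at the first cell. The agent is never told the right answer; it learns only from the rewards it receives.

```python
import random


def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    goal = n_states - 1
    # q[state][action]; action 0 = left, action 1 = right
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            action = rng.randrange(2)          # explore by acting at random
            if action == 0 and state == 0:
                nxt, reward = 0, -1.0          # bad action: penalty for hitting the wall
            else:
                nxt = state - 1 if action == 0 else state + 1
                reward = 1.0 if nxt == goal else 0.0   # good outcome: positive feedback
            # Move the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q_table = train_q_table()
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)  # ['right', 'right', 'right', 'right']
```

After training, the learned policy moves right in every cell, which is the shortest way to the goal.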
