Unit 2 DW&DM Notes Mr. Rohit Pratap Singh
a. Star
b. Snowflake
c. Fact Constellations
FACT CONSTELLATIONS
Hardware Considerations
1. Processing Power:
• Multi-core processors: Data warehouses benefit from parallel processing capabilities
offered by multi-core CPUs.
• High clock speeds: Faster processors can handle complex queries and data
transformations more efficiently.
2. Memory (RAM):
• Sufficient RAM allows for faster data access and query processing.
• In-memory processing: Consider systems with large amounts of RAM or in-memory
databases for improved performance.
3. Storage:
• High-performance storage: Use solid-state drives (SSDs) or high-speed storage arrays
for fast data access.
• Scalable storage: Ensure the storage system can scale with growing data volumes.
4. Network:
• High-speed network connections: Fast network infrastructure minimizes data transfer
latency between components of the data warehouse architecture.
• Redundancy: Implement redundant network connections to ensure high availability
and fault tolerance.
5. Scalability:
• Scalable architecture: Choose hardware that supports horizontal scaling to
accommodate growing data and user loads.
• Distributed processing: Consider distributed computing frameworks like Apache
Hadoop or Spark for scalable processing.
6. Data Redundancy and Fault Tolerance:
• RAID configurations: Use RAID (Redundant Array of Independent Disks) for data
redundancy and fault tolerance.
• Backup systems: Implement regular backups and disaster recovery solutions to protect
against data loss.
7. Hardware Acceleration:
• GPU acceleration: Graphics processing units (GPUs) can accelerate certain data
processing tasks, such as machine learning algorithms and complex analytics.
Operating System Considerations
1. Compatibility:
• Ensure compatibility with the chosen database management system (DBMS) and other
software components of the data warehouse stack.
2. Performance:
• Choose operating systems known for stability, performance, and reliability.
• Linux distributions like CentOS, Red Hat Enterprise Linux (RHEL), or Ubuntu Server are
popular choices for data warehousing due to their stability and performance.
3. Security:
• Select an operating system with robust security features and regular updates to protect
against vulnerabilities.
• Implement access controls, firewalls, and encryption to secure data and infrastructure.
4. Manageability:
• Choose an operating system with robust management tools and support for
automation.
• Consider systems with centralized management capabilities for easier administration of
multiple servers.
5. Compatibility with Tools and Software:
• Ensure compatibility with data warehousing software, ETL tools, monitoring tools, and
other components of the data warehouse ecosystem.
6. Scalability and Resource Management:
• Operating systems should support resource management features like process
scheduling, memory management, and disk I/O optimization to ensure efficient
resource utilization.
7. Virtualization and Containerization:
• Consider virtualization or containerization technologies like VMware, Docker, or
Kubernetes for flexible deployment and resource allocation.
Warehousing Strategy
WAREHOUSE MANAGEMENT AND SUPPORT PROCESS
Parallel Processors:
a. Two or more processors work together to achieve a single task.
b. One task is divided into multiple sub-tasks, each handled by a
different processor; in this way multiple processors work on different
parts of a single task to complete it.
c. Each processor operates normally and performs its operations in
parallel as instructed.
d. At the end, the results from all the processors are combined to
produce the end result.
Distributed Databases:
a. A distributed database can be necessary when different people from
all over the world need to access a certain database.
b. It is of 2 types:
i. Homogeneous: A homogeneous distributed database stores data
uniformly across all locations. All sites use the same operating
system, database management system, and data structures. They are
therefore simple to handle.
ii. Heterogeneous: With a heterogeneous distributed database, different
locations may employ different software and schemas, which may cause
issues with queries and transactions. Moreover, one site may not even
be aware of the existence of the other sites. Different machines may
use different operating systems and database applications, so
translations are necessary for communication across the various sites.
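The parallel-processing idea described above (divide one task among workers, then combine the partial results) can be sketched in Python; the summing task and the four-worker split below are illustrative choices, not from the notes:

```python
# A minimal sketch of parallel processing: one task (summing a list) is
# split into chunks, each chunk is handled by a different worker, and
# the partial results are combined into the end result.
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(numbers, workers=4):
    # Divide the single task into one sub-task per worker.
    chunk = (len(numbers) + workers - 1) // workers
    parts = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    # Each worker processes its part in parallel...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, parts))
    # ...and the partial results are combined to give the end result.
    return sum(partials)

print(parallel_sum(list(range(1, 101))))  # → 5050
```

The same pattern scales to heavier sub-tasks; for CPU-bound work a process pool would typically replace the thread pool.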
Warehousing Software
1. Odoo –
2. NetSuite –
3. Infoplus –
1. Chess.
2. Tower of Hanoi Problem
3. N-Queen Problem
4. Travelling Salesman Problem.
5. Water-Jug Problem.
The objective of the puzzle is to move the entire stack to the last rod, obeying
the following rules: only one disk may be moved at a time; each move takes the
top disk from one stack and places it on top of another stack or an empty rod;
and no disk may ever be placed on top of a smaller disk.
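The Tower of Hanoi has a classic recursive solution: move the top n−1 disks to the spare rod, move the largest disk to the target rod, then move the n−1 disks on top of it. A minimal sketch:

```python
# Recursive Tower of Hanoi: returns the list of (source, target) moves
# needed to shift n disks from `source` to `target` using `spare`.
def hanoi(n, source, target, spare, moves=None):
    if moves is None:
        moves = []
    if n == 1:
        moves.append((source, target))   # move the single disk directly
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park n-1 disks on spare
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # bring the n-1 disks back
    return moves

moves = hanoi(3, 'A', 'C', 'B')
print(len(moves))  # → 7, i.e. 2**3 - 1 moves
```

In general n disks need 2^n − 1 moves, which is also the minimum.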
4 x 4 chessboard
8 x 8 chessboard
16 x 16 chessboard
1. 4 Queen problem.
N Queen is the problem of placing N chess queens on an N×N chessboard so
that no two queens attack each other.
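The standard way to solve N-Queens is backtracking: place one queen per row, and only keep columns and diagonals that no earlier queen attacks. A sketch:

```python
# Backtracking N-Queens: cols[r] holds the column of the queen in row r.
# A new queen is legal if no earlier queen shares its column or diagonal.
def solve_n_queens(n):
    solutions = []

    def place(row, cols):
        if row == n:                      # all n queens placed
            solutions.append(list(cols))
            return
        for col in range(n):
            if all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(cols)):
                cols.append(col)          # place queen, try next row
                place(row + 1, cols)
                cols.pop()                # backtrack

    place(0, [])
    return solutions

print(len(solve_n_queens(4)))  # → 2  (the 4-Queen problem)
print(len(solve_n_queens(8)))  # → 92 (the 8-Queen problem)
```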
Example 1
You have an 8 litre jug full of water and two smaller jugs, one that contains 5
litres and the other 3 litres. None of the jugs have markings on them, nor do you
have any additional measuring device.
Using the three jugs, divide the 8 litres into two equal parts, i.e. 4 litres
in jug A and 4 litres in jug B. How?
Solution
Example 2
Two jugs, one having the capacity to hold 3 gallons of water and the other
having the capacity to hold 4 gallons of water. There is no other measuring
equipment available, and the jugs do not have any kind of marking on them.
How can you measure exactly 2 gallons of water into the 4-gallon jug?
Solution
Step 1
(0,0)
Step 2
(4,0)
Step 3
(1,3)
Step 4
(1,0)
Step 5
(0,1)
Step 6
(4,1)
Step 7
(2,3)
Example 3
A 12-litre jug is full of water; an 8-litre jug and a 5-litre jug are empty.
Divide the 12 litres into two equal parts of 6 litres each.
Solution
Condition of the Water Jug Problem
The combined capacity of the two smaller jugs must be greater than or equal to
the capacity of the large jug.
Solved in 7 moves (8 states, including the start):
Step 1
12,0,0
Step 2
4,8,0
Step 3
4,3,5
Step 4
9,3,0
Step 5
9,0,3
Step 6
1,8,3
Step 7
1,6,5
Step 8
6,6,0
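Jug puzzles like the ones above can be solved mechanically with a breadth-first search over jug states. The sketch below assumes only jug-to-jug pours (the total water is fixed, as in Examples 1 and 3); puzzles that also allow filling or emptying a jug from a tap, like the 4-/3-gallon example, would need those moves added:

```python
# Breadth-first search over jug states: each move pours one jug into
# another until the source is empty or the target is full. Returns the
# shortest list of states from `start` to a goal state, or None.
from collections import deque

def solve_jugs(capacities, start, is_goal):
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if is_goal(state):
            return path
        for i in range(len(state)):
            for j in range(len(state)):
                if i == j:
                    continue
                # Pour jug i into jug j.
                amount = min(state[i], capacities[j] - state[j])
                if amount == 0:
                    continue
                nxt = list(state)
                nxt[i] -= amount
                nxt[j] += amount
                nxt = tuple(nxt)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [nxt]))
    return None

# Example 3: split 12 litres into 6 + 6 using 12-, 8- and 5-litre jugs.
path = solve_jugs((12, 8, 5), (12, 0, 0), lambda s: sorted(s) == [0, 6, 6])
print(path[-1])  # → (6, 6, 0)
```

The same call with capacities `(8, 5, 3)`, start `(8, 0, 0)` and goal "4 litres in each of the two larger jugs" solves Example 1.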
Decision Tree
A decision tree is a flowchart-like structure used to make decisions or
predictions. It is a supervised learning technique, and a graphical
representation for getting all the possible solutions to a problem/decision
based on given conditions.
It consists of nodes representing decisions or tests on attributes, branches
representing the outcomes of these decisions, and leaf nodes representing
final outcomes or predictions. Decision nodes are used to make a decision and
have multiple branches, whereas leaf nodes are the outputs of those decisions.
A decision tree can handle categorical data (YES/NO) as well as numeric data.
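The structure just described can be sketched directly in Python: decision nodes test an attribute, branches carry the outcome of the test, and leaf nodes hold the prediction. The "play tennis"-style attributes below are illustrative, not from the notes:

```python
# A decision tree as nested dicts: dict nodes are decision nodes,
# plain strings are leaf nodes holding the final prediction.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"attribute": "wind",
                 "branches": {"strong": "no", "weak": "yes"}},
    },
}

def predict(node, sample):
    # Follow the branch matching the sample's value until a leaf is hit.
    while isinstance(node, dict):
        value = sample[node["attribute"]]
        node = node["branches"][value]
    return node

print(predict(tree, {"outlook": "sunny", "humidity": "normal"}))  # → yes
print(predict(tree, {"outlook": "rain", "wind": "strong"}))       # → no
```

A learning algorithm (e.g. ID3/CART) would build such a tree automatically from labelled data; here the tree is written by hand to show the structure.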
Machine Learning
Machine learning (ML) is a subdomain of artificial intelligence (AI) that focuses on
developing systems that learn—or improve performance—based on the data they ingest
Machine Learning is the field of study that gives computers the capability to
learn without being explicitly programmed
Issues in Machine Learning
Data scientists spend a lot of time handling the data, cleansing the data, and
understanding its patterns. ML engineers spend a lot of time managing the
complexities that occur during the implementation of algorithms and the
mathematical concepts behind them.
Advantages of KDD
1. Improves decision-making.
2. Increased efficiency
3. Better customer service
4. Fraud detection
Disadvantages of KDD
1. Privacy concerns
2. Complexity
3. Data Quality
4. High cost.
Supervised learning
Regression: A regression problem is when the output variable is a real value,
such as "dollars" or "weight".
Types:-
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K−NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
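Of the algorithms listed, K-NN is simple enough to sketch in a few lines: classify a query point by majority vote among its k nearest training points. The tiny 2-D dataset below is illustrative, not from the notes:

```python
# Minimal k-nearest-neighbours classifier: sort the training points by
# Euclidean distance to the query and take a majority vote of the
# labels among the k closest.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    # train: list of ((x, y), label) pairs.
    neighbours = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
print(knn_predict(train, (2, 2)))  # → A
print(knn_predict(train, (6, 5)))  # → B
```

Note that K-NN has no real training phase: all the work happens at prediction time, which is why it is sometimes called a lazy learner.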
Unsupervised learning
We have input data but no corresponding output data; the desired output is
unknown.
3. Hierarchal clustering
4. Anomaly detection
5. Neural Networks
8. Apriori algorithm
Classification vs. Clustering: classification works in two phases
(train + predict), whereas clustering works on unlabelled data.
The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
a new data point in the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed a Support Vector Machine.
The SVM algorithm can be used for face detection, image classification, and
text categorization.
Support Vectors – Data points that are closest to the hyperplane are called
support vectors. The separating line is defined with the help of these data
points.
Hyperplane – A decision plane or space that is divided between a set of
objects having different classes.
Margin – The gap between the two lines on the closest data points of different
classes. It can be calculated as the perpendicular distance from the line to
the support vectors. A large margin is considered a good margin and a small
margin is considered a bad margin.
Example: SVM can be understood with the example that we used for the KNN
classifier. Suppose we see a strange cat that also has some features of dogs.
If we want a model that can accurately identify whether it is a cat or a dog,
such a model can be created using the SVM algorithm. We first train our model
with many images of cats and dogs so that it can learn their different
features, and then we test it with this strange creature. The SVM creates a
decision boundary between the two classes (cat and dog) and chooses the
extreme cases (support vectors) of each. On the basis of the support vectors,
it will classify the creature as a cat.
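The hyperplane and margin ideas above can be made concrete in a few lines. This is not a full SVM trainer: the weights of the hyperplane w·x + b = 0 are hand-picked for illustration (a real SVM would optimise them to maximise the margin), and the "cat"/"dog" labels echo the example:

```python
# Classify points by the sign of w.x + b, and find the support vector
# as the training point closest to the hyperplane x + y - 6 = 0.
import math

w, b = (1.0, 1.0), -6.0   # hand-picked hyperplane, NOT learned

def classify(point):
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return "cat" if score < 0 else "dog"

def distance(point):
    # Perpendicular distance from the point to the hyperplane.
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return abs(score) / math.hypot(*w)

points = [(1, 1), (2, 3), (5, 4), (6, 6)]
support = min(points, key=distance)   # closest point = a support vector
print(classify((1, 1)), classify((6, 6)), support)
```

The margin of this hyperplane is twice the distance to the nearest point; training an SVM means choosing w and b so that this margin is as large as possible while still separating the classes.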
Bayesian Network
Genetic Algorithm
1. A genetic algorithm (GA) is a general heuristic search method designed for
finding the optimal solution to a problem.
2. It is inspired by natural selection and belongs to the family of
evolutionary algorithms.
3. It uses operators such as selection, crossover, and mutation.
GA applications
Some examples of GA applications:
1. Optimizing decision trees for better performance,
2. Solving sudoku puzzles,
3. Hyperparameter optimization,
4. Causal inference.
The algorithm starts with a set of trial structures, or parents, and uses their fitness to
create a new generation, or offspring
GAs are well-suited for problems with large search spaces, or when the fitness function
is noisy. They can be competitive with other methods, and can be implemented in
parallel.
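The selection → crossover → mutation loop described above can be sketched on the classic OneMax toy problem (evolve a bit-string toward all ones, fitness = number of 1 bits); the population size, rates, and seed below are illustrative choices:

```python
# Minimal genetic algorithm on OneMax: truncation selection keeps the
# fitter half as parents, one-point crossover mixes two parents, and a
# point mutation occasionally flips one bit of the child.
import random

def genetic_onemax(length=20, pop_size=30, generations=60, seed=42):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=sum, reverse=True)      # fitness = number of 1 bits
        parents = pop[:pop_size // 2]        # selection
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, length)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.2:           # point mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            offspring.append(child)
        pop = offspring
    return max(pop, key=sum)

best = genetic_onemax()
print(sum(best))
```

The loop matches the description above: each generation's parents are chosen by fitness, and their offspring form the next trial population.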
Reinforcement Learning