03 Data Science Process_Spring-24-25
CRISP Data Mining Framework
The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
The data science process starts with a need for analysis, a question, or a business objective. This is possibly the most important step in the data science process. Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
Example: The business objective of this hypothetical case is: if the interest rate of past borrowers with a range of credit scores is known, can the interest rate for a new borrower be predicted?
Subject area of the problem
The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes. However, it also uncovers a great many patterns, and false or spurious signals are a major problem in the data science process. It is up to the practitioner to sift through the exposed patterns and accept the ones that are valid and relevant to the answer to the objective question. Hence, it is essential to know the subject matter, the context, and the business process generating the data.
Example: The lending business is one of the oldest, most prevalent, and most complex of all businesses. If the objective is to predict the lending interest rate, then it is important to know how the lending business works.
Data
Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process. There is a range of factors to consider, such as the quality and the quantity of the data. The objective of this step is to come up with a dataset that can answer the business question through the data science process. For the following example, a sample dataset of ten data points with three attributes has been put together: identifier, credit score, and interest rate.
Data Types
• There are two types of data: labelled data and unlabelled data.
• Labelled data has a specially designated attribute, and the aim is to use the data given to predict the value of that attribute for instances that have not yet been seen.
• Data that does not have any specially designated attribute is called unlabelled.
• With unlabelled data, the aim is simply to extract the most information we can from the data available.
Learning Methods
• Supervised Learning
• Data mining using labelled data is known as supervised learning.
• Classification: for example, a hospital may want to classify medical patients into those who are at high, medium or low risk.
• If the designated attribute is categorical, the task is called classification; if it is numerical, the task is called numerical prediction (regression).
• Numerical prediction (often called regression): here we wish to predict a numerical value, such as a company's profits or a share price.
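To make the classification task concrete, here is a minimal scikit-learn sketch; the patient measurements and risk labels below are hypothetical, invented only to illustrate labelled training data:

# Classification sketch: predict a patient's risk level from two vital signs.
# The numbers and labels are hypothetical, for illustration only.
from sklearn.tree import DecisionTreeClassifier

X = [[120, 36.6], [180, 39.1], [140, 38.2], [110, 36.5]]  # blood pressure, body temperature
y = ["low", "high", "medium", "low"]                      # designated class attribute (risk level)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[150, 38.5]]))   # predicted risk class for an unseen patient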
Example: Suppose there is a marketing company A, which runs various advertisements every year and gets sales in return. The list (shown on the right-hand side of the slide) gives the advertisements made by the company in the last 5 years and the corresponding sales. Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the prediction for the sales in that year. A very popular way of making such a numerical prediction is to use a neural network.
Learning Methods
• Regression (Numerical Prediction)
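A minimal numerical-prediction sketch for the advertisement example, using ordinary linear regression rather than a neural network; the five advertisement/sales pairs are placeholders, since the actual table appears only in the slide figure:

# Fit a simple linear model: sales as a function of advertisement spend.
# The five (advertisement, sales) pairs are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

ads   = np.array([[90], [120], [150], [100], [130]])   # advertisement spend per year
sales = np.array([1000, 1300, 1800, 1200, 1380])       # corresponding sales

model = LinearRegression().fit(ads, sales)
predicted = model.predict(np.array([[200]]))           # planned $200 advertisement in 2019
print(f"Predicted sales: {predicted[0]:.0f}")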
Learning Methods
• Unsupervised Learning
• Data mining using unlabelled data is known as unsupervised learning.
• Association Rules
• Sometimes we wish to use a training set to find any relationship that exists amongst the values of
variables, generally in the form of rules known as association rules.
• APRIORI
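The following is a small sketch of the idea behind association rules, computing support and confidence for one candidate rule over a few hypothetical transactions; APRIORI adds efficient pruning of infrequent itemsets on top of this idea:

# Support and confidence for the rule {bread} -> {butter}
# over a few hypothetical market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
support_bread        = sum("bread" in t for t in transactions) / n
support_bread_butter = sum({"bread", "butter"} <= t for t in transactions) / n

confidence = support_bread_butter / support_bread
print(f"support({{bread, butter}}) = {support_bread_butter:.2f}")
print(f"confidence(bread -> butter) = {confidence:.2f}")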
A dataset (example set) is a collection of data with a defined structure. Table 2.1 shows a dataset. It
has a well-defined structure with 10 rows and 3 columns along with the column headers. This structure
is also sometimes referred to as a “data frame”.
A data point (record, object or example) is a single instance in the dataset. Each row in Table 2.1 is a
data point. Each instance contains the same structure as the dataset.
An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each
column in Table 2.1 is an attribute. Attributes can be numeric, categorical, date-time, text, or Boolean
data types. In this example, both the credit score and the interest rate are numeric attributes.
A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes. In Table 2.1, the interest rate is the output variable.
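The structure described above can be expressed as a pandas data frame; the three rows below are hypothetical stand-ins for the values in Table 2.1:

# A small "data frame": rows are data points, columns are attributes,
# and interest_rate is the label to be predicted. Values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "borrower_id":   [1, 2, 3],
    "credit_score":  [500, 650, 800],     # numeric input attribute
    "interest_rate": [9.5, 7.2, 5.1],     # label / target attribute
})
print(df.shape)    # (rows, columns)
print(df.dtypes)   # attribute data types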
Data Preparation
• Data exploration
• Data quality
• Missing values
• Noisy values
• Invalid values
• Data types and conversion
• Transformation
• Outliers
• Feature selection
• Sampling
Data Exploration
Data preparation starts with an in-depth exploration of the data to understand the dataset better. Data exploration, also known as exploratory data analysis (EDA), provides simple tools to understand the data fully. EDA is the process of analyzing and visualizing data to understand its structure, identify patterns, detect outliers, and gather insights before applying machine learning models. Data exploration approaches involve computing descriptive statistics and visualizing the data.
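A minimal exploration sketch, assuming a borrower dataset with the two numeric attributes of the running example (the values are hypothetical):

# Descriptive statistics and a quick visual check of the data.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"credit_score":  [500, 650, 800, 720, 560],
                   "interest_rate": [9.5, 7.2, 5.1, 6.0, 8.7]})   # hypothetical values

print(df.describe())                                   # mean, std, min, quartiles, max per attribute
df.plot.scatter(x="credit_score", y="interest_rate")   # relationship between attribute and label
plt.show()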
Data Quality
Data quality is the measure of how well suited a dataset is to serve its specific purpose. Measures of data quality are based on data correctness, data freshness and data completeness.
Data Correctness: how accurately the data values describe real-world facts. In other words, data correctness ensures that each data point is both accurate and reliable, meaning it reflects the truth as closely as possible. Data is accurate when it represents real-world values without error. For example, if a customer's age is listed as 29 in a database, data accuracy would mean that this number actually reflects the customer's real age. There are many potential root causes of correctness issues, such as collection noise, faulty data transformations, outdated data, or an incorrect schema description.
Why is Data Correctness Important?
Data correctness is critical because incorrect data can lead to poor decisions, unreliable results, and even damage to an organization's credibility.
Example: a customer's address may change, and the database would need to reflect that change promptly to stay accurate; keeping data up to date in this way is what data freshness measures.
Data Completeness: Data completeness refers to whether all necessary values are
present in the dataset. For instance, if a dataset tracks employees, each employee entry should have
essential information like name, ID, position, etc. Missing data can lead to an incomplete view of reality,
which affects the correctness of the entire dataset.
Missing Values
In many real-world datasets, data values are not recorded for all attributes. This can happen simply because some attributes are not applicable for some instances, because of a malfunction of the equipment used to record the data, because a data collection form had fields added after some data had already been collected, or because some information could not be obtained, e.g. about a hospital patient.
The k-nearest neighbor (k-NN) algorithm for classification tasks is often robust to missing values. Neural network models for classification tasks do not perform well with missing attributes, and thus the data preparation step is essential for developing neural network models.
Methods to Handle Missing Values
• Discard Instances
• This is the simplest strategy: delete all instances where there is at least one missing value and use
the remainder.
• It has the advantage of avoiding introducing any data errors. Its disadvantage is that discarding data may damage the reliability of the results derived from the data.
• Replace by Most Frequent/Average Value
• A less cautious strategy is to estimate each of the missing values using the values that are present in the dataset.
• A straightforward but effective way of doing this for a categorical attribute is to use its most frequently occurring (non-missing) value.
• In the case of continuous attributes, it is likely that no specific numerical value will occur more than a small number of times. In this case the estimate used is generally the average value.
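A small pandas sketch of both strategies (the column names and values are hypothetical):

# Handling missing values: discard instances, or impute with mode / mean.
import numpy as np
import pandas as pd

df = pd.DataFrame({"credit_rating": ["good", None, "poor", "good"],
                   "credit_score":  [650.0, 700.0, np.nan, 640.0]})

dropped = df.dropna()                                                             # discard instances with any missing value
df["credit_rating"] = df["credit_rating"].fillna(df["credit_rating"].mode()[0])   # most frequent value
df["credit_score"]  = df["credit_score"].fillna(df["credit_score"].mean())        # average value
print(dropped, df, sep="\n\n")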
Noisy Values
• A noisy value is one that is valid for the dataset but is incorrectly recorded.
• The number 69.72 may accidentally be entered as 6.972, or a categorical attribute value such as
brown may accidentally be recorded as another of the possible values, such as blue.
Invalid Values
• An invalid value is one that is not legitimate for the attribute, for example a value of the wrong type or outside the permitted range. Such values can usually be detected easily and either corrected or rejected.
Data types and Conversion
The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer
numeric (credit score), or categorical. For example, the credit score can be expressed as categorical
values (poor, good, excellent) or numeric score. Different data science algorithms impose different
restrictions on the attribute data types.
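A small sketch of converting a categorical credit rating into a numeric code; the mapping below is chosen only for illustration:

# Convert a categorical attribute to numeric codes so that algorithms
# requiring numeric inputs can use it. The ordering is illustrative.
import pandas as pd

df = pd.DataFrame({"credit_rating": ["poor", "good", "excellent", "good"]})
rating_to_number = {"poor": 0, "good": 1, "excellent": 2}
df["credit_rating_num"] = df["credit_rating"].map(rating_to_number)
print(df)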
Transformation
In some data science algorithms like k-NN, the input attributes are expected to be numeric and
normalized, because the algorithm compares the values of different attributes and calculates distance
between the data points. Normalization prevents one attribute from dominating the distance results because of large values. To overcome this problem, we generally normalize the values of continuous attributes.
The idea is to make the values of each attribute run from 0 to 1. In general, if the lowest value of
attribute A is min and the highest value is max, we convert each value of A, say a, to (a − min)/(max −
min).
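The same (a − min)/(max − min) rescaling applied directly to a set of hypothetical credit scores:

# Min-max normalization of a continuous attribute to the range [0, 1].
import pandas as pd

scores = pd.Series([500, 650, 800, 720, 560])          # hypothetical credit scores
normalized = (scores - scores.min()) / (scores.max() - scores.min())
print(normalized)   # 500 -> 0.0, 800 -> 1.0, the others in between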
Outliers
Outliers are those data points that are significantly different from the rest of the dataset. They are often
abnormal observations that skew the data distribution, and arise due to inconsistent data entry, or
erroneous observations. Detecting outliers may be the primary purpose of some data science
applications, like fake email detection, fraud or intrusion detection.
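One common way to flag such points is the interquartile-range rule; a minimal sketch with one deliberately injected outlier:

# Flag data points that fall far outside the interquartile range (IQR).
import pandas as pd

values = pd.Series([9.5, 7.2, 5.1, 6.0, 8.7, 45.0])    # 45.0 is a deliberately injected outlier
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)     # flags the 45.0 observation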
Feature Selection
Reducing the number of attributes, without significant loss in the performance of the model, is called
feature selection.
Many data science problems involve a dataset with hundreds to thousands of attributes. In text mining applications, every distinct word in a document forms a distinct attribute in the dataset. Not all the attributes are equally important or useful in predicting the target; the presence of some attributes might even be counterproductive. Some of the attributes may be highly correlated with each other, like annual income and taxes paid. A large number of attributes in the dataset significantly increases the complexity of a model and may degrade its performance due to the curse of dimensionality.
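As a minimal sketch of one simple feature-selection step, the snippet below drops one attribute from each highly correlated pair; the data and the 0.95 threshold are illustrative assumptions:

# Drop one attribute of each highly correlated pair (e.g. annual income vs. taxes paid).
import numpy as np
import pandas as pd

df = pd.DataFrame({"annual_income": [40, 55, 70, 90],
                   "taxes_paid":    [8, 11, 14, 18],      # perfectly correlated with income here
                   "credit_score":  [700, 560, 780, 640]})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # keep each pair once
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(df.drop(columns=to_drop).columns.tolist())   # taxes_paid removed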
Example: feature selection on a glaucoma dataset with 62 features (illustrated in the slide figures).
Data Sampling
Sampling is the process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. The sample serves as a representative of the original dataset with similar properties, such as a similar mean. Sampling reduces the amount of data that needs to be processed and speeds up the model-building process. In most cases, working with samples is sufficient to gain insights, extract information, and build representative predictive models. Theoretically, the error introduced by sampling affects the relevancy of the model, but its benefits far outweigh the risks.
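A minimal sketch using pandas' built-in random sampling; the dataset is a hypothetical column of credit scores:

# Draw a 30% random sample of the dataset (fixed random_state for repeatability).
import pandas as pd

df = pd.DataFrame({"credit_score": range(300, 850, 5)})   # hypothetical full dataset
sample = df.sample(frac=0.3, random_state=42)

print(len(df), len(sample))                                       # sample is about 30% of the original
print(df["credit_score"].mean(), sample["credit_score"].mean())   # similar means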
MODELING
A model is the abstract representation of the data and the relationships in a given dataset.
MODELING
Splitting Training and Test data sets: The modeling step creates a representative model inferred from the
data. The dataset used to create the model, with known attributes and target, is called the training
dataset.
The validity of the created model will also need to be checked with another known dataset called the
test dataset or validation dataset. To facilitate this process, the overall known dataset can be split into a
training dataset and a test dataset. A standard rule of thumb is that two-thirds of the data are used for training and one-third as a test dataset.
Training Set and Test Set
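A sketch of the two-thirds / one-third split using scikit-learn's train_test_split; the attribute values are hypothetical:

# Split the known dataset into a training set (about 2/3) and a test set (about 1/3).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"credit_score":  [500, 560, 620, 650, 700, 740, 780, 800, 820],
                   "interest_rate": [9.5, 8.9, 8.1, 7.2, 6.6, 6.1, 5.6, 5.1, 4.9]})

train, test = train_test_split(df, test_size=1/3, random_state=42)
print(len(train), len(test))   # roughly two-thirds vs. one-third of the rows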
APPLICATION (DEPLOYMENT)
• Production readiness
• Technical integration
• Model response time
• Remodeling
• Assimilation
KNOWLEDGE
• Posterior knowledge: the end result of the data science process, combining the prior knowledge with the new insights extracted from the data.