03 Data Science Process_Spring-24-25

Chapter 2 outlines the data science process, emphasizing the distinctions between data, information, knowledge, and wisdom. It introduces the CRISP-DM framework, detailing its six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The chapter also discusses various data types, learning methods, data preparation techniques, and the importance of data quality in the context of effective data analysis.


Chapter 2

Data Science Process


• Data
• a "given" or a fact that represents something in the real world
• raw material; can be processed, structured or unstructured
• Data are the elements of analysis

• Information
• Data that have meaning in context
• Data related
• Data after manipulation

• Knowledge
• Knowledge is not information, and information is not data.
• Knowledge is derived from information in the same way information is derived from data.
• In data science, knowledge refers to meaningful insights or patterns derived from data through analysis, modeling, and interpretation.
• Acquired through experience or learning
• It is a concept mainly for humans, unlike data and information.

Example

• Data – Raw, unprocessed facts (e.g., "500 movie reviews").


• Information – Processed or structured data (e.g., "300 positive and 200 negative movie reviews").
• Knowledge – Actionable insights extracted from the information (e.g., "Reviews mentioning
'amazing' and 'great' tend to be positive").
• Wisdom – Applying knowledge effectively for decision-making (e.g., "A recommendation system
suggests movies based on positive reviews").

CRISP Data Mining Framework

The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as
the base for a data science process. It has six sequential phases:

Business understanding – What does the business need?


Data understanding – What data do we have / need? Is it clean?
Data preparation – How do we organize the data for modeling?
Modeling – What modeling techniques should we apply?
Evaluation – Which model best meets the business objectives?
Deployment – How do stakeholders access the results?
Data Science Process
Prior Knowledge
Gaining information on:

Objective of the problem


Subject area of the problem
Data
Objective of the problem

The data science process starts with a need for analysis, a question, or a business objective. This is
possibly the most important step in the data science process. Without a well-defined statement of the
problem, it is impossible to come up with the right dataset and pick the right data science algorithm.

Example : The business objective of this hypothetical case is: If the interest rate of past borrowers with
a range of credit scores is known, can the interest rate for a new borrower be predicted?
Subject area of the problem

The process of data science uncovers hidden patterns in the dataset by exposing relationships between
attributes. But the problem is that it uncovers a lot of patterns. The false or spurious signals are a major
problem in the data science process. It is up to the practitioner to sift through the exposed patterns and
accept the ones that are valid and relevant to the answer to the objective question. Hence, it is essential
to know the subject matter, the context, and the business process generating the data.

Example: The lending business is one of the oldest, most prevalent, and most complex of all businesses. If
the objective is to predict the lending interest rate, then it is important to know how the lending business
works.
Data

Understanding how the data is collected, stored, transformed, reported, and used is essential to the data
science process. There is quite a range of factors to consider, including the quality and the quantity of the
data. The objective of this step is to come up with a dataset to answer the business question through the data
science process. For the following example, a sample dataset of ten data points with three attributes has
been put together: identifier, credit score, and interest rate.
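A minimal sketch of how such a dataset might be assembled with pandas; the specific scores and rates below are hypothetical stand-ins, not the chapter's actual Table 2.1:

```python
import pandas as pd

# Hypothetical sample of ten borrowers: an identifier, a credit score,
# and the interest rate they were charged (values are invented).
data = pd.DataFrame({
    "borrower_id":   range(1, 11),
    "credit_score":  [500, 550, 600, 640, 680, 700, 720, 750, 780, 820],
    "interest_rate": [9.5, 9.0, 8.5, 8.2, 7.8, 7.2, 6.9, 6.5, 6.2, 5.8],
})
print(data.head())
```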
Data Types
• Two types of data: Labelled Data & Unlabelled Data

• Labelled data

• There is a specially designated attribute, and the aim is to use the data given to predict the value of that
attribute for instances that have not yet been seen. Data of this kind is called labelled.

An unseen instance to be classified: Outlook=sunny, Temp=79, Humidity=88, Windy=false, Class=?
Data Types
• Unlabelled data

• Data that does not have any specially designated attribute is called unlabelled.
• Here the aim is simply to extract the most information we can from the data available.
Learning Methods

• Supervised Learning
• Data mining using labelled data is known as supervised learning.

• Classification

• If the designated attribute is categorical, the task is called classification.

• Classification is one form of prediction, where the value to be predicted is a label.

• a hospital may want to classify medical patients into those who are at high, medium or low risk

• we may wish to classify a student project as distinction, merit, pass or fail

• Nearest Neighbour Matching, Classification Rules, Classification Tree, …
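As an illustration of classification on labelled data, here is a minimal sketch using scikit-learn's k-nearest-neighbour classifier; the weather-style training data and labels are invented for the example, with only the query instance (Temp=79, Humidity=88) taken from the slide above:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labelled training data: [temperature, humidity] -> play? (yes/no)
X_train = [[85, 85], [80, 90], [83, 78], [70, 96], [68, 80], [65, 70]]
y_train = ["no", "no", "yes", "yes", "yes", "yes"]

# Fit a 3-nearest-neighbour classifier and predict the label of the
# previously unseen instance from the slide (Temp=79, Humidity=88).
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print(clf.predict([[79, 88]]))
```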


Learning Methods

• Numerical Prediction (Regression)

• If the designated attribute is numerical, the task is called numerical prediction (regression).

• Numerical prediction (often called regression) is another form of prediction. In this case we wish to
predict a numerical value, such as a company’s profits or a share price.
• A very popular way of doing this is to use a Neural Network.

Example: Suppose there is a marketing company A, which runs various advertisements every year and gets
sales from them. The list (shown alongside in the original slide) gives the advertising spend by the company
in each of the last 5 years and the corresponding sales. Now the company wants to spend $200 on
advertising in the year 2019 and wants a prediction of the sales for that year.
Learning Methods
• Regression (Numerical Prediction)

[Figure: regression line fitted to the advertising vs. sales data]
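A minimal numerical-prediction sketch with scikit-learn's linear regression; since the slide's actual table is not reproduced in the extracted text, the spend and sales figures below are invented, and only the $200 query comes from the example:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the slide's table: advertising spend vs. sales
# over the last five years (figures invented for illustration).
ad_spend = [[90], [120], [150], [100], [130]]   # advertising spend per year
sales    = [1000, 1300, 1800, 1150, 1400]       # corresponding sales

model = LinearRegression().fit(ad_spend, sales)

# Predict the sales for a planned $200 advertising spend in 2019.
print(model.predict([[200]]))
```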
Learning Methods
• Unsupervised Learning
• Data mining using unlabelled data is known as unsupervised learning.

• Association Rules

• Sometimes we wish to use a training set to find any relationship that exists amongst the values of
variables, generally in the form of rules known as association rules.

• APRIORI

• Market Basket Analysis


For example, if a customer buys bread, they are also likely to buy butter, eggs, or milk, so these products
are stored on the same shelf or close to one another.
Possible associations include:
1. If customers purchase milk, they also purchase bread: {milk} → {bread}
2. If customers purchase milk and eggs, they also purchase butter and bread: {milk, eggs} → {butter, bread}
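A minimal sketch of the support and confidence computations that underlie association rules (not the full Apriori algorithm); the transactions are invented for illustration:

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread", "eggs", "butter"},
    {"milk", "bread"},
    {"milk", "eggs", "butter", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Confidence of {milk} -> {bread} is support({milk, bread}) / support({milk}).
antecedent, consequent = {"milk"}, {"bread"}
confidence = support(antecedent | consequent) / support(antecedent)
print(f"support={support(antecedent | consequent):.2f}, confidence={confidence:.2f}")
```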
Learning Methods
• Clustering
• Clustering algorithms examine data to find groups of items that are
similar.
• K-Means Clustering, Agglomerative Hierarchical Clustering
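A minimal clustering sketch with scikit-learn's k-means on invented two-dimensional points:

```python
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups.
points = [[1, 2], [1.5, 1.8], [1, 0.6], [8, 8], [9, 11], [8, 9]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # centroid of each cluster
```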

A dataset (example set) is a collection of data with a defined structure. Table 2.1 shows a dataset. It
has a well-defined structure with 10 rows and 3 columns along with the column headers. This structure
is also sometimes referred to as a “data frame”.

A data point (record, object or example) is a single instance in the dataset. Each row in Table 2.1 is a
data point. Each instance contains the same structure as the dataset.

An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each
column in Table 2.1 is an attribute. Attributes can be numeric, categorical, date-time, text, or Boolean
data types. In this example, both the credit score and the interest rate are numeric attributes.
A label (class label, output, prediction, target, or response) is the special attribute to be predicted based
on all the input attributes. In Table 2.1, the interest rate is the output variable.
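In code, the distinction between input attributes and the label is simply a column split; a minimal sketch with a hypothetical stand-in for Table 2.1:

```python
import pandas as pd

# Hypothetical stand-in for Table 2.1: identifier, credit score, interest rate.
df = pd.DataFrame({
    "borrower_id":   [1, 2, 3],
    "credit_score":  [500, 650, 800],
    "interest_rate": [9.5, 8.0, 6.0],
})

X = df[["credit_score"]]    # input attribute(s): features / predictors
y = df["interest_rate"]     # label: target / response to be predicted
print(X.shape, y.shape)
```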
Data Preparation

• Data Exploration
• Data quality
  • Missing values
  • Noisy values
  • Invalid values
• Data types and Conversion
• Transformation
• Outliers
• Feature selection
• Sampling
Data Exploration

Data preparation starts with an in-depth exploration of the data to understand the dataset better. Data
exploration, also known as exploratory data analysis (EDA), provides simple tools to understand the data
fully. EDA is the process of analyzing and visualizing data to understand its structure, identify patterns,
detect outliers, and gather insights before applying machine learning models. Data exploration approaches
involve computing descriptive statistics and visualizing the data.
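A minimal sketch of the descriptive-statistics side of EDA with pandas; the values are invented:

```python
import pandas as pd

# Hypothetical dataset; describe() reports count, mean, standard deviation,
# min, quartiles, and max for each numeric attribute.
df = pd.DataFrame({
    "credit_score":  [500, 600, 700, 750, 800],
    "interest_rate": [9.5, 8.5, 7.0, 6.5, 6.0],
})
print(df.describe())
```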
Data Quality

Data quality is the measure of how well suited a dataset is to serve its specific
purpose. Measures of data quality are based on data correctness, data freshness, and
data completeness.

Data Correctness: How accurately the data values describe real-world facts. In other words, data
correctness ensures that each data point is both accurate and reliable, meaning it reflects the truth as
closely as possible. Data is accurate when it represents real-world values without error. For example, if a
customer’s age is listed as 29 in a database, data accuracy would mean that this number actually reflects
the customer's real age. There are many potential root causes of correctness issues, such as collection
noise, faulty data transformations, outdated data, or an incorrect schema description.
Why is Data Correctness Important?

Data correctness is critical because incorrect data can lead to poor decisions,
unreliable results, and even damage to an organization’s credibility.

For example, in healthcare, incorrect patient data could lead to inappropriate treatment.
Data Freshness: This refers to how relevant the data is to describing the current state of an
entity, taking into consideration the timeliness of the data and how frequently it is updated. Data is
correct when it is also up-to-date. Real-world values change over time, so data representing those values
must be updated accordingly.

Example: A customer’s address may change, and a database would need to reflect that change promptly
to stay accurate.
Data Completeness: Data completeness refers to whether all necessary values are
present in the dataset. For instance, if a dataset tracks employees, each employee entry should have
essential information like name, ID, position, etc. Missing data can lead to an incomplete view of reality,
which affects the correctness of the entire dataset.
Missing Values

In many real-world datasets, data values are not recorded for all attributes. This can happen for several
reasons: some attributes are simply not applicable for some instances; the equipment used to record the
data malfunctioned; additional fields were added to a data collection form after some data had already
been collected; or the information could not be obtained, e.g. about a hospital patient.

The k-nearest neighbour (k-NN) algorithm for classification tasks is often robust to missing values. Neural
network models for classification tasks do not perform well with missing attributes, and thus the data
preparation step is essential for developing neural network models.
Methods to Handle Missing Values

• Discard Instances

• Replace by Most Frequent/Average Value


Discard Instances

• This is the simplest strategy: delete all instances where there is at least one missing value and use
the remainder.

• It has the advantage of avoiding the introduction of any data errors. Its disadvantage is that discarding
data may damage the reliability of the results derived from the data.
Replace by Most Frequent/Average Value

• A less cautious strategy is to estimate each of the missing values using the values that are present
in the dataset.

• A straightforward but effective way of doing this for a categorical attribute is to use its most
frequently occurring (non-missing) value

• In the case of continuous attributes, it is likely that no specific numerical value will occur more
than a small number of times. In this case the estimate used is generally the average value.
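A minimal sketch of both strategies with pandas; the small dataset and its missing entries are invented:

```python
import pandas as pd

# Hypothetical dataset with missing values (None becomes NaN).
df = pd.DataFrame({
    "colour": ["brown", "blue", None, "brown"],   # categorical attribute
    "height": [1.72, None, 1.65, 1.80],           # continuous attribute
})

# Strategy 1: discard every instance with at least one missing value.
complete = df.dropna()

# Strategy 2: replace a missing categorical value with the most frequent
# value, and a missing continuous value with the average.
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])
df["height"] = df["height"].fillna(df["height"].mean())
print(df)
```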
Noisy Values

• A noisy value is one that is valid for the dataset, but is incorrectly recorded

• The number 69.72 may accidentally be entered as 6.972, or a categorical attribute value such as
brown may accidentally be recorded as another of the possible values, such as blue.
Invalid Values

• 69.7X for 6.972, or bbrown for brown

• An invalid value can easily be detected and either corrected or rejected
Data types and Conversion

The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer
numeric (credit score), or categorical. For example, the credit score can be expressed as categorical
values (poor, good, excellent) or as a numeric score. Different data science algorithms impose different
restrictions on the attribute data types.
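A minimal sketch of converting a categorical credit score to numeric form in two common ways; the category order and codes are assumptions for illustration:

```python
import pandas as pd

# Hypothetical categorical credit scores.
df = pd.DataFrame({"credit_score": ["poor", "good", "excellent", "good"]})

# Ordinal conversion: impose an explicit order on the categories.
order = {"poor": 0, "good": 1, "excellent": 2}
df["score_ordinal"] = df["credit_score"].map(order)

# One-hot conversion: one indicator column per category.
print(pd.get_dummies(df["credit_score"]))
```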
Transformation

In some data science algorithms, like k-NN, the input attributes are expected to be numeric and
normalized, because the algorithm compares the values of different attributes and calculates distances
between the data points. Without normalization, an attribute with large values can dominate the distance
calculation; to overcome this problem, we generally normalize the values of continuous attributes.

The idea is to make the values of each attribute run from 0 to 1. In general, if the lowest value of
attribute A is min and the highest value is max, we convert each value of A, say a, to (a − min)/(max −
min).
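A minimal sketch of this min-max normalization formula (assuming max > min); the salary values are invented:

```python
def min_max_normalize(values):
    """Rescale numbers to run from 0 to 1 via a -> (a - min) / (max - min).
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(a - lo) / (hi - lo) for a in values]

# Hypothetical salary values before and after normalization.
print(min_max_normalize([20000, 30000, 50000, 100000]))
# -> [0.0, 0.125, 0.375, 1.0]
```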
Transformation

[Figures: the first 13 rows of the salary data, and the first 6 rows of the data after min-max normalization]
Outliers

Outliers are those data points that are significantly different from the rest of the dataset. They are often
abnormal observations that skew the data distribution, and arise due to inconsistent data entry, or
erroneous observations. Detecting outliers may be the primary purpose of some data science
applications, like fake email detection, fraud or intrusion detection.
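A minimal outlier-detection sketch using the common rule of thumb of flagging values more than two standard deviations from the mean; the data are invented:

```python
import statistics

# Hypothetical values with one abnormal observation.
values = [10, 12, 11, 13, 12, 11, 95]
mu = statistics.mean(values)
sigma = statistics.stdev(values)

# Flag points more than two standard deviations from the mean.
outliers = [v for v in values if abs(v - mu) > 2 * sigma]
print(outliers)   # -> [95]
```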
Feature Selection

Reducing the number of attributes, without significant loss in the performance of the model, is called
feature selection.

Many data science problems involve a dataset with hundreds to thousands of attributes. In text mining
applications, every distinct word in a document forms a distinct attribute in the dataset. Not all the
attributes are equally important or useful in predicting the target, and the presence of some attributes
might even be counterproductive. Some of the attributes may be highly correlated with each other, like
annual income and taxes paid. A large number of attributes in the dataset significantly increases the
complexity of a model and may degrade its performance due to the curse of dimensionality.
Feature Selection

[Figure: glaucoma dataset with 62 features]

Feature Selection

[Figure: variable importance computed by a Regularized Random Forest on the glaucoma dataset (62 features)]
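The slides rank the glaucoma features by Regularized Random Forest variable importance; as that dataset is not included here, this is a minimal stand-in sketch using a plain random forest's feature importances on scikit-learn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset: rank the four iris features by importance.
iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

for name, score in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```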
Data Sampling

Sampling is a process of selecting a subset of records as a representation of the original dataset for use in
data analysis or modeling. The sample data serve as a representative of the original dataset with similar
properties, such as a similar mean. Sampling reduces the amount of data that need to be processed and
speeds up the modeling build process. In most cases, working with samples is sufficient to gain insights,
extract information, and build representative predictive models. Theoretically, the error introduced by
sampling affects the relevancy of the model, but the benefits far outweigh the risks.
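A minimal random-sampling sketch with pandas; the dataset is invented:

```python
import pandas as pd

# Hypothetical dataset of 10,000 rows; draw a 10% random sample.
df = pd.DataFrame({"value": range(10_000)})
sample = df.sample(frac=0.10, random_state=42)

# The sample should roughly preserve properties such as the mean.
print(df["value"].mean(), sample["value"].mean())
```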
MODELING

A model is the abstract representation of the data and the relationships in a given dataset.
MODELING

Splitting Training and Test data sets: The modeling step creates a representative model inferred from the
data. The dataset used to create the model, with known attributes and target, is called the training
dataset.

The validity of the created model will also need to be checked with another known dataset called the
test dataset or validation dataset. To facilitate this process, the overall known dataset can be split into a
training dataset and a test dataset. A standard rule of thumb is that two-thirds of the data are used for
training and one-third as a test dataset.
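A minimal sketch of this two-thirds / one-third split with scikit-learn; the credit-score and interest-rate values are invented:

```python
from sklearn.model_selection import train_test_split

# Hypothetical credit scores (inputs) and interest rates (targets).
X = [[500], [535], [570], [605], [640], [675], [710], [745], [780], [815]]
y = [9.5, 9.0, 8.5, 8.0, 7.5, 7.0, 6.5, 6.2, 6.0, 5.8]

# Rule of thumb from the text: two-thirds training, one-third test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
print(len(X_train), len(X_test))
```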
Training Set and Test Set

MODELING

Splitting Training and Test data sets

[Figure: the known dataset split into a training set and a test set]

MODELING

Evaluation of the test dataset


Application

• Product readiness
• Technical integration
• Model response time
• Remodeling
• Assimilation
Knowledge

Posterior knowledge
