03 Data Science Process_Spring-24-25
CRISP Data Mining Framework
The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
The data science process starts with a need for analysis, a question, or a business objective. This is possibly the most important step in the data science process. Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
Example: The business objective of this hypothetical case is: if the interest rate of past borrowers with a range of credit scores is known, can the interest rate for a new borrower be predicted?
Subject area of the problem
The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes. However, it also uncovers a great many patterns, and false or spurious signals are a major problem in the data science process. It is up to the practitioner to sift through the exposed patterns and accept the ones that are valid and relevant to the answer to the objective question. Hence, it is essential to know the subject matter, the context, and the business process generating the data.
Example: The lending business is one of the oldest, most prevalent, and most complex of all businesses. If the objective is to predict the lending interest rate, then it is important to know how the lending business works.
Data
Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process. There is a range of factors to consider, such as the quality and the quantity of the data. The objective of this step is to come up with a dataset that can answer the business question through the data science process. For the following example, a sample dataset of ten data points with three attributes has been put together: identifier, credit score, and interest rate.
Data Types
• There are two types of data: labelled data and unlabelled data.
• Labelled data has a specially designated attribute, and the aim is to use the data given to predict the value of that attribute for instances that have not yet been seen.
• Data that does not have any specially designated attribute is called unlabelled.
• With unlabelled data, the aim is simply to extract the most information we can from the data available.
Learning Methods
• Supervised Learning
• Data mining using labelled data is known as supervised learning.
• Classification: for example, a hospital may want to classify medical patients into those who are at high, medium or low risk.
• If the designated attribute is categorical, the task is called classification; if it is numerical, the task is called numerical prediction (regression).
• Numerical prediction (often called regression): here we wish to predict a numerical value, such as a company's profits or a share price.
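To make the classification task concrete, here is a minimal scikit-learn sketch; the patient measurements and risk labels below are hypothetical, invented only to illustrate labelled training data:

# Classification sketch: predict a patient's risk level from two vital signs.
# The numbers and labels are hypothetical, for illustration only.
from sklearn.tree import DecisionTreeClassifier

X = [[120, 36.6], [180, 39.1], [140, 38.2], [110, 36.5]]  # blood pressure, body temperature
y = ["low", "high", "medium", "low"]                      # designated class attribute (risk level)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[150, 38.5]]))   # predicted risk class for an unseen patient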
Example: Suppose there is a marketing company A, which runs various advertisements every year and gets sales in return. The list (shown on the right-hand side of the slide) gives the advertisements made by the company in the last 5 years and the corresponding sales. Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the prediction for the sales in that year. A very popular way of making such a numerical prediction is to use a neural network.
Learning Methods
• Regression (Numerical Prediction)
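A minimal numerical-prediction sketch for the advertisement example, using ordinary linear regression rather than a neural network; the five advertisement/sales pairs are placeholders, since the actual table appears only in the slide figure:

# Fit a simple linear model: sales as a function of advertisement spend.
# The five (advertisement, sales) pairs are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

ads   = np.array([[90], [120], [150], [100], [130]])   # advertisement spend per year
sales = np.array([1000, 1300, 1800, 1200, 1380])       # corresponding sales

model = LinearRegression().fit(ads, sales)
predicted = model.predict(np.array([[200]]))           # planned $200 advertisement in 2019
print(f"Predicted sales: {predicted[0]:.0f}")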
Learning Methods
• Unsupervised Learning
• Data mining using unlabelled data is known as unsupervised learning.
• Association Rules
• Sometimes we wish to use a training set to find any relationship that exists amongst the values of
variables, generally in the form of rules known as association rules.
• APRIORI
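The following is a small sketch of the idea behind association rules, computing support and confidence for one candidate rule over a few hypothetical transactions; APRIORI adds efficient pruning of infrequent itemsets on top of this idea:

# Support and confidence for the rule {bread} -> {butter}
# over a few hypothetical market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
support_bread        = sum("bread" in t for t in transactions) / n
support_bread_butter = sum({"bread", "butter"} <= t for t in transactions) / n

confidence = support_bread_butter / support_bread
print(f"support({{bread, butter}}) = {support_bread_butter:.2f}")
print(f"confidence(bread -> butter) = {confidence:.2f}")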
A dataset (example set) is a collection of data with a defined structure. Table 2.1 shows a dataset. It
has a well-defined structure with 10 rows and 3 columns along with the column headers. This structure
is also sometimes referred to as a “data frame”.
A data point (record, object or example) is a single instance in the dataset. Each row in Table 2.1 is a
data point. Each instance contains the same structure as the dataset.
An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each
column in Table 2.1 is an attribute. Attributes can be numeric, categorical, date-time, text, or Boolean
data types. In this example, both the credit score and the interest rate are numeric attributes.
A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes. In Table 2.1, the interest rate is the output variable.
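The structure described above can be expressed as a pandas data frame; the three rows below are hypothetical stand-ins for the values in Table 2.1:

# A small "data frame": rows are data points, columns are attributes,
# and interest_rate is the label to be predicted. Values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "borrower_id":   [1, 2, 3],
    "credit_score":  [500, 650, 800],     # numeric input attribute
    "interest_rate": [9.5, 7.2, 5.1],     # label / target attribute
})
print(df.shape)    # (rows, columns)
print(df.dtypes)   # attribute data types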
Data Preparation
• Data exploration
• Data quality
• Missing values
• Noisy values
• Invalid values
• Data types and conversion
• Transformation
• Outliers
• Feature selection
• Sampling
Data Exploration
Data preparation starts with an in-depth exploration of the data to understand the dataset better. Data exploration, also known as exploratory data analysis (EDA), provides simple tools to understand the data fully. EDA is the process of analyzing and visualizing data to understand its structure, identify patterns, detect outliers, and gather insights before applying machine learning models. Data exploration approaches involve computing descriptive statistics and visualizing the data.
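A minimal exploration sketch, assuming a borrower dataset with the two numeric attributes of the running example (the values are hypothetical):

# Descriptive statistics and a quick visual check of the data.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"credit_score":  [500, 650, 800, 720, 560],
                   "interest_rate": [9.5, 7.2, 5.1, 6.0, 8.7]})   # hypothetical values

print(df.describe())                                   # mean, std, min, quartiles, max per attribute
df.plot.scatter(x="credit_score", y="interest_rate")   # relationship between attribute and label
plt.show()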
Data Quality
Data quality is the measure of how well suited a dataset is to serve its specific purpose. Measures of data quality are based on data correctness, data freshness and data completeness.
Data Correctness: how accurately the data values describe real-world facts. In other words, data correctness ensures that each data point is both accurate and reliable, meaning it reflects the truth as closely as possible. Data is accurate when it represents real-world values without error. For example, if a customer's age is listed as 29 in a database, data accuracy would mean that this number actually reflects the customer's real age. There are many potential root causes of correctness issues, such as collection noise, faulty data transformations, outdated data, or an incorrect schema description.
Why is Data Correctness Important?
Data correctness is critical because incorrect data can lead to poor decisions, unreliable results, and even damage to an organization's credibility.
Example: a customer's address may change, and the database would need to reflect that change promptly to stay accurate; keeping data up to date in this way is what data freshness measures.
Data Completeness: Data completeness refers to whether all necessary values are
present in the dataset. For instance, if a dataset tracks employees, each employee entry should have
essential information like name, ID, position, etc. Missing data can lead to an incomplete view of reality,
which affects the correctness of the entire dataset.
Missing Values
In many real-world datasets, data values are not recorded for all attributes. This can happen simply because some attributes are not applicable for some instances, because of a malfunction of the equipment used to record the data, because a data collection form had fields added after some data had already been collected, or because some information could not be obtained, e.g. about a hospital patient.
The k-nearest neighbor (k-NN) algorithm for classification tasks is often robust to missing values. Neural network models for classification tasks do not perform well with missing attributes, and thus the data preparation step is essential for developing neural network models.
Methods to Handle Missing Values
• Discard Instances
• This is the simplest strategy: delete all instances where there is at least one missing value and use
the remainder.
• It has the advantage of avoiding introducing any data errors. Its disadvantage is that discarding data may damage the reliability of the results derived from the data.
• Replace by Most Frequent/Average Value
• A less cautious strategy is to estimate each of the missing values using the values that are present in the dataset.
• A straightforward but effective way of doing this for a categorical attribute is to use its most frequently occurring (non-missing) value.
• In the case of continuous attributes, it is likely that no specific numerical value will occur more than a small number of times. In this case the estimate used is generally the average value.
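A small pandas sketch of both strategies (the column names and values are hypothetical):

# Handling missing values: discard instances, or impute with mode / mean.
import numpy as np
import pandas as pd

df = pd.DataFrame({"credit_rating": ["good", None, "poor", "good"],
                   "credit_score":  [650.0, 700.0, np.nan, 640.0]})

dropped = df.dropna()                                                             # discard instances with any missing value
df["credit_rating"] = df["credit_rating"].fillna(df["credit_rating"].mode()[0])   # most frequent value
df["credit_score"]  = df["credit_score"].fillna(df["credit_score"].mean())        # average value
print(dropped, df, sep="\n\n")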
Noisy Values
• A noisy value is one that is valid for the dataset but is incorrectly recorded.
• The number 69.72 may accidentally be entered as 6.972, or a categorical attribute value such as
brown may accidentally be recorded as another of the possible values, such as blue.
Invalid Values
• An invalid value is one that is not legitimate for the attribute, for example a value of the wrong type or outside the permitted range. Such values can usually be detected easily and either corrected or rejected.
Data types and Conversion
The attributes in a dataset can be of different types, such as continuous numeric (interest rate), integer
numeric (credit score), or categorical. For example, the credit score can be expressed as categorical
values (poor, good, excellent) or numeric score. Different data science algorithms impose different
restrictions on the attribute data types.
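A small sketch of converting a categorical credit rating into a numeric code; the mapping below is chosen only for illustration:

# Convert a categorical attribute to numeric codes so that algorithms
# requiring numeric inputs can use it. The ordering is illustrative.
import pandas as pd

df = pd.DataFrame({"credit_rating": ["poor", "good", "excellent", "good"]})
rating_to_number = {"poor": 0, "good": 1, "excellent": 2}
df["credit_rating_num"] = df["credit_rating"].map(rating_to_number)
print(df)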
Transformation
In some data science algorithms like k-NN, the input attributes are expected to be numeric and
normalized, because the algorithm compares the values of different attributes and calculates distance
between the data points. Normalization prevents one attribute from dominating the distance results because of large values. To overcome this problem, we generally normalize the values of continuous attributes.
The idea is to make the values of each attribute run from 0 to 1. In general, if the lowest value of
attribute A is min and the highest value is max, we convert each value of A, say a, to (a − min)/(max −
min).
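The same (a − min)/(max − min) rescaling applied directly to a set of hypothetical credit scores:

# Min-max normalization of a continuous attribute to the range [0, 1].
import pandas as pd

scores = pd.Series([500, 650, 800, 720, 560])          # hypothetical credit scores
normalized = (scores - scores.min()) / (scores.max() - scores.min())
print(normalized)   # 500 -> 0.0, 800 -> 1.0, the others in between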
Outliers
Outliers are those data points that are significantly different from the rest of the dataset. They are often
abnormal observations that skew the data distribution, and arise due to inconsistent data entry, or
erroneous observations. Detecting outliers may be the primary purpose of some data science
applications, like fake email detection, fraud or intrusion detection.
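One common way to flag such points is the interquartile-range rule; a minimal sketch with one deliberately injected outlier:

# Flag data points that fall far outside the interquartile range (IQR).
import pandas as pd

values = pd.Series([9.5, 7.2, 5.1, 6.0, 8.7, 45.0])    # 45.0 is a deliberately injected outlier
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)     # flags the 45.0 observation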
Feature Selection
Reducing the number of attributes, without significant loss in the performance of the model, is called
feature selection.
Many data science problems involve a dataset with hundreds to thousands of attributes. In text mining applications, every distinct word in a document forms a distinct attribute in the dataset. Not all the attributes are equally important or useful in predicting the target; the presence of some attributes might even be counterproductive. Some of the attributes may be highly correlated with each other, like annual income and taxes paid. A large number of attributes in the dataset significantly increases the complexity of a model and may degrade its performance due to the curse of dimensionality.
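As a minimal sketch of one simple feature-selection step, the snippet below drops one attribute from each highly correlated pair; the data and the 0.95 threshold are illustrative assumptions:

# Drop one attribute of each highly correlated pair (e.g. annual income vs. taxes paid).
import numpy as np
import pandas as pd

df = pd.DataFrame({"annual_income": [40, 55, 70, 90],
                   "taxes_paid":    [8, 11, 14, 18],      # perfectly correlated with income here
                   "credit_score":  [700, 560, 780, 640]})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # keep each pair once
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(df.drop(columns=to_drop).columns.tolist())   # taxes_paid removed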
Example: feature selection on a glaucoma dataset with 62 features (illustrated in the slide figures).
Data Sampling
Sampling is the process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. The sample serves as a representative of the original dataset with similar properties, such as a similar mean. Sampling reduces the amount of data that needs to be processed and speeds up the model-building process. In most cases, working with samples is sufficient to gain insights, extract information, and build representative predictive models. Theoretically, the error introduced by sampling affects the relevancy of the model, but its benefits far outweigh the risks.
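A minimal sketch using pandas' built-in random sampling; the dataset is a hypothetical column of credit scores:

# Draw a 30% random sample of the dataset (fixed random_state for repeatability).
import pandas as pd

df = pd.DataFrame({"credit_score": range(300, 850, 5)})   # hypothetical full dataset
sample = df.sample(frac=0.3, random_state=42)

print(len(df), len(sample))                                       # sample is about 30% of the original
print(df["credit_score"].mean(), sample["credit_score"].mean())   # similar means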
MODELING
A model is the abstract representation of the data and the relationships in a given dataset.
MODELING
Splitting Training and Test data sets: The modeling step creates a representative model inferred from the
data. The dataset used to create the model, with known attributes and target, is called the training
dataset.
The validity of the created model will also need to be checked with another known dataset called the
test dataset or validation dataset. To facilitate this process, the overall known dataset can be split into a
training dataset and a test dataset. A standard rule of thumb is that two-thirds of the data are used for training and one-third as a test dataset.
Training Set and Test Set
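A sketch of the two-thirds / one-third split using scikit-learn's train_test_split; the attribute values are hypothetical:

# Split the known dataset into a training set (about 2/3) and a test set (about 1/3).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"credit_score":  [500, 560, 620, 650, 700, 740, 780, 800, 820],
                   "interest_rate": [9.5, 8.9, 8.1, 7.2, 6.6, 6.1, 5.6, 5.1, 4.9]})

train, test = train_test_split(df, test_size=1/3, random_state=42)
print(len(train), len(test))   # roughly two-thirds vs. one-third of the rows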
APPLICATION (DEPLOYMENT)
• Production readiness
• Technical integration
• Model response time
• Remodeling
• Assimilation
KNOWLEDGE
• Posterior knowledge: the end result of the data science process, combining the prior knowledge with the new insights extracted from the data.