
INTRODUCTION TO DATA SCIENCE

Unit – I: Introduction
Introduction to Data Science – Evolution of Data Science – Data Science Roles – Stages in
a Data Science Project – Applications of Data Science in various fields – Data Security
Issues. Data Collection Strategies, Data Categorization: NOIR Topology.

Unit – II: Data Pre-Processing & Exploratory Data Analysis


Data Pre-Processing Overview – Data Cleaning – Data Integration and Transformation –
Data Reduction – Data Discretization.
Descriptive Statistics – Mean, Standard Deviation, Skewness and Kurtosis – Box Plots –
Pivot Table – Heat Map – Correlation Statistics –ANOVA.

Unit – III: Model Development


Simple and Multiple Regression – Model Evaluation using Visualization – Residual Plot –
Distribution Plot – Polynomial Regression and Pipelines – Measures for In-sample
Evaluation – Prediction and Decision Making.

Unit – IV: Model Evaluation


Generalization Error – Out-of-Sample Evaluation Metrics – Cross Validation – Overfitting
– Underfitting and Model Selection – Prediction by using Ridge Regression – Testing
Multiple Parameters by using Grid Search.

REFERENCES:

1. Jojo Moolayil, "Smarter Decisions: The Intersection of IoT and Data Science", Packt, 2016.
2. Cathy O'Neil and Rachel Schutt, "Doing Data Science", O'Reilly, 2015.
3. David Dietrich, Barry Heller, Beibei Yang, "Data Science and Big Data Analytics", EMC, 2013.
4. Pethuru Raj, "Handbook of Research on Cloud Infrastructures for Big Data Analytics", IGI Global.
Introduction to Data Science
UNIT-1 Topics
1. Introduction to Data Science
2. A brief history of Data Science
3. Data science roles and skill tracks
4. Problem types
5. List of potential data science careers
6. Life cycle of Data Science (stages of a Data Science project)
   https://www.knowledgehut.com/blog/data-science/what-is-data-science-life-cycle
   or https://www.javatpoint.com/life-cycle-phases-of-data-analytics
7. Applications of Data Science in various fields
   https://www.geeksforgeeks.org/major-applications-of-data-science/
   https://www.edureka.co/blog/data-science-applications/
8. Data security issues
   https://www.imperva.com/learn/data-security/data-security/
9. Data collection strategies
   https://www.simplilearn.com/what-is-data-collection-article
10. Data categorization: NOIR
    https://byjus.com/maths/types-of-data-in-statistics/
    https://cse.iitkgp.ac.in/~dsamanta/courses/da/resources/tutorials/PS02%20Data%20Categorization.pdf

1. Introduction to Data Science


What is Data Science?
Data Science can be explained as the entire process of gathering actionable insights from raw
data; it involves concepts such as statistical analysis, data analysis, machine learning algorithms,
data modeling, and data preprocessing.
Let's consider an example: the case study that became the Hollywood feature film "Moneyball".
The movie shows how an underdog team went on to compete at the highest level of a baseball
tournament by analyzing the statistical data points of each player and quantifying their
performances to win games. This closely mirrors how data science actually works.
How does Data Science Work?
 Asking the correct questions and analyzing the raw data.
 Modeling the data using various complex and efficient algorithms.
 Visualizing the data to get a better perspective.
 Understanding the data to make better decisions and finding the final result.
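The workflow above can be sketched in a few lines of Python with pandas and scikit-learn. This is only an illustrative outline: the file "players.csv" and its column names are hypothetical placeholders, not part of any real dataset.

```python
# A minimal sketch of the ask -> model -> visualize -> decide workflow.
# "players.csv" and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("players.csv")            # gather the raw data
print(data.describe())                       # ask questions of the data

X = data[["on_base_pct", "slugging_pct"]]    # model with an algorithm
y = data["wins"]
model = LinearRegression().fit(X, y)

data.plot.scatter(x="on_base_pct", y="wins") # visualize for a better perspective
print(model.predict(X.head()))               # use the results to support decisions
```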

Example:
Suppose we want to travel from station A to station B by car. We need to make some decisions,
such as which route will get us to the destination fastest, which route is likely to be free of traffic
jams, and which route will be the most cost-effective. All these decision factors act as input data,
and we get an appropriate answer from analyzing them; this analysis of data is called data
analysis, and it is a part of data science.
Need for Data Science:
In today's world data has become vast: approximately 2.5 quintillion bytes of data are generated
every day, which has led to a data explosion. Researchers estimated that by 2020, 1.7 MB of data
would be created every second for every single person on earth. Every company requires data to
work, grow, and improve its business.
Handling such a huge amount of data is a challenging task for every organization. To handle,
process, and analyze this data we require complex, powerful, and efficient algorithms and
technology, and that technology is data science. Following are some main reasons for using data
science technology:
 With the help of data science technology, we can convert massive amounts of raw and
unstructured data into meaningful insights.
 Data science technology is being adopted by various companies, whether big brands or
startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science
algorithms for a better customer experience.
 Data science is being used to automate transportation, such as creating self-driving cars, which
are the future of transportation.
 Data science can help in different kinds of predictions, such as surveys, elections, flight ticket
confirmation, etc.

2. A brief history of Data Science

1. 1962 – Inception
a. Future of Data Analysis – In 1962, John W. Tukey wrote "The Future of Data Analysis", where he
first discussed the importance of data analysis with respect to science rather than mathematics.
2. 1974
a. Concise Survey of Computer Methods – In 1974, Peter Naur published the "Concise Survey of
Computer Methods", which surveys the contemporary methods of data processing in various
applications.
3. 1974 – 1980
a. International Association for Statistical Computing – In 1977, the committee was formed whose
sole purpose was to link traditional statistical methodology with modern computer technology to
extract useful information and knowledge from data.
4. 1980 – 1990
a. Knowledge Discovery in Databases – In 1989, Gregory Piatetsky-Shapiro chaired the first
Knowledge Discovery in Databases (KDD) workshop, which later became the annual conference on
knowledge discovery and data mining.
5. 1990 – 2000
a. Database Marketing – In 1994, BusinessWeek published a cover story explaining how big
organizations were using customer data to predict the likelihood of a customer buying a specific
product, much like how targeted ads work for social media campaigns in the modern era.
b. International Federation of Classification Societies – In 1996, the term "Data Science" was used
for the first time at a conference held in Japan.
6. 2000 – 2010
a. Data Science – An Action Plan for Expanding the Technical Areas of the Field of Statistics – In
2001, William S. Cleveland published this action plan, which focused on major areas of technical
work in the field of statistics and coined the term Data Science.
b. Statistical Modeling – The Two Cultures – In 2001, Leo Breiman wrote: "There are two cultures in
the use of statistical modeling to reach conclusions from data. One assumes that the data are
generated by a given stochastic data model. The other uses algorithmic models and treats the data
mechanism as unknown."
c. Data Science Journal – April 2002 saw the launch of a journal focused on the management of data
and databases in science and technology.
7. 2010 – Present
a. Data Everywhere – In February 2010, Kenneth Cukier wrote a special report for The Economist
which said that a new kind of professional has arrived – the data scientist, who combines the skills of
software programmer, statistician, and storyteller/artist to extract the nuggets of gold hidden under
mountains of data.
b. What is Data Science? – In June 2010, Mike Loukides described data science as combining
entrepreneurship with patience, the willingness to build data products incrementally, the ability to
explore, and the ability to iterate over a solution.
3. Data science role and Skill tracks
Data science has three main skill tracks: engineering, analysis, and modeling.
1. Engineering
Data engineering is the foundation that makes everything else possible. It mainly involves building
the data pipeline infrastructure: the software and hardware used to store data and perform the data
ETL (extract, transform, and load) process. With the development of cloud services, storing and
computing data on the cloud has become the new norm.
(a) Data environment
Designing and setting up the entire environment to support data science workflow is the prerequisite for
data science projects. It may include setting up storage in the cloud, Kafka platform, Hadoop and Spark
cluster, etc.
(b) Data management

Automated data collection is a common task that includes parsing logs (depending on the stage of the
company and the type of industry you are in), web scraping, API queries, and interrogating data
streams. It also involves determining and constructing data schemas to support analytical and
modeling needs, and using tools, processes, and guidelines to ensure data is correct, standardized, and
documented.
(c) Production
It involves the whole pipeline from data access, preprocessing, and modeling to final deployment. It is
necessary to make the system work smoothly with all existing software stacks, so the system must be
monitored through robust measures, such as rigorous error handling, fault tolerance, and graceful
degradation, to make sure it is running smoothly and the users are happy.
2. Analysis
Analysis turns raw information into insights in a fast and often exploratory way.
(a) Domain knowledge
Domain knowledge is the understanding of the organization or industry where you apply data science.
Some questions about the context are:
 What are the business questions?
 How to translate a business need to a data problem?
Domain knowledge helps you to deliver the results in an audience-friendly way with the right solution to
the right problem.
(b) Exploratory analysis
This type of analysis is about exploration and discovery. It often involves different ways to slice and
aggregate data.
(c) Storytelling
Storytelling with data is critical to deliver insights and drive better decision making. It usually requires
data summarization, aggregation, and visualization. A business-friendly report or an interactive dashboard
is the typical outcome of the analysis.
3. Modeling
Modeling is a process that dives deeper into the data to discover patterns.
(a) Supervised learning
Supervised learning happens in the presence of a supervisor, just like the learning performed by a
small child with the help of a teacher. As a child is trained to recognize fruits, colors, and numbers
under the supervision of a teacher, this method is called supervised learning. In this method, every
step of the child is checked by the teacher, and the child learns from the output it has to produce.
(b) Unsupervised learning
Unsupervised learning happens without the help of a supervisor, just as a fish learns to swim by
itself. It is an independent learning process. In this model, as there is no output mapped to the input,
the target values are unknown/unlabeled. The system needs to learn by itself from the data input to it
and detect the hidden patterns.
(c) Customized model development
A data scientist may need to develop new models to accommodate the features of the problem at hand.
Here is a list of questions that can help you decide the type of technique to use:
 Is your data labeled?
 Is your data easy to collect?
4. Some common skills
Data Preprocessing:
Data preprocessing is the process of converting raw data into clean data that is fit for use.
(a) Data preprocessing for data engineer
A data lake is a storage repository that stores a vast amount of raw data in its native format, including
XML, JSON, CSV, Parquet, etc. Without proper management it is more of a data swamp than a data
lake. The data engineer's job is to get a clean schema out of the data lake by transforming and
formatting the data.
(b) Data preprocessing for data analyst and scientist

A data analyst collects and stores data on sales numbers, market research, logistics, linguistics, or other
behaviors. They bring technical expertise to ensure the quality and accuracy of that data, then process,
design, and present it in ways to help people, businesses, and organizations make better decisions.
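A rough illustration of this kind of preprocessing is sketched below with pandas; the table, its column names, and the cleaning rules are all made up for the example.

```python
# Illustrative cleaning of a small, made-up table with pandas.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "age":    [25, np.nan, 47, 200],                    # a missing value and an outlier
    "salary": ["50,000", "62000", None, "58000"],       # inconsistent formatting
    "city":   [" Delhi", "delhi", "Mumbai", "Pune"],    # inconsistent spelling/case
})

clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())             # impute missing age
clean = clean[clean["age"].between(0, 100)]                           # drop the outlier row
clean["salary"] = clean["salary"].str.replace(",", "").astype(float)  # fix the data type
clean["city"] = clean["city"].str.strip().str.title()                 # standardize text
print(clean)
```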
Data Scientist:
A data scientist is a professional who works with an enormous amount of data to come up with
compelling business insights through the deployment of various tools, techniques, methodologies,
algorithms, etc.
4. Problem type
1. Description
The primary analytic problem is to summarize and explore a data set with descriptive statistics (mean,
standard deviation, and so forth) and visualization methods. Questions of this kind are:
 What is the annual income distribution?
 What are the mean active days of different accounts?
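A quick sketch of how such descriptive questions can be answered with pandas is shown below; the income figures are invented purely for illustration.

```python
# Descriptive statistics for a hypothetical "annual_income" sample.
import pandas as pd

incomes = pd.Series([32000, 41000, 55000, 61000, 75000, 120000],
                    name="annual_income")
print(incomes.describe())          # count, mean, std, min, quartiles, max
print("skewness:", incomes.skew())
print("kurtosis:", incomes.kurt())
incomes.plot.hist(bins=5)          # visualize the distribution (needs matplotlib)
```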
2. Comparison
The first common modeling problem is to compare different groups.
Here are some examples:
 Are males more inclined to buy our products than females?
 Are there any differences in customer satisfaction in different business districts?
The commonly used statistical tests are chi-square test, t-test, and ANOVA.
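As a small illustration, the sketch below runs a t-test (two groups) and a one-way ANOVA (three groups) with SciPy on made-up satisfaction scores; a chi-square test would use scipy.stats.chi2_contingency in the same spirit.

```python
# Comparing groups with a t-test and a one-way ANOVA (made-up scores).
from scipy import stats

district_a = [7.2, 6.8, 7.9, 8.1, 6.5]
district_b = [6.1, 5.9, 6.7, 6.3, 5.8]
district_c = [7.5, 7.8, 8.2, 7.1, 7.6]

t_stat, p_value = stats.ttest_ind(district_a, district_b)            # two groups
print("t-test p-value:", p_value)

f_stat, p_value = stats.f_oneway(district_a, district_b, district_c) # three or more groups
print("ANOVA p-value:", p_value)
```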
3. Clustering
Clustering is a widespread problem, and it can answer questions like:
 How many reasonable customer segments are there based on historical purchase patterns?
Clustering is an unsupervised learning mechanism. Unsupervised learning happens without the help
of a supervisor, just as a fish learns to swim by itself. It is an independent learning process. In this
model, as there is no output mapped to the input, the target values are unknown/unlabeled. The
system needs to learn by itself from the data input to it and detect the hidden patterns.
What Is an Unlabeled Dataset?
A dataset with unknown output values for all the input values is called an unlabeled dataset. For
example, while buying products online, if butter is put in the cart, the system suggests buying bread,
cheese, etc. The unsupervised model looks at the data points and predicts the other attributes that are
associated with the product.
Unsupervised learning algorithms include clustering and association algorithms such as Apriori,
k-means clustering, and other association rule mining algorithms.
Clustering Algorithm: The method of finding similarities between data items, such as the same
shape, size, color, price, etc., and grouping them to form a cluster is called cluster analysis.
Association Rule Mining: This type of mining finds the most frequently occurring itemsets or
associations between elements, such as "products often purchased together".
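A minimal k-means sketch is shown below; the purchase features are invented, and in practice the number of clusters would be chosen with more care (for example, using the elbow method).

```python
# Segmenting customers with k-means on two made-up features:
# [orders per year, average order value].
import numpy as np
from sklearn.cluster import KMeans

purchases = np.array([[2, 20], [3, 25], [30, 22], [28, 30], [5, 300], [4, 280]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(purchases)
print("segment of each customer:", kmeans.labels_)
print("segment centers:", kmeans.cluster_centers_)
```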
4. Classification
Here are some example questions:
 Is this customer likely to buy our product?
 Is it spam email or not?
Classification is a supervised learning mechanism. Supervised learning happens in the presence of a
supervisor, just like the learning performed by a small child with the help of a teacher. As a child is
trained to recognize fruits, colors, and numbers under the supervision of a teacher, this method is
called supervised learning. In this method, every step of the child is checked by the teacher, and the
child learns from the output it has to produce.
What Is a Labeled Dataset?
The dataset with outputs known for a given input is called a Labeled Dataset. For example, an image of
fruit along with the fruit name is known. So when a new image of fruit is shown, it compares with the
training set to predict the answer.
Supervised learning is a fast learning mechanism with high accuracy. The supervised learning problems
include regression and classification problems.
Some of the supervised learning algorithms are:
Decision Trees, K-Nearest Neighbor, Linear Regression, and Neural Networks.
Classification: In these types of problems, we predict the response as specific classes, such as "yes"
or "no". When only 2 classes are present, it is called binary classification; for more than 2 class
values, it is called multi-class classification. The predicted response values are discrete values.
For example, Is it the image of the sun or the moon? The classification algorithm separates the data into
classes.
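The sketch below shows a tiny supervised classifier for the spam question; the two numeric features and the labels are entirely made up.

```python
# Spam vs. not-spam from two hypothetical features:
# [number of links, number of ALL-CAPS words]; labels are known (supervised).
from sklearn.tree import DecisionTreeClassifier

X_train = [[10, 8], [12, 15], [1, 0], [0, 1], [8, 9], [2, 1]]
y_train = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict([[9, 7], [0, 2]]))   # expected: ['spam', 'not spam']
```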
Difference Between Supervised and Unsupervised Learning
 In supervised learning algorithms, the output for the given input is known. In unsupervised
learning algorithms, the output for the given input is unknown.
 Supervised algorithms learn from a labeled set of data, which helps in evaluating the accuracy on
the training data. Unsupervised algorithms are provided with unlabeled data, in which they try to
find patterns and associations between the data items.
 Supervised learning is a predictive modeling technique that predicts future outcomes. Unsupervised
learning is a descriptive modeling technique that explains the real relationships between the
elements.
 Supervised learning includes classification and regression algorithms. Unsupervised learning
includes clustering and association rule learning algorithms.
 Some supervised learning algorithms are Linear Regression, Naïve Bayes, and Neural Networks.
Some unsupervised learning algorithms are k-means clustering, Apriori, etc.
 Supervised learning is more accurate, as the input data and corresponding output are well known
and the machine only needs to give predictions. Unsupervised learning has less accuracy, as the
input data is unlabeled; the machine has to first understand and label the data and then give
predictions.
5. Regression
Here are some example questions:
 What will be the temperature tomorrow?
 How much inventory should we have?
Regression: Regression problems predict the response as continuous values, i.e., values that may
range from -infinity to infinity and can take many different values. For example, a linear regression
algorithm can predict the cost of a house based on many parameters such as location, proximity to an
airport, size of the house, etc.
Regression analysis is a statistical method to model the relationship between a dependent (target)
variable and one or more independent (predictor) variables.
Regression analysis helps us to understand how the value of the dependent variable is changing
corresponding to an independent variable when other independent variables are held fixed. It predicts
continuous/real values such as temperature, age, salary, price, etc.
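A minimal linear regression sketch follows; the house sizes and prices are illustrative numbers only.

```python
# Predicting a continuous value (house price) from size with linear regression.
from sklearn.linear_model import LinearRegression

X = [[600], [850], [1000], [1200], [1500]]   # size in square feet
y = [30, 42, 50, 61, 75]                     # price (illustrative units)

reg = LinearRegression().fit(X, y)
print("predicted price for 1100 sq. ft.:", reg.predict([[1100]])[0])
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
```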

6. Optimization
Optimization is another common type of problem in data science: finding an optimal solution by
tuning a few tunable variables, given other non-controllable environmental variables. It is an
expansion of the comparison problem and can solve problems such as:
 What is the best route to deliver the packages?
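As a toy illustration of route optimization, the brute-force sketch below tries every delivery order over a made-up distance table; real problems would use a dedicated solver (for example, SciPy or OR-Tools) instead of enumeration.

```python
# Brute-force search for the cheapest delivery route over made-up distances.
from itertools import permutations

stops = ["A", "B", "C", "D"]                 # "A" is the depot
dist = {("A", "B"): 4, ("A", "C"): 7, ("A", "D"): 3,
        ("B", "C"): 2, ("B", "D"): 5, ("C", "D"): 6}

def d(x, y):
    return dist.get((x, y)) or dist.get((y, x))

def route_length(route):
    return sum(d(a, b) for a, b in zip(route, route[1:]))

best = min(permutations(stops[1:]), key=lambda rest: route_length(("A",) + rest))
print("best route:", ("A",) + best, "length:", route_length(("A",) + best))
```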
5. List of potential data science careers
Data infrastructure engineer
Designing, building, and running the data infrastructure to support the organization's growing data
needs.
Skills required: Go, Python, AWS/Google Cloud/Azure, Logstash, Kafka, and Hadoop.
Data Engineer:
A data engineer works with massive amounts of data and is responsible for building and maintaining
the data architecture of a data science project. A data engineer also works on the creation of dataset
processes used in modeling, mining, acquisition, and verification.
Skills required: A data engineer must have in-depth knowledge of SQL, MongoDB, Cassandra,
HBase, Apache Spark, Hive, and MapReduce, with language knowledge of Python, C/C++, Java,
Perl, etc., along with Spark/Scala, SQL, AWS/Google Cloud/Azure, and data modeling.
BI engineer
Design, implement, and maintain systems used to collect and analyze business intelligence (BI) data.
BI engineers create dashboards, databases, and other platforms that allow for efficient collection and
evaluation of BI data.
Skills required: Tableau/Looker/Mode, etc., data visualization, SQL, Python.
Data Analyst:
A data analyst is an individual who mines huge amounts of data, models the data, and looks for
patterns, relationships, trends, and so on. At the end of the day, the analyst comes up with
visualization and reporting to analyze the data for decision making and problem solving.
Skills required: To become a data analyst, you must have a good background in mathematics,
business intelligence, data mining, and basic statistics. You should also be familiar with some
computer languages and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark,
etc.
Data Scientist:
A data scientist is a professional who works with an enormous amount of data to come up with
compelling business insights through the deployment of various tools, techniques, methodologies,
algorithms, etc.
Skills required: To become a data scientist, one should have technical language skills such as R,
SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data scientists must also have an
understanding of statistics, mathematics, visualization, and communication.
Research scientist
 Building research proposals
 Creating and conducting experiments
 Analysing the results of the experiments
 Working with other researchers to use and develop the end product
 Applying for grants to continue research
Skills required: R/Python, advanced statistics, experimental design, ML, research background,
publications, conference contributions, algorithms
Applied scientist
An applied scientist is more interested in real-life applications.

An applied scientist does scientific research with a focus on applying the results of their studies to solving
real-world problems. They use the scientific method to develop research questions and then conduct
studies that lead to practical solutions.
Applied Scientists at Amazon, for example, focus on projects to enhance Amazon's customer
experience, such as Automatic Speech Recognition (ASR), Natural Language Understanding (NLU),
Audio Signal Processing, and Text-to-Speech (TTS).
Skills required: ML algorithm design, often with an expectation of fundamental software engineering
skills.
Machine Learning Engineer
Their tasks involve researching, building, and designing the artificial intelligence responsible for
machine learning, and maintaining and improving existing artificial intelligence systems.
Machine learning engineers design and create the AI algorithms capable of learning and making
predictions that define machine learning (ML). An ML engineer typically works as part of a larger data
science team and will communicate with data scientists, administrators, data analysts, data engineers and
data architects. They should also have an understanding of various algorithms, problem-solving analytical
skill, probability, and statistics.
Skills required: a more advanced software engineering skillset, algorithms, machine learning
algorithm design, and system design.
6. Life Cycle of Data Science (Stages of Data Science Project)
The Lifecycle of Data Science
The major steps in the life cycle of Data Science project are as follows:
1. Problem identification
Domain experts and data scientists are the key persons in problem identification. The domain expert
has in-depth knowledge of the application domain and knows exactly what problem is to be solved.
The data scientist understands the domain and helps in the identification of the problem and of
possible solutions to it.
2. Business Understanding
Understanding what the customer exactly wants from the business perspective is nothing but
Business Understanding. Whether the customer wishes to make predictions, improve sales, minimise
loss, or optimise a particular process forms the business goals. During business understanding, two
important steps are followed:
 KPI (Key Performance Indicator)
For any data science project, key performance indicators define the performance or success of the
project. There needs to be an agreement between the customer and the data science project team on
the business-related indicators and the related data science project goals. Depending on the business
need, the business indicators are devised, and the data science project team accordingly decides the
goals and indicators. To better understand this, consider an example: suppose the business need is to
optimise the overall spending of the company; then the data science goal might be to use the existing
resources to manage double the clients. Defining the key performance indicators is very crucial for
any data science project, as the cost of the solution will be different for different goals.
 SLA (Service Level Agreement)
Once the performance indicators are set, finalizing the service level agreement is important. The
service level agreement terms are decided as per the business goals. For example, an airline
reservation system may require simultaneous processing of, say, 1000 users; the product must satisfy
this service requirement as part of the service level agreement.
Once the performance indicators are agreed and the service level agreement is completed, the project
proceeds to the next important step.
3. Collecting Data
Data Collection is the important step as it forms the important base to achieve targeted business goals.

The basic data collection can be done using surveys; generally, the data collected through surveys
provides important insights. Much of the data is collected from the various processes followed in the
enterprise. At various steps the data is recorded in the software systems used in the enterprise, which
is important for understanding the process followed from product development to deployment and
delivery. The historical data available through archives is also important for better understanding the
business. Transactional data also plays a vital role, as it is collected on a daily basis. Many statistical
methods are applied to the data to extract the important information related to the business. In a data
science project the major role is played by data, so proper data collection methods are important.
4. Pre-processing data
Large amounts of data are collected from archives, daily transactions, and intermediate records. The
data is available in various formats and forms; some data may be available in hard-copy formats as
well. The data is scattered across various places on various servers. All this data is extracted and
converted into a single format and then processed. A data warehouse is constructed, where the
Extract, Transform and Load (ETL) operations are carried out. The data architect's role is important
in this stage: the data architect decides the structure of the data warehouse and performs the ETL
operations.
5. Analyzing data
Now that the data is available and ready in the required format, the next important step is to
understand the data in depth. This understanding comes from analysis of the data using the various
statistical tools available. A data engineer plays a vital role in the analysis of data. This step is also
called Exploratory Data Analysis (EDA). Here the data is examined using various statistical
functions, and dependent and independent variables or features are identified. Careful analysis of the
data reveals which data or features are important and what the spread of the data is. Various plots are
used to visualize the data for better understanding. Tools like Tableau and Power BI are well known
for performing Exploratory Data Analysis and visualization. Knowledge of data science with Python
and R is important for performing EDA on any type of data.
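A minimal EDA sketch in Python is shown below; "sales.csv" and its columns are hypothetical, and the same exploration could equally be done in R or a BI tool.

```python
# Exploratory data analysis on a hypothetical sales dataset.
import pandas as pd

df = pd.read_csv("sales.csv")

print(df.info())                    # column types and missing values
print(df.describe())                # spread of each numeric feature
print(df.corr(numeric_only=True))   # dependence between numeric features

df["revenue"].plot.box()                    # box plot to spot outliers
df.plot.scatter(x="ad_spend", y="revenue")  # relationship between two features
```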
6. Data Modelling
Data modelling is the important next step once the data is analysed and visualized. The important
components are retained in the dataset, and thus the data is further refined. The key decisions now are
how to model the data and which tasks are suitable for modelling. Whether a task such as
classification or regression is suitable depends upon what business value is required. Within these
tasks, many ways of modelling are available. The machine learning engineer applies various
algorithms to the data and generates the output. While modelling the data, the models are often first
tested on dummy data similar to the actual data.
7. Model Evaluation/ Monitoring
As there are various ways to model the data, it is important to decide which one is effective. For that,
the model evaluation and monitoring phase is very crucial. The model is now tested with actual data.
The data may be very limited, and in that case the output is monitored for improvement. The data
may change while the model is being evaluated or tested, and the output will change drastically
depending on the changes in data. So, while evaluating the model, it is important to monitor for such
changes (data drift).
8. Model Training
Once the task, the model, and the data drift analysis are finalized, the important step is to train the
model. The training can be done in phases, where the important parameters can be further fine-tuned
to get the required accurate output. The model is then exposed to the actual data in the production
phase and the output is monitored.
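A small training-and-evaluation sketch is given below; it uses a built-in scikit-learn dataset so it runs as-is, with a hold-out split and cross-validation standing in for the monitoring described above.

```python
# Train a model, evaluate it on held-out data, and cross-validate it.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)        # training phase
print("hold-out R^2:", model.score(X_test, y_test))   # evaluation on unseen data
print("5-fold CV R^2:",
      cross_val_score(model, X_train, y_train, cv=5).mean())
```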
9. Model Deployment
Once the model is trained with the actual data and the parameters are fine-tuned, the model is
deployed. Now the model is exposed to real-time data flowing into the system and output is generated.
The model can be deployed as a web service or as an embedded application in an edge or mobile
application. This is a very important step, as the model is now exposed to the real world.

10. Deriving insights and generating BI reports
After model deployment in the real world, the next step is to find out how the model is behaving in a
real-world scenario. The model is used to get the insights which aid in strategic decisions related to
the business. The business goals are bound to these insights. Various reports are generated to see how
the business is doing; these reports help in finding out whether the key performance indicators are
achieved or not.
11. Taking a decision based on insight
For data science to work wonders, every step indicated above has to be done very carefully and
accurately. When the steps are followed properly, the reports generated in the above step help in
taking key decisions for the organization. The insights generated help in taking strategic decisions;
for example, the organization can predict that there will be a need for raw material in advance. Data
science can be of great help in taking many important decisions related to business growth and better
revenue generation.
7. Application of data Science in various Field
1. In Search Engines
The most useful application of Data Science is search engines. As we know, when we want to search
for something on the internet, we mostly use search engines like Google, Yahoo, etc. Data Science is
used to make searches faster.
For example, when we search for something, say "Data Structure and algorithm courses", the first
link we get is for GeeksforGeeks courses. This happens because the GeeksforGeeks website is visited
most often for information regarding Data Structure courses and computer-related subjects. This
analysis is done using Data Science, and we get the most-visited web links at the top.
2. In Transport
Data Science has also entered the transport field, for example through driverless cars. With the help
of driverless cars, it is easy to reduce the number of accidents.
For example, in driverless cars the training data is fed to the algorithm, and with the help of Data
Science techniques the data is analyzed, such as what the speed limit is on highways, busy streets, or
narrow roads, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry. Financial industries always have issues of
fraud and risk of losses, so they need to automate risk-of-loss analysis in order to carry out strategic
decisions for the company. Financial industries also use Data Science analytics tools to predict the
future: it allows companies to predict customer lifetime value and their stock market moves.
For example, in the stock market Data Science plays the main part. Data Science is used to examine
past behavior with past data, with the goal of predicting the future outcome. Data is analyzed in such
a way that it becomes possible to predict future stock prices over a set timetable.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to provide a better user
experience with personalized recommendations.
For example, when we search for something on e-commerce websites, we get suggestions similar to
our choices based on our past data, and we also get recommendations based on the most-bought,
most-rated, and most-searched products. This is all done with the help of Data Science.
5. In Health Care
Data Science is used for detecting tumors, drug discovery, medical image analysis, virtual medical
bots, genetics and genomics, predictive modeling for diagnosis, etc.
6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a photo
with a friend on Facebook, Facebook suggests tagging who is in the picture. This is done with the
help of machine learning and Data Science. When an image is recognized, data analysis is done on
one's Facebook friends, and if a face present in the picture matches someone's profile, then Facebook
suggests auto-tagging.
7. Targeting Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever the
user searches for on the internet, he or she will then see related posts everywhere. Suppose I want a
mobile phone, so I search for it on Google, and after that I change my mind and decide to buy it
offline. Data Science helps the companies that pay for advertisements for that mobile phone: all over
the internet, on social media, on websites, and in apps, I will see recommendations for the mobile
phone that I searched for, which nudges me to buy it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also growing; for example, it becomes easy to
predict flight delays. It also helps to decide whether to fly directly to the destination or take a halt in
between: a flight can take a direct route from Delhi to the U.S.A., or it can halt in between before
reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used
together with machine learning: with the help of past data, the computer improves its performance.
Many games, such as chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating medicine is very difficult and time-consuming and has to be done with full
discipline, because it is a matter of someone's life. Without Data Science it takes a lot of time,
resources, and finance to develop a new medicine or drug, but with the help of Data Science it
becomes easier, because the prediction of the success rate can be determined based on biological data
and factors. Algorithms based on data science can forecast how a compound will react in the human
body without lab experiments.
11. In Delivery Logistics
Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science helps
these companies to find the best route for the Shipment of their Products, the best time suited for
delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of Data Science, where the user types just a
few letters or words and gets the rest of the line completed automatically. In Gmail, when we are
writing a formal mail to someone, the data science concepts behind the autocomplete feature suggest
an efficient way to complete the whole line. The autocomplete feature is also widely used in search
engines, in social media, and in various apps.
13. Augmented Reality:
Data Science and virtual reality do have a relationship, considering that a VR headset combines
computing knowledge, algorithms, and data to provide you with the best viewing experience. A very
small step towards this is the high-trending game Pokémon GO.
8. Data Security Issues
What is Data Security?
Data security is the process of protecting corporate data and preventing data loss through unauthorized
access. This includes protecting your data from attacks that can encrypt or destroy data, such as
ransomware, as well as attacks that can modify or corrupt your data. Data security also ensures data is
available to anyone in the organization who has access to it.
Why is Data Security important?
Data is a valuable asset that any company generates, acquires, saves, and exchanges. Protecting it
from internal or external corruption and illegal access protects a company from financial loss,
reputational harm, degradation of consumer trust, and brand erosion.
Main elements of Data Security

 Confidentiality: Ensures that only authorized users, with appropriate credentials, have access to data.
 Integrity: Ensures that all data is accurate, trustworthy, and not prone to unjustified changes.
 Availability: Ensures that data is accessible and available for ongoing business needs in a timely and
secure manner.
Data Privacy:
There are two main aspects to enforcing data privacy:
Access control—ensuring that anyone who tries to access the data is authenticated to confirm their
identity, and authorized to access only the data they are allowed to access.
Data protection—ensuring that even if unauthorized parties manage to access the data, they cannot
view it or cause damage to it. Data protection methods include encryption, which prevents anyone
from viewing data if they do not have a private encryption key, and data loss prevention mechanisms,
which prevent users from transferring sensitive data outside the organization.
The primary difference is that data privacy mainly focuses on keeping data confidential, while data
security mainly focuses on protecting from malicious activity.
Differentiate between Data Privacy and Data Security
 Data privacy is concerned with what data is important and why; data security is concerned with
how those policies get enforced.
 Data privacy sets out the proper usage, collection, retention, deletion, and storage of data; data
security sets the policies, methods, and means to secure personal data.
 Data privacy offers to block websites, internet browsers, cable companies, and internet service
providers from tracking your information and your browser history; data security offers to protect
you from other people accessing your personal information and other data.
 Data privacy tools include browser extensions and add-ons, password managers, private browsers
and email services, encrypted messaging, private search engines, web proxies, file encryption
software, and ad and tracker blockers; data security tools involve identity and access management,
data loss prevention, anti-malware, anti-virus, event management, and data masking software.
 For example, the European Union's General Data Protection Regulation (GDPR) is an
international standard for protecting the privacy of EU citizens, while the Payment Card Industry
Data Security Standard (PCI DSS) is a set of rules that protect sensitive payment card information
and cardholder data.

Data Security Risks


1. Accidental Exposure
A large percentage of data breaches are not the result of a malicious attack but are caused by negligent or
accidental exposure of sensitive data. It is common for an organization’s employees to share, grant access
to, lose, or mishandle valuable data, either by accident or because they are not aware of security policies.

This major problem can be addressed by employee training, but also by other measures, such as data loss
prevention (DLP) technology and improved access controls.
A data breach or data leak is the release of sensitive, confidential or protected data to an untrusted
environment.
2. Phishing and Other Social Engineering Attacks
Social engineering attacks are a primary vector used by attackers to access sensitive data. They involve
manipulating or tricking individuals into providing private information or access to privileged accounts.
Phishing is a common form of social engineering. It involves messages that appear to be from a trusted
source, but in fact are sent by an attacker. When victims comply, for example by providing private
information or clicking a malicious link, attackers can compromise their device or gain access to a
corporate network.
3. Insider Threats
Insider threats are employees who inadvertently or intentionally threaten the security of an organization’s
data. There are three types of insider threats:
Non-malicious insider—these are users that can cause harm accidentally, via negligence, or because they
are unaware of security procedures.
Malicious insider—these are users who actively attempt to steal data or cause harm to the organization
for personal gain.
Compromised insider—these are users who are not aware that their accounts or credentials were
compromised by an external attacker. The attacker can then perform malicious activity, pretending to be a
legitimate user.
4. Ransomware
Ransomware is a major threat to data in companies of all sizes. Ransomware is malware that infects
corporate devices and encrypts data, making it useless without the decryption key. Attackers display a
ransom message asking for payment to release the key, but in many cases, even paying the ransom is
ineffective and the data is lost.
If an organization does not maintain regular backups, or if the ransomware manages to infect the backup
servers, there may be no way to recover.
5. Data Loss in the Cloud
Many organizations are moving data to the cloud to facilitate easier sharing and collaboration. However,
when data moves to the cloud, it is more difficult to control and prevent data loss. Users access data from
personal devices and over unsecured networks. It is all too easy to share a file with unauthorized parties,
either accidentally or maliciously.
6. SQL Injection
SQL injection (SQLi) is a common technique used by attackers to gain illicit access to databases, steal
data, and perform unwanted operations. It works by adding malicious code to a seemingly innocent
database query.
SQL injection manipulates SQL code by adding special characters to a user input that change the context
of the query. The database expects to process a user input, but instead starts processing malicious code
that advances the attacker’s goals. SQL injection can expose customer data, intellectual property, or give
attackers administrative access to a database, which can have severe consequences.
SQL injection vulnerabilities are typically the result of insecure coding practices. It is relatively easy to
prevent SQL injection if coders use secure mechanisms for accepting user inputs, which are available in
all modern database systems.
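The standard coding defence is a parameterized query, sketched below with Python's built-in sqlite3 driver; the table and the injection string are made up for the example.

```python
# Parameterized queries keep user input from being interpreted as SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

user_input = "alice' OR '1'='1"   # a typical injection attempt

# UNSAFE (do not do this): string concatenation lets input rewrite the query.
# query = "SELECT email FROM users WHERE name = '" + user_input + "'"

# SAFE: the driver treats the input strictly as a value, never as SQL code.
rows = conn.execute("SELECT email FROM users WHERE name = ?",
                    (user_input,)).fetchall()
print(rows)   # [] -- the injection attempt matches nothing
```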
Common Data Security Solutions and Techniques
There are several technologies and practices that can improve data security.
1. Data Discovery and Classification
Modern IT environments store data on servers, endpoints, and cloud systems. Visibility over data flows is
an important first step in understanding what data is at risk of being stolen or misused. To properly
protect your data, you need to know the type of data, where it is, and what it is used for. Data
discovery and classification tools can help.
Data detection is the basis for knowing what data you have. Data classification allows you to create
scalable security solutions, by identifying which data is sensitive and needs to be secured. Data detection
and classification solutions enable tagging files on endpoints, file servers, and cloud storage systems,
letting you visualize data across the enterprise, to apply the appropriate security policies.
2. Data Masking
Data masking lets you create a synthetic version of your organizational data, which you can use for
software testing, training, and other purposes that don’t require the real data. The goal is to protect data
while providing a functional alternative when needed.
Data masking retains the data type, but changes the values. Data can be modified in a number of ways,
including encryption, character shuffling and character or word substitution. Whichever method you
choose, you must change the values in a way that cannot be reverse-engineered.
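A small, purely illustrative masking sketch is shown below: character shuffling for names and a one-way hash substitution for emails, with every value in the table made up.

```python
# Masking a made-up customer table: keep the data type, change the values.
import hashlib
import random

import pandas as pd

customers = pd.DataFrame({
    "name":  ["Asha Rao", "Binod Das", "Carol Paul"],
    "email": ["asha@x.com", "binod@y.com", "carol@z.com"],
})

def shuffle_chars(text):                      # character shuffling
    chars = list(text)
    random.shuffle(chars)
    return "".join(chars)

masked = customers.copy()
masked["name"] = masked["name"].apply(shuffle_chars)
masked["email"] = masked["email"].apply(      # irreversible substitution
    lambda e: hashlib.sha256(e.encode()).hexdigest()[:12] + "@masked.test")
print(masked)   # safe to hand to testing or training environments
```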
3. Identity Access Management
Identity and Access Management (IAM) is a business process, strategy, and technical framework that
enables organizations to manage digital identities. IAM solutions allow IT administrators to control user
access to sensitive information within an organization.
Systems used for IAM include single sign-on systems, two-factor authentication, multi-factor
authentication, and privileged access management. These technologies enable the organization to securely
store identity and profile data, and support governance, ensuring that the appropriate access policies are
applied to each part of the infrastructure.
4. Data Encryption
Data encryption is a method of converting data from a readable format (plaintext) to an unreadable
encoded format (ciphertext). The data can be read or processed only after it has been decrypted using
the decryption key.
In public-key cryptography, there is no need to share a secret decryption key: the sender encrypts
the data with the recipient's public key, and only the recipient's corresponding private key can
decrypt it. This is inherently more secure.
Data encryption can prevent hackers from accessing sensitive information. It is essential for most security
strategies and is explicitly required by many compliance standards.
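As a minimal illustration, the sketch below uses Fernet from the third-party cryptography package (a symmetric, key-based scheme rather than public-key encryption); install it with pip install cryptography before running.

```python
# Symmetric encryption and decryption of a sensitive string with Fernet.
from cryptography.fernet import Fernet

key = Fernet.generate_key()     # keep this key secret and store it safely
cipher = Fernet(key)

plaintext = b"customer card number: 4111 1111 1111 1111"
ciphertext = cipher.encrypt(plaintext)   # unreadable without the key
print(ciphertext)

print(cipher.decrypt(ciphertext))        # only the key holder can recover it
```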
5. Data Loss Prevention (DLP)
To prevent data loss, organizations can use a number of safeguards, including backing up data to another
location. Physical redundancy can help protect data from natural disasters, outages, or attacks on local
servers. Redundancy can be performed within a local data center, or by replicating data to a remote site or
cloud environment.
Beyond basic measures like backup, DLP software solutions can help protect organizational data. DLP
software automatically analyzes content to identify sensitive data, enabling central control and
enforcement of data protection policies, and alerting in real-time when it detects anomalous use of
sensitive data, for example, large quantities of data copied outside the corporate network.
6. Governance, Risk, and Compliance (GRC)
GRC is a methodology that can help improve data security and compliance:
Governance creates controls and policies enforced throughout an organization to ensure compliance and
data protection.
Risk involves assessing potential cybersecurity threats and ensuring the organization is prepared for them.
Compliance ensures organizational practices are in line with regulatory and industry standards when
processing, accessing, and using data.
7. Password Hygiene
One of the simplest best practices for data security is ensuring users have unique, strong passwords.
Without central management and enforcement, many users will use easily guessable passwords or use the

same password for many different services. Password spraying and other brute force attacks can easily
compromise accounts with weak passwords.
A simple measure is enforcing longer passwords and asking users to change passwords frequently.
However, these measures are not enough, and organizations should consider multi-factor authentication
(MFA) solutions that require users to identify themselves with a token or device they own, or via
biometric means.
Another complementary solution is an enterprise password manager that stores employee passwords
in encrypted form, reducing the burden of remembering passwords for multiple corporate systems and
making it easier to use stronger passwords. However, the password manager itself can become a
security vulnerability for the organization.
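Related to password hygiene on the server side, passwords should be stored only as salted, slow hashes; a minimal sketch using PBKDF2 from Python's standard library follows (parameters such as the iteration count are illustrative).

```python
# Storing and verifying passwords as salted PBKDF2 hashes (never plain text).
import hashlib
import hmac
import os

ITERATIONS = 600_000   # illustrative work factor; tune for your hardware

def hash_password(password: str):
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("password123", salt, digest))                   # False
```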
8. Authentication and Authorization
Organizations must put in place strong authentication methods, such as OAuth 2.0 for web-based systems.
It is highly recommended to enforce multi-factor authentication when any user, whether internal or
external, requests sensitive or personal data.
In addition, organizations must have a clear authorization framework in place, which ensures that each
user has exactly the access rights they need to perform a function or consume a service, and no more.
Periodic reviews and automated tools should be used to clean up permissions and remove authorization
for users who no longer need them.
Authentication verifies the identity of a user or service, and authorization determines their access
rights.
Comparing these processes to a real-world example, when you go through security in an airport, you
show your ID to authenticate your identity. Then, when you arrive at the gate, you present your boarding
pass to the flight attendant, so they can authorize you to board your flight and allow access to the plane.
9. Data Security Audits
The organization should perform security audits at least every few months. This identifies gaps and
vulnerabilities across the organization's security posture. It is a good idea to perform the audit via a third-
party expert, for example in a penetration testing model. However, it is also possible to perform a security
audit in house. Most importantly, when the audit exposes security issues, the organization must devote
time and resources to address and remediate them.
10. Anti-Malware, Antivirus, and Endpoint Protection
Malware is the most common vector of modern cyberattacks, so organizations must ensure that endpoints
like employee workstations, mobile devices, servers, and cloud systems, have appropriate protection.
Endpoint protection platforms (EPP) take a more comprehensive approach to endpoint security. They
combine antivirus with a machine-learning-based analysis of anomalous behavior on the device, which
can help detect unknown attacks. Most platforms also provide endpoint detection and response (EDR)
capabilities, which help security teams identify breaches on endpoints as they happen, investigate them,
and respond by locking down and reimaging affected endpoints.
11. Zero Trust
Zero trust is a security model introduced by Forrester analyst John Kindervag, which has been adopted by
the US government, several technical standards bodies, and many of the world’s largest technology
companies. The basic principle of zero trust is that no entity on a network should be trusted, regardless of
whether it is outside or inside the network perimeter.
Zero trust has a special focus on data security, because data is the primary asset attackers are interested in.
A zero trust architecture aims to protect data against insider and outside threats by continuously verifying
all access attempts, and denying access by default.
Database Security
Database security involves protecting database management systems such as Oracle, SQL Server, or
MySQL, from unauthorized use and malicious cyberattacks. The main elements protected by database
security are:
 The database management system (DBMS).

 Data stored in the database.
 Applications associated with the DBMS.
 The physical or virtual database server and any underlying hardware.
 Any computing and network infrastructure used to access the database.
A database security strategy involves tools, processes, and methodologies to securely configure and
maintain security inside a database environment and protect databases from intrusion, misuse, and
damage.
Big Data Security
Big data security involves practices and tools used to protect large datasets and data analysis processes.
Big data commonly takes the form of financial logs, healthcare data, data lakes, archives, and business
intelligence datasets.
Big data security aims to prevent accidental and intentional breaches, leaks, losses, and exfiltration of
large amounts of data. Let’s review popular big data services and see the main strategies for securing
them.
AWS Big Data
AWS offers analytics solutions for big data implementations. There are various services AWS offers to
automate data analysis, manipulate datasets, and derive insights, including Amazon Simple Storage
Service (S3), Amazon Kinesis, Amazon Elastic Map/Reduce (EMR), and Amazon Glue.
AWS big data security best practices include:
 Access policy options—use access policy options to manage access to your S3 resources.
 Data encryption policy—use Amazon S3 and AWS KMS for encryption management.
 Manage data with object tagging—categorize and manage S3 data assets using tags, and apply tags
indicating sensitive data that requires special security measures.
Azure Big Data
Microsoft Azure cloud offers big data and analytics services that can process a high volume of structured
and unstructured data. The platform offers elastic storage using Azure storage services, real-time
analytics, database services, as well as machine learning and data engineering solutions.
Azure big data security best practices include:
 Monitor as many processes as possible.
 Leverage Azure Monitor and Log Analytics to gain visibility over data flows.
 Define and enforce a security and privacy policy.
 Leverage Azure services for backup, restore, and disaster recovery.
Google Cloud Big Data
The Google Cloud Platform offers multiple services that support big data storage and analysis. BigQuery
is a high-performance SQL-compatible engine, which can perform analysis on large data volumes in
seconds. Additional services include Dataflow, Dataproc, and Data Fusion.
Google Cloud big data security best practices include:
 Define BigQuery access controls according to the least privilege principle.
 Use policy tags or type-based classification to identify sensitive data.
Snowflake
Snowflake is a cloud data warehouse for enterprises, built for high performance big data analytics. The
architecture of Snowflake physically separates compute and storage, while integrating them logically.
Snowflake offers full relational database support and can work with structured and semi-structured data.
Snowflake security best practices include:
 Leverage key pair authentication and rotation to improve client authentication security.
 Enable multi-factor authentication.
Elasticsearch
Elasticsearch is an open-source full-text search and analytics engine that is highly scalable, allowing
search and analytics on big data in real-time. Elasticsearch security best practices include:
 Use strong passwords to protect access to search clusters

 Encrypt all communications using SSL (Secure Sockets Layer) / TLS (Transport Layer Security)
 Use IP (Internet Protocol) filtering for client access
 Turn on auditing and monitor logs on a regular basis
Securing Data in Enterprise Applications
Enterprise applications power mission critical operations in organizations of all sizes. Enterprise
application security aims to protect enterprise applications from external attacks, abuse of authority, and
data theft.
Email Security
Email security is the process of ensuring the availability, integrity, and reliability of email
communications by protecting them from cyber threats.
Technical standards bodies have recommended email security protocols including SSL/TLS, Sender
Policy Framework (SPF), and DomainKeys Identified Mail (DKIM). These protocols are implemented by
email clients and servers, including Microsoft Exchange and Google G Suite, to ensure secure delivery of
emails. A secure email gateway helps organizations and individuals protect their email from a variety of
threats, in addition to implementing security protocols.
ERP Security
Enterprise Resource Planning (ERP) is software designed to manage and integrate the functions of core
business processes such as finance, human resources, supply chain, and inventory management into one
system. ERP systems store highly sensitive information and are, by definition, a mission critical system.
ERP security is a broad set of measures designed to protect an ERP system from unauthorized access and
ensure the accessibility and integrity of system data. The Information Systems Audit and Control
Association (ISACA) recommends regularly performing security assessments of ERP systems, including
software vulnerabilities, misconfigurations, separation of duties (SoD) conflicts, and compliance with
vendor security recommendations.
DAM Security
Digital Asset Management (DAM) is a technology platform and business process for organizing, storing,
and acquiring rich media and managing digital rights and licenses. Rich media assets include photos,
music, videos, animations, podcasts, and other multimedia content. Data stored in DAM systems is
sensitive because it often represents company IP, and is used in critical processes like sales, marketing,
and delivery of media to viewers and web visitors.
Security best practices for DAM include:
 Implement the principle of least privilege.
 Use multi-factor authentication to control access by third parties.
 Regularly review automation scripts, limit privileges of commands used, and control the automation
process through logging and alerting.
CRM Security
Customer Relationship Management (CRM) is a combination of practices, strategies, and technologies
that businesses use to manage and analyze customer interactions and data throughout the customer
lifecycle. CRM data is highly sensitive because it can expose an organization’s most valuable asset—
customer relationships.
Security best practices for CRM include:
 Perform periodic IT risk assessment audits for CRM systems.
 Perform CRM activity monitoring to identify unusual or suspicious usage.
 Encourage CRM administrators to follow security best practices.
 Educate CRM users on security best practices.
9. Data collection strategies
Data collection is the process of gathering, measuring, and analyzing accurate data from a variety of
relevant sources to find answers to research problems, answer questions, evaluate outcomes, and forecast
trends and probabilities.

Why is Data Collection important?
 Trustworthiness of the research – A critical purpose of data collection, whether via quantitative or qualitative techniques, is to guarantee that the integrity of the research question is maintained.
 Reduce the likelihood of errors – The right use of suitable data collection strategies decreases the likelihood of errors during the different research processes.
 Effective and accurate decision making – To limit the danger of errors in decision making, it is important that precise data is gathered so that researchers do not make uninformed choices.
 Save cost and time – Data collection plays a significant role in saving time and money that would otherwise be squandered without a deeper comprehension of the topic.
 Empowers a new idea or change – To demonstrate the requirement for an adjustment or a new change, it is critical to collect data and information as evidence to support the case.
Depending on the type of data, the data collection method is divided into two categories namely,
 Primary Data Collection methods
 Secondary Data Collection methods
Primary Data Collection Methods
Primary data or raw data is a type of information that is obtained directly from the first-hand source
through experiments, surveys or observations. The primary data collection method is further classified
into two types. They are
 Quantitative Data Collection Methods
 Qualitative Data Collection Methods
Quantitative Data Collection Methods
It is based on mathematical calculations using various formats like close-ended questions, correlation and
regression methods, mean, median or mode measures. This method is cheaper than qualitative data
collection methods and it can be applied in a short duration of time.
Qualitative Data Collection Methods
It does not involve any mathematical calculations. This method is closely associated with elements that
are not quantifiable. This qualitative data collection method includes interviews, questionnaires,
observations, case studies, etc. There are several methods to collect this type of data. They are
Observation Method
Observation method is used when the study relates to behavioural science. This method is planned
systematically. It is subject to many controls and checks. The different types of observations are:
 Structured and unstructured observation
 Controlled and uncontrolled observation
 Participant, non-participant and disguised observation
Interview Method
The method of collecting data in terms of verbal responses. It is achieved in two ways, such as
 Personal Interview – In this method, a person known as an interviewer is required to ask
questions face to face to the other person. The personal interview can be structured or
unstructured, direct investigation, focused conversation, etc.
 Telephonic Interview – In this method, an interviewer obtains information by contacting people
on the telephone to ask the questions or views, verbally.
Questionnaire Method
In this method, a set of questions is mailed to the respondents, who should read, reply and subsequently return the questionnaire. The questions are printed in a definite order on the form. A good survey should have the following features:
 Short and simple
 Should follow a logical sequence
 Provide adequate space for answers
 Avoid technical terms
Projective Technique

Projective data gathering is an indirect interview, used when potential respondents know why they're
being asked questions and hesitate to answer. For instance, someone may be reluctant to answer questions
about their phone service if a cell phone carrier representative poses the questions. With projective data
gathering, the interviewees get an incomplete question, and they must fill in the rest, using their opinions,
feelings, and attitudes.
Delphi Technique
The Oracle at Delphi, according to Greek mythology, was the high priestess of Apollo’s temple, who
gave advice, prophecies, and counsel. In the realm of data collection, researchers use the Delphi technique
by gathering information from a panel of experts. Each expert answers questions in their field of
specialty, and the replies are consolidated into a single opinion.
Focus Groups
Focus groups, like interviews, are a commonly used technique. The group consists of anywhere from a
half-dozen to a dozen people, led by a moderator, brought together to discuss the issue.
Schedules
This method is similar to the questionnaire method with a slight difference. Enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objects of the investigation and may remove misunderstandings, if any have come up. Enumerators should be trained to perform their job with diligence and patience.
Secondary Data Collection Methods
Secondary data is data collected by someone other than the actual user. It means that the information is
already available, and someone analyses it. The secondary data includes magazines, newspapers, books,
journals, etc. It may be either published data or unpublished data.
Published data are available in various resources including
 Government publications, Public records, Historical and statistical documents
 Business documents, Technical and trade journals
Unpublished data includes
 Diaries, Letters, Unpublished biographies, etc.
Data Collection Tools
Now that we’ve explained the various techniques, let’s narrow our focus even further by looking at some
specific tools. For example, we mentioned interviews as a technique, but we can further break that down
into different interview types (or "tools").
1. Word Association
The researcher gives the respondent a set of words and asks them what comes to mind when they hear
each word.
2. Sentence Completion
Researchers use sentence completion to understand what kind of ideas the respondent has. This tool
involves giving an incomplete sentence and seeing how the interviewee finishes it.
3. Role-Playing
Respondents are presented with an imaginary situation and asked how they would act or react if it was
real.
4. In-Person Surveys
The researcher asks questions in person.
5. Online/Web Surveys
These surveys are easy to accomplish, but some users may be unwilling to answer truthfully, if at all.
6. Mobile Surveys
These surveys take advantage of the increasing proliferation of mobile technology. Mobile collection
surveys rely on mobile devices like tablets or smartphones to conduct surveys via SMS or mobile apps.
7. Phone Surveys
No researcher can call thousands of people at once, so they need a third party to handle the chore.
However, many people have call screening and won’t answer.

8. Observation
Sometimes, the simplest method is the best. Researchers who make direct observations collect data
quickly and easily, with little intrusion or third-party bias. Naturally, it’s only effective in small-scale
situations.
9. Photography and video: Photographs and videos show still or moving images. Photographs can be
used on their own, but are more often accompanied by written captions, providing additional information.
Videos are often accompanied by a commentary.
10. Focus group discussions: Focus group discussions (FGDs) are facilitated discussions, held with a
small group of people who have specialist knowledge or interest in a particular topic. They are used to
find out the perceptions and attitudes of a defined group of people. FGDs are typically carried out with
around 6-12 people, and are based around a short list of guiding questions, designed to probe for in-depth
information.
11. Case studies and stories of change: A case study is not a data collection tool in itself. It is a
descriptive piece of work that can provide in-depth information on a topic. It is often based on
information acquired through one or more of the other tools described in this paper, such as interviews or
observation. Case studies are usually written, but can also be presented as photographs, films or videos.
Case studies often focus on people (individuals, households, communities). But they can also focus on
any other unit of analysis such as locations, organisations, policies or the environment. Stories of change
are similar to case studies.
12. Surveys and questionnaires: These are designed to collect and record information from many people,
groups or organisations in a consistent way. A questionnaire is a form containing questions. It may be a
printed form or one designed to be filled in online. Questionnaires may be administered in many different
ways. A survey, by contrast, is normally a large, formal exercise. It typically consists of three different
aspects: an approved sampling method designed to ensure the survey is representative of a wider
population; a standard questionnaire that ensures information is collected and recorded consistently; and a
set of analysis methods that allow results and findings to be generated.
What are Common Challenges in Data Collection?
There are some prevalent challenges faced while collecting data, let us explore a few of them to
understand them better and avoid them.
1. Data Quality Issues
The main threat to the broad and successful application of machine learning is poor data quality. Data
quality must be your top priority if you want to make technologies like machine learning work for you.
Let's talk about some of the most prevalent data quality problems and how to fix them.
2. Inconsistent Data
When working with various data sources, it's conceivable that the same information will have
discrepancies between sources. The differences could be in formats, units, or occasionally spellings. The
introduction of inconsistent data might also occur during firm mergers or relocations. Inconsistencies in
data have a tendency to accumulate and reduce the value of data if they are not continually resolved.
Organizations that have heavily focused on data consistency do so because they only want reliable data to
support their analytics.
3. Data Downtime
Data is the driving force behind the decisions and operations of data-driven businesses. However, there
may be brief periods when their data is unreliable or not prepared. Customer complaints and subpar
analytical outcomes are only two ways that this data unavailability can have a significant impact on
businesses. A data engineer spends about 80% of their time updating, maintaining, and guaranteeing the
integrity of the data pipeline. In order to ask the next business question, there is a high marginal cost due
to the lengthy operational lead time from data capture to insight.
Schema modifications and migration problems are just two examples of the causes of data downtime.
Data pipelines can be difficult due to their size and complexity. Data downtime must be continuously
monitored, and it must be reduced through automation.

4. Ambiguous Data
Even with thorough oversight, some errors can still occur in massive databases or data lakes. For data
streaming at a fast speed, the issue becomes more overwhelming. Spelling mistakes can go unnoticed,
formatting difficulties can occur, and column heads might be deceptive. This unclear data might cause a
number of problems for reporting and analytics.
5. Duplicate Data
Streaming data, local databases, and cloud data lakes are just a few of the sources of data that modern
enterprises must contend with. They might also have application and system silos. These sources are
likely to duplicate and overlap each other quite a bit. For instance, duplicate contact information has a
substantial impact on customer experience. If certain prospects are ignored while others are engaged
repeatedly, marketing campaigns suffer. The likelihood of biased analytical outcomes increases when
duplicate data are present. It can also result in ML models with biased training data.
6. Too Much Data
While we emphasize data-driven analytics and its advantages, a data quality problem with excessive data
exists. There is a risk of getting lost in an abundance of data when searching for information pertinent to
your analytical efforts. Data scientists, data analysts, and business users devote 80% of their work to
finding and organizing the appropriate data. With an increase in data volume, other problems with data
quality become more serious, particularly when dealing with streaming data and big files or databases.
7. Inaccurate Data
For highly regulated businesses like healthcare, data accuracy is crucial. Given the current experience, it
is more important than ever to increase the data quality for COVID-19 and later pandemics. Inaccurate
information does not provide you with a true picture of the situation and cannot be used to plan the best
course of action. Personalized customer experiences and marketing strategies underperform if your
customer data is inaccurate.
8. Hidden Data
The majority of businesses only utilize a portion of their data, with the remainder sometimes being lost in
data silos or discarded in data graveyards. For instance, the customer service team might not receive client
data from sales, missing an opportunity to build more precise and comprehensive customer profiles.
Missing out on possibilities to develop novel products, enhance services, and streamline procedures is
caused by hidden data.
9. Finding Relevant Data
Finding relevant data is not so easy. There are several factors that we need to consider while trying to find relevant data, which include:
 Relevant domain
 Relevant demographics
 Relevant time period
and many other factors besides.
Data that is not relevant to our study in terms of any of these factors is rendered obsolete, and we cannot effectively
proceed with its analysis. This could lead to incomplete research or analysis, re-collecting data again and
again, or shutting down the study.
10. Deciding the Data to Collect
Determining what data to collect is one of the most important decisions in the data collection process and should be made at the outset. We must choose the subjects the data will cover, the sources we will use to gather it, and the quantity of information we will require. Our responses to these queries will depend on our aims, or what we expect to achieve utilizing our data. As an illustration, we may choose to gather information on the categories of articles that website visitors between the ages of 20 and 50 most frequently access. We can also decide to compile data on the typical age of all the clients who made a purchase from our business over the previous month.

What are the Key Steps in the Data Collection Process?
1. Decide What Data You Want to Gather
The first thing that we need to do is decide what information we want to gather. We must choose the
subjects the data will cover, the sources we will use to gather it, and the quantity of information that we
would require. For instance, we may choose to gather information on the categories of products that an
average e-commerce website visitor between the ages of 30 and 45 most frequently searches for.
2. Establish a Deadline for Data Collection
The process of creating a strategy for data collection can now begin. We should set a deadline for our data
collection at the outset of our planning phase. Some forms of data we might want to continuously collect.
We might want to build up a technique for tracking transactional data and website visitor statistics over
the long term, for instance. However, we will track the data throughout a certain time frame if we are
tracking it for a particular campaign. In these situations, we will have a schedule for when we will begin
and finish gathering data.
3. Select a Data Collection Approach
We will select the data collection technique that will serve as the foundation of our data gathering plan at
this stage. We must take into account the type of information that we wish to gather, the time period
during which we will receive it, and the other factors we decide on to choose the best gathering strategy.
4. Gather Information
Once our plan is complete, we can put our data collection plan into action and begin gathering data. In our
DMP, we can store and arrange our data. We need to be careful to follow our plan and keep an eye on
how it's doing. Especially if we are collecting data regularly, setting up a timetable for when we will be
checking in on how our data gathering is going may be helpful. As circumstances alter and we learn new
details, we might need to amend our plan.
5. Examine the Information and Apply Your Findings
It's time to examine our data and arrange our findings after we have gathered all of our information. The
analysis stage is essential because it transforms unprocessed data into insightful knowledge that can be
applied to better our marketing plans, goods, and business judgments. The analytics tools included in our
DMP can be used to assist with this phase. We can put the discoveries to use to enhance our business
once we have discovered the patterns and insights in our data.
10. Data Categorization: NOIR Topology
NOIR refers to the four scales of measurement: Nominal, Ordinal, Interval, and Ratio.

Nominal Scale of Measurement
A nominal scale of measurement is used for qualitative data. It does not give any numerical meaning to
the data. Using the nominal scale of measurement, the data can be classified but cannot be added,
subtracted, multiplied, or divided. It can cover a wide variety of qualitative data. Some of the situations
where nominal measurement scale can be used are given below:
 Study to find the country of birth of people in a town
 In collecting data on the eye color of people
 Classifying people into categories like male/female, working-class population/unemployed,
vaccinated/unvaccinated people, etc.
Some of the properties of the nominal scale of measurement are given below:
 It can categorize variables but does not put them in any order.
 It does not show any numerical value.
 It is used for qualitative data.
Binary scale
A nominal variable with exactly two mutually exclusive categories that have no logical order is known as a binary variable. Examples: Switch: {ON, OFF}, Attendance: {True, False}, Entry: {Yes, No}
A Binary variable is a special case of a nominal variable that takes only two possible values.
Symmetric and Asymmetric Binary Scale
Different binary variables may have unequal importance.
If two choices of a binary variable have equal importance, then it is called symmetric binary variable.
Example: Gender = {male, female} // usually of equal probability.
If the two choices of a binary variable have unequal importance, it is called asymmetric binary variable.
Example: Food preference = {V, NV}
Ordinal Scale of Measurement
The ordinal scale of measurement groups the data into order or rank. It contains the property of nominal
scale as well, which is to classify data variables into specific labels. And in addition to that, it organizes
data into groups though it does not have any numerical value. For example, the study of people's
satisfaction with a company's product on a scale of #1 - Very happy, #2 - satisfactory, #3 - neutral, #4 -
unhappy, and #5 - extremely dissatisfied. This is an example of an ordinal scale of measurement. This
measurement scale can be used for the following purposes:
 Ranks of players in a race.
 Data collection on variables such as hottest to coldest, richest to poorest, etc.
 Data on people's satisfaction with any product, person, or government.
Ordered nominal data are known as ordinal data and the variable that generates it is called ordinal
variable.
Example: Shirt size = {S, M, L, XL, XXL}
Some of the properties of the ordinal measurement scale are listed below:
 It displays the order or rating of the variables.
 It does not give any numerical value to the data. So, it is also used for qualitative data as similar to
nominal measurement scale.
 It contains variables that can be placed in order like heaviest to lightest, ranks of players or students,
etc.

Interval Scale of Measurement
The interval scale of measurement includes those values that can be measured in a specific interval, for
example, time, temperature, etc. It shows the order of variables with a meaningful proportion or difference between them. For example, on a temperature scale, the difference between 20 °C and 30 °C is the same as the difference between 50 °C and 60 °C. This is an example of an interval measurement scale. On the other hand, the difference between the scores of the first two rankers in a race and the difference between the two runners-up need not be the same, which is why ranks are an example of an ordinal scale.
Some of the properties of the interval scale of measurement are listed below:
 It includes the properties of both nominal and ordinal scales.
 It shows meaningful divisions between variables.
 The difference between the variables can be presented in numerical terms.
 It includes variables that can be added or subtracted from each other.
 It gives a meaning to 'Zero" which was not possible in the above two scales. For example, zero
degrees of temperature.
Ratio Scale of Measurement
The ratio scale is the most comprehensive scale among others. It includes the properties of all the above
three scales of measurement. The unique feature of the ratio scale of measurement is that it considers the
absolute value of zero, which was not the case in the interval scale. When we measure the height of the
people, 0 inches or 0 cm means that the person does not exist. On the interval scale, there are values
possible on both sides of 0, for example, temperature could be negative as well. While the ratio scale does
not include negative numbers because of its feature of showing absolute zero. An example of the ratio
measurement scale is determining the weight of people from the following options: less than 20 kgs, 20 -
40 kgs, 40 - 60 kgs, 60 - 80 kgs, and more than 80 kgs.
Some of the properties of the ratio scale of measurement are listed below:
 It is used for quantitative data.
 It shows the absolute value of zero which means if the value is 0, it's nothing.
 The variables can be added, subtracted, multiplied, or divided. In addition to these, calculation
of mean, median, and mode is also possible with this scale.
 it doesn't include negative numbers because of the feature of true zero value.
Types of Data:
Qualitative or Categorical Data: Qualitative data, also known as the categorical data, describes the data
that fits into the categories. Qualitative data are not numerical. The categorical information involves
categorical variables that describe the features such as a person’s gender, home town etc. Categorical
measures are defined in terms of natural language specifications, but not in terms of numbers.
Sometimes categorical data can hold numerical values (quantitative value), but those values do not have a
mathematical sense. Examples of the categorical data are birthdate, favourite sport, school postcode.
Here, the birthdate and school postcode hold the quantitative value, but it does not give numerical
meaning.
Quantitative or Numerical Data: Quantitative data is also known as numerical data which represents the
numerical value (i.e., how much, how often, how many). Numerical data gives information about the
quantities of a specific thing. Some examples of numerical data are height, length, size, weight, and so on.
The quantitative data can be classified into two different types based on the data sets. The two different
classifications of numerical data are discrete data and continuous data.
Discrete Data
Discrete data can take only discrete values. Discrete information contains only a finite number of possible
values. Those values cannot be subdivided meaningfully. Here, things can be counted in whole numbers.
Example: Number of students in the class
Continuous Data
Continuous data is data that can be calculated. It has an infinite number of probable values that can be
selected within a given specific range. Example: Temperature range
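As a small, hedged illustration of the NOIR categories in practice (the column names and values below are invented for this sketch), pandas lets nominal and ordinal data be declared explicitly as categorical types, while interval- and ratio-scale data stay numeric:

```python
import pandas as pd

# Hypothetical records illustrating the four NOIR measurement scales
df = pd.DataFrame({
    "eye_color":    ["brown", "blue", "green", "brown"],   # nominal
    "shirt_size":   ["S", "L", "M", "XL"],                 # ordinal
    "temp_celsius": [20.0, 30.0, -5.0, 15.5],              # interval (arbitrary zero, negatives allowed)
    "weight_kg":    [62.5, 80.2, 55.0, 90.1],              # ratio (true zero, no negatives)
})

# Nominal: categories without any order
df["eye_color"] = pd.Categorical(df["eye_color"])

# Ordinal: categories with an explicit order
df["shirt_size"] = pd.Categorical(
    df["shirt_size"], categories=["S", "M", "L", "XL", "XXL"], ordered=True
)

print(df.dtypes)
# Order-aware operations are meaningful only for the ordinal column
print(df["shirt_size"].min(), "to", df["shirt_size"].max())
```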

Unit 2
Descriptive Statistics
Statistics is the science, or a branch of mathematics, that involves collecting, classifying, analyzing,
interpreting, and presenting numerical facts and data.
The study of numerical and graphical ways to describe and display your data is called descriptive
statistics. It describes the data and helps us understand the features of the data by summarizing the
given sample set or population of data.
Types of descriptive statistics
There are 3 main types of descriptive statistics:
[1] The distribution concerns the frequency of each value.
[2] The central tendency concerns the averages of the values.
[3] The variability or dispersion concerns how spread out the values are.
Distribution (also called Frequency Distribution)
Datasets consist of a distribution of scores or values. Statisticians use graphs and tables to summarize
the frequency of every possible value of a variable, rendered in percentages or numbers.
Measures of central tendency estimate a dataset's average or center, finding the result using three
methods: mean, mode, and median.
Variability (also called Dispersion)
The measure of variability gives the statistician an idea of how spread out the responses are. The
spread has three aspects — range, standard deviation, and variance.
Research example
You want to study the popularity of different leisure activities by gender. You distribute a survey and
ask participants how many times they did each of the following in the past year:
 Go to a library
 Watch a movie at a theater
 Visit a national park
Your data set is the collection of responses to the survey. Now you can use descriptive statistics to
find out the overall frequency of each activity (distribution), the averages for each activity (central
tendency), and the spread of responses for each activity (variability).
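As a rough sketch of how these three summaries could be computed for such a survey (all numbers and column names below are hypothetical), pandas provides the pieces directly:

```python
import pandas as pd

# Hypothetical survey responses: number of times each activity was done last year
data = pd.DataFrame({
    "gender":  ["F", "M", "F", "M", "F", "M", "F", "M"],
    "library": [12, 3, 8, 5, 15, 2, 9, 4],
    "movies":  [4, 10, 6, 12, 3, 8, 5, 11],
    "parks":   [2, 1, 3, 0, 4, 2, 1, 3],
})
activities = data[["library", "movies", "parks"]]

# Distribution: frequency of each response value for one activity
print(data["library"].value_counts().sort_index())

# Central tendency: mean and median per activity, plus the mode of one column
print(activities.agg(["mean", "median"]))
print(data["library"].mode())

# Variability: range, sample variance and sample standard deviation (ddof=1 by default)
print(activities.max() - activities.min())
print(activities.var())
print(activities.std())
```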
Measures of central tendency
Advantages and disadvantages of mean, median and mode.
Mean is the most commonly used measures of central tendency. It represents the average of the given
collection of data. Median is the middle value among the observed set of values and is calculated by
arranging the values in ascending order or in descending order and then choosing the middle value.
The most frequent number occurring in the data set is known as the mode.
Measure – Advantages – Disadvantages
Mean – Takes account of all values to calculate the average. – A very small or very large value can affect the mean.
Median – The median is not affected by very large or very small values. – Since the median is an average of position, arranging the data in ascending or descending order of magnitude is time-consuming in the case of a large number of observations.
Mode – The only average that can be used if the data set is not in numbers. – There can be more than one mode, and there can also be no mode, which means the mode is not always representative of the data.
Measures of Deviation:
Mean Deviation: In statistics, deviation means the difference between the observed and expected
values of a variable. In simple words, the deviation is the distance from the centre point. The centre
point can be median, mean, or mode. Similarly, the mean deviation definition in statistics or the mean
absolute deviation is used to compute how far the values fall from the middle of the data set.
Mean Deviation, also known as Mean Absolute Deviation (MAD), is the average deviation of a data point from the data set's mean, median, or mode.
Mean Deviation Formula
For ungrouped data: Mean Deviation = Σ|xi − a| / n; for grouped data: Mean Deviation = Σ fi|xi − a| / Σ fi, where a is the chosen central value (the mean, median, or mode).
Example 1: Calculate the Mean Deviation about the median using the Data given below:
(Ungrouped Data)

Example 2:

Example 3: Find the mean deviation for the following data set. (Grouped Data)

Example 4: Find the mean deviation about the median for the following data.

*The coefficient of mean deviation is calculated by dividing mean deviation by the average.
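The worked examples above are kept as given; as a generic sketch (with a hypothetical data set and a helper function named mean_deviation), the definition and the coefficient of mean deviation can be computed as follows:

```python
import statistics

# Hypothetical ungrouped data set
data = [4, 7, 8, 9, 10, 12, 13, 17]

def mean_deviation(values, about):
    """Average absolute deviation of the values about a chosen centre."""
    return sum(abs(x - about) for x in values) / len(values)

mean_ = statistics.mean(data)      # 10.0
median_ = statistics.median(data)  # 9.5

md_mean = mean_deviation(data, mean_)      # 3.0
md_median = mean_deviation(data, median_)

print(md_mean, md_median)
# Coefficient of mean deviation = mean deviation / the average it was taken about
print(md_mean / mean_, md_median / median_)
```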
Standard Deviation and Variance
Standard deviation is the positive square root of the variance. Standard deviation is the degree of
dispersion or the scatter of the data points relative to its mean, in descriptive statistics. It tells how the
values are spread across the data sample and it is the measure of the variation of the data points from the
mean. Variance is a measure of how data points vary from the mean, whereas standard deviation is the
measure of the distribution of statistical data. The basic difference between variance and the standard
deviation is in their units. The standard deviation is represented in the same units as the mean of data,
while the variance is represented in squared units.
Population variance: σ² = Σ(xi − μ)² / N. Sample variance: s² = Σ(xi − x̅)² / (n − 1). The standard deviation is the positive square root of the corresponding variance.
Population Variance - All the members of a group are known as the population. When we want to find
how each data point in a given population varies or is spread out then we use the population variance. It is
used to give the squared distance of each data point from the population mean.
Sample Variance - If the size of the population is too large then it is difficult to take each data point into
consideration. In such a case, a select number of data points are picked up from the population to form the
sample that can describe the entire group. Thus, the sample variance can be defined as the average of the
squared distances from the mean. The variance is always calculated with respect to the sample mean.
Standard Deviation of Ungrouped Data
Example: If a die is rolled, then find the variance and standard deviation of the possibilities.
Solution: When a die is rolled, the possible number of outcomes is 6. So the sample space, n = 6 and the
data set = { 1;2;3;4;5;6}.
To find the variance, first, we need to calculate the mean of the data set.
Mean, x̅ = (1+2+3+4+5+6)/6 = 3.5
We can put the value of data and mean in the formula to get;
Variance, σ² = Σ(xi − x̅)² / n = [(1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (4 − 3.5)² + (5 − 3.5)² + (6 − 3.5)²] / 6 = 17.5 / 6 ≈ 2.92
Standard deviation, σ = √2.92 ≈ 1.71
Example: There are 39 plants in the garden. A few plants were selected randomly and their heights
in cm were recorded as follows: 51, 38, 79, 46, and 57. Calculate the standard deviation of their
heights. N=5.

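The full worked solutions are not reproduced above, but both examples can be checked with Python's statistics module. Note that the plant-height example states N = 5, i.e. a divisor of N (the population formula); the divisor N − 1 (sample formula) is also shown for comparison, since conventions differ.

```python
import statistics

# Example 1: the six faces of a die, treated as the whole population
die = [1, 2, 3, 4, 5, 6]
print(statistics.pvariance(die))   # population variance  = 17.5 / 6 ≈ 2.92
print(statistics.pstdev(die))      # population std. dev. ≈ 1.71

# Example 2: heights (cm) of plants picked from the garden
heights = [51, 38, 79, 46, 57]
print(statistics.mean(heights))    # 54.2
print(statistics.pstdev(heights))  # divisor N = 5   -> ≈ 13.88
print(statistics.stdev(heights))   # divisor N - 1   -> ≈ 15.51
```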
Standard Deviation of Grouped Data

Coefficient of Variation
Coefficient of variation is a type of relative measure of dispersion. It is expressed as the ratio of the
standard deviation to the mean. A measure of dispersion is a quantity that is used to gauge the extent of
variability of data.
Coefficient of Variation (CV) = (Standard Deviation / Mean) × 100, usually expressed as a percentage.
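A quick check of this ratio, reusing the plant heights from the earlier example (with the divisor-N standard deviation, as assumed there):

```python
import statistics

heights = [51, 38, 79, 46, 57]
mean = statistics.mean(heights)    # 54.2
sd = statistics.pstdev(heights)    # ≈ 13.88

cv = (sd / mean) * 100             # coefficient of variation as a percentage
print(round(cv, 2))                # ≈ 25.6
```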
Shape of data: Skewness and Kurtosis
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. If one tail is longer than
another, the distribution is skewed. These distributions are sometimes called asymmetric or asymmetrical
distributions as they don't show any kind of symmetry. Symmetry means that one half of the distribution is a mirror image of the other half. For example, the normal distribution is a symmetric distribution with no skew. The tails are exactly the same. A symmetrical distribution has zero skewness, as all measures of central tendency lie in the middle.
The skewness is a measure of symmetry or asymmetry of data distribution, and kurtosis measures whether
data is heavy-tailed or light-tailed in a normal distribution. Data can be positive-skewed (data-pushed
towards the right side) or negative-skewed (data-pushed towards the left side).

When data is symmetrically distributed, the left-hand side and right-hand side contain the same number of observations. (If the dataset has 90 values, then the left-hand side has 45 observations and the right-hand side has 45 observations.) But what if the data is not symmetrically distributed? Such data is called asymmetrical data, and that is where skewness comes into the picture.
Types of skewness
1. Positive skewed or right-skewed
In statistics, a positively skewed distribution is a sort of distribution where, unlike symmetrically distributed data in which all measures of central tendency (mean, median, and mode) are equal, the measures are dispersed. In a positively skewed (right-skewed) distribution, the longer tail lies on the right-hand side and, typically, mean > median > mode.

2. Negative skewed or left-skewed
A negatively skewed distribution is the straight reverse of a positively skewed distribution. In statistics, a negatively skewed distribution refers to the distribution model where more values are plotted on the right side of the graph and the tail of the distribution spreads out on the left side. In a negatively skewed (left-skewed) distribution, the mean of the data is less than the median (the bulk of the data is concentrated on the right-hand side, with a long tail to the left) and, typically, mean < median < mode.

Calculate the skewness coefficient of the sample
Pearson's first coefficient of skewness: Sk1 = (Mean − Mode) / Standard Deviation
Pearson's second coefficient of skewness: Sk2 = 3 × (Mean − Median) / Standard Deviation
(For comparison, Pearson's correlation coefficient ranges from -1, a perfect negative linear relationship, to +1, a perfect positive linear relationship, with 0 indicating no linear relationship; dividing the covariance by the standard deviations scales the value down to that limited range of -1 to +1.)
Pearson's first coefficient of skewness is helpful when the data have a pronounced mode. But if the data have a weak mode or several modes, Pearson's first coefficient is not preferred, and Pearson's second coefficient may be superior, as it does not rely on the mode.

If the skewness is between -0.5 & 0.5, the data are nearly symmetrical.
If the skewness is between -1 & -0.5 (negative skewed) or between 0.5 & 1(positive skewed), the data are
slightly skewed.
If the skewness is lower than -1 (negative skewed) or greater than 1 (positive skewed), the data are
extremely skewed.
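As a hedged sketch (the sample below is invented; exact values will differ for real data), Pearson's two skewness coefficients and the moment-based skewness reported by SciPy can be computed like this:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed sample: a few large values stretch the right tail
data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 12, 18])

mean, median, sd = data.mean(), np.median(data), data.std()   # population std (ddof=0)
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]                               # most frequent value (4 here)

sk1 = (mean - mode) / sd          # Pearson's first (mode-based) coefficient
sk2 = 3 * (mean - median) / sd    # Pearson's second (median-based) coefficient
print(sk1, sk2)                   # both positive -> right (positive) skew

print(stats.skew(data))           # moment-based skewness from SciPy, also positive here
```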

Kurtosis
Kurtosis is a statistical measure, whether the data is heavy-tailed or light-tailed in a normal distribution.
Kurtosis = m4 / m2², where m2 = Σ(xi − x̅)² / n and m4 = Σ(xi − x̅)⁴ / n are the second and fourth central moments of the data. For a normal distribution the kurtosis is 3.
In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high level
of risk for an investment because it indicates that there are high probabilities of extremely large and
extremely small returns. On the other hand, a small kurtosis signals a moderate level of risk because the
probabilities of extreme returns are relatively low.
Excess Kurtosis
The excess kurtosis is used in statistics and probability theory to compare the kurtosis coefficient with that of the normal distribution. Excess kurtosis can be positive (Leptokurtic distribution), negative (Platykurtic distribution), or near to zero (Mesokurtic distribution). Since normal distributions have a kurtosis of 3, excess kurtosis is calculated by subtracting 3 from the kurtosis.
Excess kurtosis = Kurt – 3
Types of excess kurtosis
1. Leptokurtic (kurtosis > 3)
A leptokurtic distribution has very long, heavy tails, which means there is a greater chance of outliers. Positive values of excess kurtosis indicate that the distribution is peaked and possesses thick tails. An extreme positive kurtosis indicates a distribution where more of the values are located in the tails of the distribution instead of around the mean.
2. Platykurtic (kurtosis < 3)
A platykurtic distribution has thinner tails and is stretched around the centre, which means most of the data points are in close proximity to the mean. A platykurtic distribution is flatter (less peaked) when compared with the normal distribution.
3. Mesokurtic (kurtosis = 3)
A mesokurtic distribution is the same as the normal distribution, which means its excess kurtosis is near 0. Mesokurtic distributions are moderate in breadth, and their curves have a medium peaked height.
Excess kurtosis can be positive (Leptokurtic distribution), negative (Platykurtic distribution), or
near to zero (Mesokurtic distribution).
Leptokurtic distribution (kurtosis more than normal distribution).
Mesokurtic distribution (kurtosis same as the normal distribution).
Platykurtic distribution (kurtosis less than normal distribution).
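A small sketch comparing the three shapes with simulated samples (the distributions and sample size are arbitrary choices); scipy.stats.kurtosis with fisher=True returns excess kurtosis, so roughly 0 is mesokurtic, positive is leptokurtic, and negative is platykurtic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

samples = {
    "normal (mesokurtic)":   rng.normal(size=10_000),
    "t, df=3 (leptokurtic)": rng.standard_t(df=3, size=10_000),
    "uniform (platykurtic)": rng.uniform(-1, 1, size=10_000),
}

for name, sample in samples.items():
    excess = stats.kurtosis(sample, fisher=True)   # excess kurtosis = kurtosis - 3
    print(f"{name:24s} excess kurtosis ≈ {excess:.2f}")
# Expected: ~0 for the normal sample, clearly positive for t(3), about -1.2 for uniform
```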
Calculate Population Skewness and Population Kurtosis from the following grouped data
Example-1

Karl Pearson Coefficient of Correlation


The study of the Karl Pearson Coefficient is an essential part of statistics, and many statistical analyses depend on the Karl Pearson Coefficient of Correlation method. The Karl Pearson coefficient is defined as a linear correlation coefficient that falls in the numeric range of -1 to +1. It is a quantitative method that provides a numeric value for the intensity of the linear relationship between the X and Y variables. But is it really useful for practical calculations? Let us delve into this topic to get more detailed information on the subject matter – the Karl Pearson Coefficient of Correlation.
What do You mean by Correlation Coefficient?
Before delving into details about Karl Pearson Coefficient of Correlation, it is vital to brush up on
fundamental concepts about correlation and its coefficient in general.
The correlation coefficient can be defined as a measure of the relationship between two quantitative or
qualitative variables, i.e., X and Y. It serves as a statistical tool that helps to analyze and in turn, measure
the degree of the linear relationship between the variables.
For example, a change in the monthly income (X) of a person leads to a change in their monthly
expenditure (Y). With the help of correlation, you can measure the degree up to which such a change can
impact the other variables.
Types of Correlation Coefficient
Depending on the direction of the relationship between variables, correlation can be of three types,
namely –
Positive Correlation (0 to +1)
In this case, the direction of change between X and Y is the same. For instance, an increase in the
duration of a workout leads to an increase in the number of calories one burns.
Negative Correlation (0 to -1)
Here, the direction of change between X and Y variables is opposite. For example, when the price of a
commodity increases its demand decreases.
Zero Correlation (0)
There is no relationship between the variables in this case. For instance, an increase in height has no
impact on one‘s intelligence.
What is Karl Pearson’s Coefficient of Correlation?
This method is also known as the Product Moment Correlation Coefficient and was developed by Karl
Pearson. It is one of the three most potent and extensively used methods to measure the level of
correlation, besides the Scatter Diagram and Spearman‘s Rank Correlation.
The Karl Pearson correlation coefficient method is quantitative and offers a numerical value to establish the intensity of the linear relationship between X and Y. Such a correlation coefficient is represented as 'r'.

The Karl Pearson Coefficient of Correlation formula is expressed as:
r = Σ(xi − x̅)(yi − ȳ) / √[ Σ(xi − x̅)² × Σ(yi − ȳ)² ] = Cov(X, Y) / (σx × σy)
Direct Method/ Actual Mean Method

Marks obtained by 5 students in algebra and trigonometry as given below: Calculate Karl Pearson
Coefficient of Correlation without taking deviation from mean.

r= -0.424
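The marks table from the worked example is not reproduced above, so the sketch below uses made-up marks for five students; it shows the actual-mean (direct) method alongside SciPy's built-in function, and the two results agree:

```python
import numpy as np
from scipy import stats

# Hypothetical marks of 5 students (not the data from the worked example)
algebra      = np.array([15, 18, 21, 24, 27])
trigonometry = np.array([30, 26, 27, 23, 21])

# Direct (actual mean) method:
# r = sum((x - x_bar)(y - y_bar)) / sqrt(sum((x - x_bar)^2) * sum((y - y_bar)^2))
dx = algebra - algebra.mean()
dy = trigonometry - trigonometry.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy, p_value = stats.pearsonr(algebra, trigonometry)
print(r_manual, r_scipy)   # negative here: the two sets of marks move in opposite directions
```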
Assumed Mean Method

Box Plot
When we display the data distribution in a standardized way using the 5-number summary – minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum – it is called a box plot. It is also termed a box and whisker plot. It is a type of chart that depicts a group of numerical data through their quartiles. It is a
simple way to visualize the shape of our data. It makes comparing characteristics of data between
categories very easy.
Parts of Box Plots

Minimum: The minimum value in the given dataset


First Quartile (Q1): The first quartile is the median of the lower half of the data set.

Median: The median is the middle value of the dataset, which divides the given dataset into two equal
parts. The median is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Maximum: The maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and first quartile is known as the
interquartile range. (i.e.) IQR = Q3-Q1
Outlier: The data points that fall on the far left or right side of the ordered data are tested to be outliers. Generally, outliers fall more than a specified distance from the first and third quartiles, i.e., outliers are greater than Q3 + (1.5 × IQR) or less than Q1 − (1.5 × IQR).
Positively Skewed: If the distance from the median to the maximum is greater than the distance from the
median to the minimum, then the box plot is positively skewed.
Negatively Skewed: If the distance from the median to minimum is greater than the distance from the
median to the maximum, then the box plot is negatively skewed.
Symmetric: The box plot is said to be symmetric if the median is equidistant from the maximum and
minimum values.
The median ( Q2 ) divides the data set into two parts, the upper set and the lower set. The lower
quartile ( Q1 ) is the median of the lower half, and the upper quartile ( Q3 ) is the median of the upper
half.
Example -1:
Find Q1 , Q2 , and Q3 for the following data set, and draw a box-and-whisker plot.
{2,6,7,8,8,11,12,13,14,15,22,23}
There are 12 data points. The middle two are 11 and 12. So the median, Q2 , is 11.5 .
The "lower half" of the data set is the set {2,6,7,8,8,11}. The median here is 7.5 So Q1=7.5.
The "upper half" of the data set is the set {12,13,14,15,22,23}. The median here is 14.5. So Q3=14.5.
A box-and-whisker plot displays the values Q1 , Q2 , and Q3 , along with the extreme values of the data
set ( 2 and 23 , in this case):

A box & whisker plot shows a "box" with left edge at Q1 , right edge at Q3 , the "middle" of the box
at Q2 (the median) and the maximum and minimum as "whiskers".
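A short sketch that reproduces Example-1 with the same 'median of the lower/upper half' convention (the helper function quartiles below is ours, not a library routine), and then applies the outlier fences defined earlier:

```python
import statistics

data = [2, 6, 7, 8, 8, 11, 12, 13, 14, 15, 22, 23]

def quartiles(values):
    """Q1, Q2, Q3 using the 'median of the lower/upper half' convention."""
    values = sorted(values)
    n = len(values)
    q2 = statistics.median(values)
    lower = values[: n // 2]          # excludes the median itself when n is odd
    upper = values[(n + 1) // 2 :]
    return statistics.median(lower), q2, statistics.median(upper)

q1, q2, q3 = quartiles(data)          # 7.5, 11.5, 14.5 – matching the worked example
iqr = q3 - q1                         # 7.0
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(q1, q2, q3, iqr, (low_fence, high_fence))
print([x for x in data if x < low_fence or x > high_fence])   # [] – no outliers here

# Optional: draw the box-and-whisker plot (requires matplotlib)
# import matplotlib.pyplot as plt
# plt.boxplot(data, vert=False); plt.show()
```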
Example -2: Let us take a sample data to understand how to create a box plot.
Here are the runs scored by a cricket team in a league of 12 matches.

Note: If the total number of values is odd then we exclude the Median while calculating Q1 and Q3. Here
since there were two central values we included them. Now, we need to calculate the Inter Quartile
Range.

The only value which falls outside this range, and is therefore an outlier, is 220.

Example-3: Calculate Box and Whisker Plots from the following grouped data

Uses of a Box Plot
Box plots provide a visual summary of the data with which we can quickly identify the average value of
the data, how dispersed the data is, whether the data is skewed or not (skewness).

Pivot Tables:
A pivot table is a powerful data summarization tool that can automatically sort, count, and sum up data
stored in tables and display the summarized data. Pivot tables are useful to quickly create crosstabs (a
process or function that combines and/or summarizes data from one or more sources into a concise format
for analysis or reporting) to display the joint distribution of two or more variables.
Typically, with a pivot table the user sets up and changes the data summary's structure by dragging and
dropping fields graphically. This "rotation" or pivoting of the summary table gives the concept its name.
Three key reasons for organizing data into a pivot table are:
 To summarize the data contained in a lengthy list into a compact format.
 To find relationships within the data those are otherwise hard to see because of the amount of detail.
 To organize the data into a format that‘s easy to read.
HeatMap
Heatmaps visualize the data in a 2-dimensional format in the form of colored maps. The color maps use
hue, saturation, or luminance to achieve color variation to display various details. This color variation
gives visual cues to the readers about the magnitude of numeric values. HeatMaps is about replacing
numbers with colors because the human brain understands visuals better than numbers, text, or any
written data.
A heatmap (or heat map) is a graphical representation of numerical data, where individual data points
contained in the data set are represented using different colors. The key benefit of heatmaps is that they
simplify complex numerical data into visualizations that can be understood at a glance. For example, on
website heatmaps 'hot' colours depict high user engagement, while 'cold' colours depict low engagement.
Uses of HeatMap
1. Business Analytics: A heat map is used as a visual business analytics tool. A heat map gives quick
visual cues about the current results, performance, and scope for improvements. Heatmaps can analyze
the existing data and find areas of intensity that might reflect where most customers reside, areas of risk
of market saturation, or cold sites and sites that need a boost.
2. Website: Heatmaps are used in websites to visualize data on visitors' behavior. This visualization helps
business owners and marketers to identify the best & worst-performing sections of a webpage.
3. Exploratory Data Analysis: EDA is a task performed by data scientists to get familiar with the data.
All the initial studies done to understand the data are known as EDA. EDA is done to summarize their
main features, often with visual methods, which includes Heatmaps.
4. Molecular Biology: Heat maps are used to study disparity and similarity patterns in DNA, RNA, etc.
5. Marketing and Sales: The heatmap's capability to detect warm and cold spots is used to improve
marketing response rates by targeted marketing. Heatmaps allow the detection of areas that respond to
campaigns, under-served markets, customer residence, and high sale trends, which helps optimize product
lineups, capitalize on sales, create targeted customer segments, and assess regional demographics.
Types of HeatMaps
Typically, there are two types of Heatmaps:
Grid Heatmap: The magnitudes of values shown through colors are laid out into a matrix of rows and
columns, mostly by a density-based function. Below are the types of Grid Heatmaps.
Clustered Heatmap: The goal of a Clustered Heatmap is to build associations between both the data points and their features. This type of heatmap implements clustering as part of the process of grouping similar features. Clustered Heatmaps are widely used in the biological sciences for studying gene similarities across individuals.
Correlogram: A correlogram replaces each of the variables on the two axes with the numeric variables in the dataset. Each square depicts the relationship between the two intersecting variables, which helps to build descriptive or predictive statistical models.
Spatial Heatmap: Each square in a Heatmap is assigned a color representation according to the nearby cells' values. The location of a color is according to the magnitude of the value in that particular space. These Heatmaps are a data-driven 'paint by numbers' canvas overlaid on top of an image. The cells with higher values than other cells are given a hot color, while cells with lower values are assigned a cold color.
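As a brief sketch of a grid heatmap of the correlogram kind (the data frame below is simulated; seaborn is one common choice of plotting library), each cell's colour encodes the pairwise correlation between two variables:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Simulated numeric data with some built-in correlation
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 100),
    "weight": rng.normal(70, 12, 100),
    "age":    rng.integers(18, 60, 100),
})
df["weight"] = df["weight"] + 0.5 * (df["height"] - 170)   # make weight depend on height

# Correlogram: a grid heatmap of the pairwise correlation matrix
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap (correlogram)")
plt.show()
```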
ANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the difference
between the means of more than two groups. A one-way ANOVA uses one independent variable, while a
two-way ANOVA uses two independent variables.
A One-Way ANOVA is used to determine how one factor impacts a response variable. For example, we
might want to know if three different studying techniques lead to different mean exam scores. To see if
there is a statistically significant difference in mean exam scores, we can conduct a one-way ANOVA.

A Two-Way ANOVA is used to determine how two factors impact a response variable, and to determine
whether or not there is an interaction between the two factors on the response variable. For example, we
might want to know how gender and how different levels of exercise impact average weight loss. We
would conduct a two-way ANOVA to find out.

ANOVA Real Life Example #1


A large scale farm is interested in understanding which of three different fertilizers leads to the highest
crop yield. They sprinkle each fertilizer on ten different fields and measure the total yield at the end of the
growing season. To understand whether there is a statistically significant difference in the mean yield that
results from these three fertilizers, researchers can conduct a one-way ANOVA, using 'type of fertilizer' as the factor and 'crop yield' as the response.
ANOVA Real Life Example #2
An example to understand this can be prescribing medicines. Suppose, there is a group of patients who
are suffering from fever. They are being given three different medicines that have the same functionality
i.e. to cure fever. To understand the effectiveness of each medicine and choose the best among them, the
ANOVA test is used.
ANOVA is used in a wide variety of real-life situations, but the most common include:
Retail: Stores are often interested in understanding whether different types of promotions, store layouts, advertisement tactics, etc. lead to different sales. This is exactly the type of analysis that ANOVA is built for.
Medical: Researchers are often interested in whether or not different medications affect patients differently, which is why they often use one-way or two-way ANOVAs in these situations.
Environmental Sciences: Researchers are often interested in understanding how different levels of factors affect plants and wildlife. Because of the nature of these types of analyses, ANOVAs are often used.
What is ANOVA Test
The ANOVA test, in its simplest form, is used to check whether the means of three or more populations are equal or not. The ANOVA test applies when there are more than two independent groups. The goal of the ANOVA test is to check for variability within the groups as well as the variability among the groups. The ANOVA test statistic is the F statistic, obtained from the F-test.
ANOVA Test Definition
ANOVA test can be defined as a type of test used in hypothesis testing to compare whether the means of
two or more groups are equal or not. This test is used to check if the null hypothesis can be rejected or not
depending upon the statistical significance exhibited by the parameters. The decision is made by
comparing the ANOVA test statistic with the critical value.
The steps to perform the one way ANOVA test are given below:
Step 1: Calculate the mean for each group.
Step 2: Calculate the total mean. This is done by adding all the means and dividing it by the total number
of means.
Step 3: Calculate the sum of squares between groups, SSB (also called SSC, the sum of squares for columns).
Step 4: Calculate the between-groups degrees of freedom.
Step 5: Calculate the sum of squares due to error (within groups), SSE.
Step 6: Calculate the degrees of freedom of errors.
Step 7: Determine the MSB and the MSE.
Step 8: Find the F test statistic.
Step 9: Using the F table for the specified level of significance, find the critical value. This is given by F(α, df1, df2).
Step 10: If f > F (the critical value), reject the null hypothesis.
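To connect these steps with practice, the following minimal sketch runs a one-way ANOVA in Python with scipy.stats.f_oneway on made-up exam scores for three studying techniques; the score lists are illustrative assumptions, not data from these notes.

# One-way ANOVA sketch: do three studying techniques give different mean exam scores?
from scipy import stats

technique_a = [78, 82, 88, 75, 80]    # illustrative scores for group A
technique_b = [85, 89, 91, 87, 84]    # illustrative scores for group B
technique_c = [70, 72, 68, 74, 71]    # illustrative scores for group C

f_stat, p_value = stats.f_oneway(technique_a, technique_b, technique_c)
print(f"F statistic = {f_stat:.2f}, p-value = {p_value:.4f}")

# At a 0.05 significance level, reject the null hypothesis (equal means)
# when the p-value is below 0.05.
if p_value < 0.05:
    print("Reject H0: at least one group mean differs.")
else:
    print("Fail to reject H0.")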
Assumptions for ANOVA
 Each group sample is drawn from a normally distributed population
 All populations have a common variance
 All samples are drawn independently of each other
 Within each sample, the observations are sampled randomly and independently of each other
ANOVA also uses a null hypothesis and an alternate hypothesis. The null hypothesis in ANOVA is valid when all the sample means are equal, or they don't have any significant difference; thus, they can be considered part of a larger population. On the other hand, the alternate hypothesis is valid when at least one of the sample means is different from the rest of the sample means. In mathematical form, for k groups they can be represented as:
H0: μ1 = μ2 = … = μk (all group means are equal)
H1: at least one group mean μi differs from the others
Sums of Squares: In statistics, the sum of squares is defined as a statistical technique that is used in
regression analysis to determine the dispersion of data points. In the ANOVA test, it is used while
computing the value of F. As the sum of squares tells you about the deviation from the mean, it is also
known as variation.
Degrees of Freedom: Degrees of Freedom refer to the maximum numbers of logically independent
values that have the freedom to vary in a data set.
The Mean Squared Error: It tells us about the average variation in the data. To find a mean square (MSB or MSE), we divide the corresponding sum of squares by its degrees of freedom.
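For reference, with k groups, n_i observations in group i and N total observations, these quantities combine into the one-way ANOVA F statistic as follows (standard formulas, written here in LaTeX notation):

SSB = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2, \qquad df_1 = k - 1, \qquad MSB = \frac{SSB}{k - 1}
SSE = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2, \qquad df_2 = N - k, \qquad MSE = \frac{SSE}{N - k}
F = \frac{MSB}{MSE}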
Example-1: (worked one-way ANOVA calculation)
Example-2: (worked one-way ANOVA calculation)
F-Table: (critical values of the F distribution)
Data Pre-processing
Data preprocessing is the process of converting raw data into clean data that is suitable for modeling. A model can fail for various reasons, and preprocessing steps such as imputing missing values and handling outliers can significantly impact model results. Data preprocessing is done to improve the quality of data in the data warehouse.
 Increases efficiency
 Ease of data mining process
 Removes noisy data, inconsistent data and incomplete data
Data mining and data warehousing are very powerful and popular techniques for analyzing and storing
data, respectively. Data warehousing is all about compiling and organizing data in a common database,
while data mining refers to the process of extracting important information from the databases. From these definitions, we can conclude that the data mining process depends on the data warehouse for identifying patterns in the data and drawing relevant conclusions. The process of data mining involves the use of statistical models and algorithms to find hidden patterns in the data.
Major tasks of Data preprocessing-
Data Cleaning: It is a process to clean the data in such a way that data can be easily integrated.
Data Integration: It is a process to integrate/combine all the data.
Data Reduction: It is a process to reduce large data into a smaller volume in such a way that the data can be easily transformed further.
Data Transformation: It is a process to transform the data into a reliable shape.
Data Discretization: It converts a large number of data values into smaller ones, so that data evaluation and data management become very easy.
Data Cleaning
After you load the data, the first thing to do is check how many variables there are, the types of the variables, their distributions, and any data errors. Data cleaning then fills in the missing values, smooths noisy data, resolves inconsistencies and removes the outliers.
Things to pay attention are:
 There are some missing values.
 There are outliers for store expenses (store_exp). The maximum value is 50000. Who would spend
$50000 a year buying clothes? Is it an imputation error?
 There is a negative value ( -500) in store_exp which is not logical.
 Someone is 300 years old.
 Phone numbers entered in the wrong format.
How can we clean Data:
Data validation: apply constraints to make sure you have valid and consistent data. Data validation is the process of ensuring that the data have undergone data cleansing and are of sufficient quality, that is, that they are both correct and useful.
Data screening: Data screening is a method applied to remove errors from the data and make it correct for statistical analysis.
De-duplication: Delete the duplicate data.
String matching method: Identify close matches between your data and the valid values.
Approaches in Data Cleaning
1. Missing values
2. Noisy Data
1. Missing Values:
Missing values are values that are not stored (recorded) for some variable in the given dataset.
How can you go about filling in the missing values for an attribute? Let's look at the following methods.
 Ignore the data row: This method is suggested for records where a large amount of data is missing, rendering the record meaningless. It is usually avoided when only a few attribute values are missing, because if all rows with missing values are removed, a large part of the data may be lost and model performance will suffer.
 Fill the missing values manually: This is a very time consuming method and hence infeasible for
almost all scenarios.
 Use a global constant to fill in for missing values: A global constant like ―NA‖ or 0 can be used to fill
all the missing data. This method is used when missing values are difficult to be predicted.
 Use attribute mean or median: Mean or median of the attribute is used to fill the missing value.
 Use forward fill or backward fill method: In this, either the previous value or the next value is used to fill the missing value. A mean of the preceding and succeeding values may also be used.
 Use the most probable value to fill in the missing value: (Decision tree, Regression method)
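A minimal pandas sketch of several of these strategies; the small DataFrame and its column names are assumed for illustration only.

# Missing-value handling sketch with pandas (illustrative data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":       [25, 30, np.nan, 41, 38],
    "store_exp": [250.0, np.nan, 320.0, 100.0, np.nan],
})

df_drop  = df.dropna()                             # ignore (drop) rows with missing values
df_const = df.fillna(0)                            # fill with a global constant
df_mean  = df.fillna(df.mean(numeric_only=True))   # fill with the attribute mean
df_ffill = df.ffill()                              # forward fill with the previous value
print(df_mean)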
2. Noisy Data: Noise is a random error or variance in a measured variable.
Approaches for Noisy Data
1. Binning  2. Regression  3. Clustering
1. Binning Methods for Data Smoothing
The binning method can be used for smoothing the data.
Real-world data is often full of noise. Data smoothing is a data pre-processing technique that uses algorithms to remove the noise from the data set. This allows important patterns to stand out.
Unsorted data for price in dollars
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First of all, sort the data
After Sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smoothing the data by equal frequency bins
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26,
Bin 3: 27, 30, 30, 34
Smoothing by bin means
For Bin 1:
(8 + 9 + 15 + 16) / 4 = 12
(dividing by 4, the number of values in the bin: 8, 9, 15, 16)
Bin 1 = 12, 12, 12, 12
For Bin 2:
(21 + 21 + 24 + 26) / 4 = 23
Bin 2 = 23, 23, 23, 23
For Bin 3:
(27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30
Bin 3 = 30, 30, 30, 30
Smoothing by bin boundaries
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34
How to smooth data by bin boundaries?
For each bin, pick the minimum and the maximum value; these are the bin boundaries (the minimum on the left side and the maximum on the right side).
What happens to the middle values?
Each middle value is replaced by the boundary (minimum or maximum) that it is closest to.
Smoothing by bin boundaries, step by step:
Before bin Boundary: Bin 1: 8, 9, 15, 16
Here, 8 is the minimum value and 16 is the maximum value. 9 is closer to 8, so 9 is replaced by 8. 15 is closer to 16 and farther from 8, so 15 is replaced by 16.
After bin Boundary: Bin 1: 8, 8, 16, 16
Before bin Boundary: Bin 2: 21, 21, 24, 26,
After bin Boundary: Bin 2: 21, 21, 26, 26,
Before bin Boundary: Bin 3: 27, 30, 30, 34
After bin Boundary: Bin 3: 27, 27, 27, 34
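The same smoothing can be reproduced in Python. The sketch below sorts the price values from the example, forms three equal-frequency bins, and smooths each bin by its mean and by its boundaries.

# Equal-frequency binning with smoothing by bin means and bin boundaries.
prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
prices.sort()
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # 3 bins of 4 values each

for b in bins:
    mean = round(sum(b) / len(b))
    by_mean = [mean] * len(b)
    # Replace each value with the closer of the two bin boundaries (min or max of the bin).
    by_boundary = [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    print(b, "-> by means:", by_mean, "| by boundaries:", by_boundary)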
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear
(having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups the similar data into clusters. Values that fall outside the clusters (or do not belong to any cluster) may be treated as outliers, although some outliers may also go undetected.
Data Integration
Data integration is the process of merging data from several disparate sources. While performing data
integration, you must work on data redundancy, inconsistency, duplicity, etc.
Data integration is important because it gives a uniform view of scattered data while also maintaining data
accuracy.
Data Integration Approaches
There are mainly two types of approaches for data integration. These are as follows:
Tight Coupling: It is the process of using ETL (Extraction, Transformation, and Loading) to combine
data from various sources into a single physical location.
Loose Coupling: With loose coupling, the data is kept in the actual source databases. This approach provides an interface that takes a query from the user, transforms it into a format that the source database can understand, and then sends the query directly to the source databases to obtain the result.
Data Integration Techniques
There are various data integration techniques in data mining. Some of them are as follows:
Manual Integration: This method avoids using automation during data integration. The data analyst collects, cleans, and integrates the data to produce meaningful information. This strategy is suitable for a small organization with a limited data set, but it is a time-consuming operation.
Middleware Integration: The middleware software is used to take data from many sources, normalize it,
and store it in the resulting data set.
Application-based integration: It is using software applications to extract, transform, and load data from
disparate sources. This strategy saves time and effort, but it is a little more complicated because building
such an application necessitates technical understanding.
Uniform Access Integration: This method combines data from even more disparate sources. However, the data's position is not altered in this scenario; the data stays in its original location. The technique merely generates a unified view of the integrated data. The integrated data does not need to be stored separately because the end-user only sees the integrated view.
Data Transformation:
Data transformation changes the format, structure, or values of the data and converts them into clean,
usable data.
Data Transformation Techniques
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some algorithms. It
allows for highlighting important features present in the dataset. It helps in predicting the patterns. When
collecting data, it can be manipulated to eliminate or reduce any variance or any other noise form.
2. Attribute Construction
In the attribute construction method, the new attributes consult the existing attributes to construct a new
data set that eases data mining. New attributes are created and applied to assist the mining process from
the given attributes. This simplifies the original data and makes the mining more efficient.
For example, suppose we have a data set containing measurements of different plots, i.e., the height and width of each plot. Here, we can construct a new attribute 'area' from the attributes 'height' and 'width'. This also helps in understanding the relations among the attributes in a data set.
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary format. The data
may be obtained from multiple data sources to integrate these data sources into a data analysis
description. This is a crucial step since the accuracy of data analysis insights is highly dependent on the
quantity and quality of the data used.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of each year. We
can aggregate the data to get the enterprise's annual sales report.
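A minimal pandas sketch of this kind of aggregation, using a hypothetical quarterly sales table:

# Aggregating quarterly sales into annual sales (illustrative data).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [120, 150, 130, 170, 140, 160, 155, 180],
})

annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)   # one row per year with the total annual sales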
4. Data Normalization: Normalizing the data refers to scaling the data values to a much smaller range
such as [-1, 1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and we have n number of observed values for attribute A that
are V1, V2, V3, ….Vn.
Min-max normalization: This method implements a linear transformation on the original data. Let us
consider that we have minA and maxA as the minimum and maximum value observed for attribute A and
Vi is the value for attribute A that has to be normalized. Min-max normalization maps Vi to V'i in a new, smaller range [new_minA, new_maxA].
The formula for min-max normalization is given below:
V'i = ((Vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
For example, suppose we have $12,000 and $98,000 as the minimum and maximum values for the attribute income, and [0.0, 1.0] is the range to which we have to map the value $73,600.
The value $73,600 would be transformed using min-max normalization as follows:
V' = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716
Z-score normalization: This method normalizes the value for attribute A using the mean and standard deviation. The following formula is used for Z-score normalization:
V'i = (Vi − Ā) / σA
Here Ā and σA are the mean and standard deviation of attribute A, respectively.
For example, suppose the mean and standard deviation for attribute A are $54,000 and $16,000, and we have to normalize the value $73,600 using z-score normalization:
V' = (73,600 − 54,000) / 16,000 = 1.225
Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point in the value. The movement of the decimal point depends on the maximum absolute value of A. The formula for decimal scaling is given below:
V'i = Vi / 10^j
Here j is the smallest integer such that max(|V'i|) < 1.
Salary bonus    Formula       Normalized value after decimal scaling
400             400 / 1000    0.4
310             310 / 1000    0.31
We check the maximum absolute value of the attribute "salary bonus". Here the maximum value is 400, which contains three digits, so j = 3 and we divide every value by 10^3 = 1000.
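All three normalization methods can be computed directly in Python. The sketch below reproduces the worked numbers above (income value $73,600 with min $12,000, max $98,000, mean $54,000 and standard deviation $16,000, plus the salary-bonus decimal-scaling example).

# Min-max, z-score and decimal-scaling normalization (values from the examples above).
v = 73600.0
min_a, max_a  = 12000.0, 98000.0     # observed minimum and maximum of income
mean_a, std_a = 54000.0, 16000.0     # mean and standard deviation of income

# Min-max normalization into the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Z-score normalization
v_zscore = (v - mean_a) / std_a

# Decimal scaling for the salary-bonus values (max absolute value 400, so j = 3)
bonus = [400, 310]
j = len(str(int(max(abs(b) for b in bonus))))      # digit count of the maximum value
bonus_scaled = [b / (10 ** j) for b in bonus]

print(round(v_minmax, 3), round(v_zscore, 3), bonus_scaled)   # 0.716 1.225 [0.4, 0.31]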
5. Data Discretization
This is a process of converting continuous data into a set of data intervals. Continuous attribute values are
substituted by small interval labels. This makes the data easier to study and analyze. If a data mining task
handles a continuous attribute, then its continuous values can be replaced by discrete interval labels. This
improves the efficiency of the task.
For example, the values for the age attribute can be replaced by the interval labels such as (0-10, 11-20…)
or (kid, youth, adult, senior).
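A small pandas sketch of this age discretization; the ages and the interval boundaries below are assumed for illustration.

# Discretizing a continuous age attribute into interval labels with pd.cut.
import pandas as pd

ages = pd.Series([4, 9, 15, 23, 37, 45, 62, 71])
labels = pd.cut(ages, bins=[0, 10, 20, 60, 100],
                labels=["kid", "youth", "adult", "senior"])
print(labels.tolist())   # ['kid', 'kid', 'youth', 'adult', 'adult', 'adult', 'senior', 'senior']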
6. Data Generalization
It converts low-level data attributes to high-level data attributes using a concept hierarchy. This conversion from a lower conceptual level to a higher one is useful for getting a clearer picture of the data. Data generalization is commonly carried out using two approaches: the data cube (OLAP) approach and the attribute-oriented induction approach.
For example, age data may appear as numeric values such as (20, 30) in a dataset. It is transformed to a higher conceptual level as a categorical value (young, old).
Data Reduction
Data reduction techniques ensure the integrity of data while reducing the data. Data reduction is a process
that reduces the volume of original data and represents it in a much smaller volume. Data reduction
techniques are used to obtain a reduced representation of the dataset that is much smaller in volume by
maintaining the integrity of the original data. By reducing the data, the efficiency of the data mining
process is improved, which produces the same analytical results.
Techniques of Data Reduction
1. Dimensionality Reduction: Dimensionality reduction is the process of reducing the number of unwanted variables or attributes (dimensions, i.e., features or columns in the dataset). It is a very important stage of data pre-processing and is considered a significant task in data mining applications. For example, suppose a dataset stores both a customer's mobile number and the mobile network (SIM provider). Since the network can be determined from the mobile number, the mobile network dimension can be dropped. When we reduce dimensions, attributes are removed or combined in such a way that the significant characteristics of the original dataset are not lost, so the reduced data is ready for data mining.
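As a concrete sketch (one common technique, not the only one), principal component analysis from scikit-learn combines correlated dimensions into a few components while retaining most of the dataset's variance; the random matrix below is illustrative data.

# Dimensionality reduction sketch with PCA (illustrative random data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 rows with 10 features

pca = PCA(n_components=3)               # keep only 3 combined dimensions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                           # (100, 3)
print(pca.explained_variance_ratio_.round(2))    # variance retained by each component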
2. Numerosity Reduction:
Numerosity reduction is a data reduction technique which replaces the original data with a smaller form of data representation. There are two kinds of techniques for numerosity reduction: parametric and non-parametric methods.
Parametric Methods –
For parametric methods, data is represented using some model. The model is used to estimate the data, so
that only parameters of data are required to be stored, instead of actual data. Regression and Log-Linear
methods are used for creating such models.
Non-Parametric Methods –
These methods store reduced representations of the data and include histograms, clustering, sampling and data cube aggregation.
Histograms: Histogram is the data representation in terms of frequency.
Clustering: Clustering divides the data into groups/clusters.
Sampling: Sampling can be used for data reduction because it allows a large data set to be represented by
a much smaller random data sample (or subset).
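For example, simple random sampling as a numerosity-reduction step takes one line in pandas; the DataFrame below is a hypothetical example.

# Numerosity reduction by simple random sampling without replacement.
import pandas as pd

df = pd.DataFrame({"customer_id": range(1, 10001),
                   "store_exp":   range(100, 10100)})

sample = df.sample(frac=0.01, random_state=42)   # keep a 1% random subset
print(len(df), "->", len(sample))                # 10000 -> 100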
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a multidimensional
aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus
achieving data reduction.
Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube provides precomputed and summarized data, which gives data mining fast access to the summarized information.
4. Data Compression
Data compression employs modification, encoding, or converting the structure of data in a way that
consumes less space. Data compression involves building a compact representation of information by
removing redundancy and representing data in binary form.
Data that can be restored exactly from its compressed form is said to use lossless compression. In contrast, when it is not possible to restore the original form from the compressed form, the compression is lossy. Dimensionality and numerosity reduction methods can also be viewed as forms of data compression.
This technique reduces the size of the files using different encoding mechanisms, such as Huffman
Encoding and run-length Encoding. We can divide it into two types based on their compression
techniques.
Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and minimal data
size reduction. Lossless data compression uses algorithms to restore the precise original data from the
compressed data.
Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but are still useful enough to retrieve information from. For example, the JPEG image format uses lossy compression, yet the result conveys a meaning equivalent to the original image. Methods such as the Discrete Wavelet Transform (DWT) and PCA (principal component analysis) are examples of this type of compression.
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide continuous attributes into data intervals. We replace the many continuous values of an attribute with labels of a small number of intervals.
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such as 43 for age) to
high-level concepts (categorical variables such as middle age or Senior).
Data Discretization
Data discretization refers to a method of converting a huge number of data values into smaller ones so
that the evaluation and management of data become easy. In other words, data discretization is a method
of converting attributes values of continuous data into a finite set of intervals with minimum data loss.
There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class (label) data is used. Unsupervised discretization does not use class data and is instead classified by the direction in which the operation proceeds, i.e., a top-down splitting strategy or a bottom-up merging strategy.
Now, we can understand this concept with the help of an example
Suppose we have an attribute of Age with the given values
Age 1,5,9,4,7,11,14,17,13,18, 19,31,33,36,42,44,46,70,74,78,77
Table before and after discretization:
Attribute: Age
Before discretization:  1, 5, 4, 9, 7 | 11, 14, 17, 13, 18, 19 | 31, 33, 36, 42, 44, 46 | 70, 74, 77, 78
After discretization:   Child         | Young                  | Mature                  | Old
Another example is web analytics, where we gather statistics on website visitors. For example, all visitors who visit the site from an IP address located in India are shown under the country level "India".
Some Famous techniques of data discretization:
Histogram analysis: A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. Histograms assist in inspecting the data distribution, for example, to reveal outliers, skewness, or an approximately normal distribution.
Binning: Binning refers to a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and the development of concept hierarchies.
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm can be applied to partition the values of a numeric attribute x into clusters or groups, and each cluster then becomes one discrete interval of x.
Data discretization using decision tree analysis
In decision tree analysis, data discretization is performed using a top-down splitting technique through a supervised procedure. To discretize a numeric attribute, the split point that gives the least entropy (with respect to the class labels) is selected, and the procedure is then applied recursively. The recursive process divides the attribute's range into discretized disjoint intervals, from top to bottom, using the same splitting criterion.
Data discretization using correlation analysis
When discretizing data by correlation analysis (for example, the ChiMerge technique), the best neighboring intervals, i.e., adjacent intervals with the most similar distributions, are identified and then merged recursively to form larger intervals until a stopping condition is reached. It is a supervised, bottom-up procedure.
Data discretization and concept hierarchy generation
The term hierarchy represents an organizational structure or mapping in which items are ranked according
to their levels of importance. For example, in computer science there are different types of hierarchical systems; a document placed in a folder in Windows, at a specific place in the tree structure, is a good example of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.
Let's understand this concept hierarchy for the dimension location with the help of an example.
A particular city can map with the belonging country. For example, New Delhi can be mapped to India,
and India can be mapped to Asia.
Top-down mapping
Top-down mapping generally starts at the top with general (high-level) concepts and ends at the bottom with specialized (low-level) information.
Bottom-up mapping
Bottom-up mapping generally starts at the bottom with specialized (low-level) information and ends at the top with generalized (high-level) concepts.
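A tiny sketch of bottom-up concept-hierarchy mapping for the location dimension; the dictionaries are illustrative assumptions.

# Bottom-up concept hierarchy for location: city -> country -> continent.
city_to_country = {"New Delhi": "India", "Tokyo": "Japan", "Paris": "France"}
country_to_continent = {"India": "Asia", "Japan": "Asia", "France": "Europe"}

def generalize(city: str) -> tuple:
    country = city_to_country[city]
    continent = country_to_continent[country]
    return city, country, continent

print(generalize("New Delhi"))   # ('New Delhi', 'India', 'Asia')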