Data Science

The document provides an overview of data science concepts, including definitions, processes, and comparisons of various techniques and tools. It covers topics such as data cleaning, sampling, supervised and unsupervised learning, and the bias-variance trade-off. Additionally, it discusses the importance of statistical significance and provides examples of applications in data science.

Uploaded by

Partho Dey

19. Statistical significance is
A. The science of collecting, organizing, and applying numerical facts
B. Measure of the probability that a certain hypothesis is incorrect given certain observations
C. One of the defining aspects of a data warehouse, which is specially built around all the existing applications of the operational data
D. None of these

1) Explain Data Science

Ans:
Data Science is using data to find useful information. It combines math, coding, and machine
learning to solve problems.

Steps:

1. Collect – Get data.

2. Clean – Fix mistakes.
3. Analyze – Find patterns.
4. Model – Make predictions.
5. Use – Apply results.

In short: Collect → Clean → Analyze → Model → Use

Examples:

● Fraud detection
● Movie suggestions (Netflix)
● Health predictions

2) Compare box plot and histogram.

Ans:

Aspect      | Box Plot                         | Histogram
Purpose     | Summarizes data                  | Shows data distribution and frequency
Data type   | Continuous                       | Continuous and discrete
Granularity | Low                              | High
Use case    | Summary and comparison           | Detailed distribution analysis
Displays    | Median, quartiles, IQR, outliers | Frequency in bins
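
To make the "Displays" row concrete, here is a minimal sketch of the numbers behind each plot, using only Python's standard library. The exam scores, the function names, and the bin width of 10 are made up for illustration; a plotting library would normally draw the figures from these values.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max): the values a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

scores = [52, 55, 61, 64, 68, 70, 71, 75, 78, 95]  # hypothetical exam scores

summary = five_number_summary(scores)          # what a box plot summarizes
bins = histogram_counts(scores, bin_width=10)  # what a histogram shows

print("Box plot summary:", summary)
print("Histogram bins:  ", bins)
```

The box plot compresses the data into five numbers, while the histogram keeps a count per bin, which is why the table lists its granularity as higher.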

3) Explain briefly about Data science tools.

Ans:

Category             | Tools
Programming Language | Python, R
Data Manipulation    | Pandas, SQL
Databases            | SQL, NoSQL
Cloud                | Google Cloud, AWS
Data Storage         | MySQL, MongoDB
IDEs                 | Jupyter Notebooks, VS Code

4) Interpret applications of classification with example

Ans:
Email Spam Detection: Classifying emails as spam or not spam.
Customer Segmentation: Assigning customers to predefined segments by purchasing behavior (without predefined segments, this is a clustering task).
Speech Recognition: Converting spoken language into text.
Image Recognition: Detecting objects in images.
Medical Diagnosis: Classifying diseases as present or absent (e.g., cancer detection).

5) List down the conditions for Overfitting and Underfitting.

Ans:
Overfitting Conditions:
● Model too complex for the amount of data
● Small or insufficient training data
● Too much training (too many iterations)
● Too many parameters
● Lack of regularization
● Too many features

Underfitting Conditions:
● Too simple a model (low complexity relative to data complexity)
● Insufficient training
● Too much regularization
● Too few features
● Data preprocessing issues

6) Summarize the reason why Python is used for data cleaning in Data Science
Ans:
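Python is used for data cleaning because:

Libraries: Pandas and NumPy provide ready-made functions for handling missing values, duplicates, and type conversions.
Flexibility: Works with different data formats (e.g., CSV, Excel, JSON, SQL tables).
Efficiency: Vectorized operations in Pandas and NumPy handle large datasets quickly.
Automation: Cleaning steps can be scripted and rerun on new data.
Community Support: Lots of resources and help from the community.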

7) Define data analytics.

Ans:
Data analytics is the process of using tools and techniques to collect, clean, analyze, and describe data in order to find patterns.

Advantages: Improved Efficiency, Risk Management, Enhanced Operational Performance, Better Financial Management

8) Illustrate supervised and unsupervised learning.

Ans:
Supervised Learning:
Supervised learning is a machine learning technique that trains algorithms on labeled data (inputs paired with the correct outputs) to predict outcomes.
Goal: The model learns to predict the output for new, unseen data.
Example: Email Spam Classification

Unsupervised Learning:
Unsupervised learning is a machine learning technique that analyzes unlabeled data (just inputs, no outputs), so no human-provided answers guide the training.
Goal: The model finds hidden patterns or groups in the data.
Example: Customer Segmentation
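
The difference can be sketched in a few lines of plain Python. This is an illustrative toy, not a production method: the one-dimensional data, the nearest-centroid classifier, and the two-cluster k-means loop are all assumptions chosen to keep the example short, and `two_means` assumes the inputs are not all identical.

```python
# Supervised: labeled examples (input, correct output) are available.
labeled = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
           (8.0, "high"), (8.5, "high"), (9.0, "high")]

def fit_centroids(examples):
    """Learn one mean value (centroid) per label from labeled data."""
    groups = {}
    for x, label in examples:
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}

def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Unsupervised: only inputs, no labels. A tiny 1-D k-means with k = 2.
def two_means(xs, iterations=10):
    """Split xs into two groups by alternating assignment and update steps."""
    c1, c2 = min(xs), max(xs)  # simple deterministic starting centers
    for _ in range(iterations):
        g1 = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in xs if abs(x - c1) > abs(x - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

centroids = fit_centroids(labeled)
print(predict(centroids, 3.0))        # label for new, unseen input

unlabeled = [x for x, _ in labeled]   # same inputs with the labels dropped
print(two_means(unlabeled))           # groups found without any labels
```

The supervised model returns a named label because it was trained on labels; the unsupervised one only returns groups, and naming them is left to the analyst.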

9) Write briefly about Data mining concept.

Ans:

Data mining is the process of sorting through large datasets to discover hidden patterns, trends, and relationships that can help solve business problems.

Data Collection: Collect data from various sources.
Analysis: Use techniques such as machine learning algorithms to find patterns.
Goal: Extract valuable knowledge.

Common Techniques in Data Mining: Classification, Clustering, Association

10) Organize Data cleaning and sampling with an example

Ans:

Data Cleaning

Data cleaning is the process of fixing or removing incorrect, incomplete, or duplicate data to
improve its quality.

Example:

● Removing duplicate customer records.

● Filling missing phone numbers.
● Correcting invalid email formats.

Sampling

Sampling is the process of selecting a smaller, representative subset of data from a larger
dataset for analysis.

Example:
● From a dataset of 100,000 customers, you randomly select a sample of 1,000 customers
to analyze sales patterns.

11) Explain briefly about Data Science.

Ans: Same as question 1.

12) List down the conditions for Overfitting and Underfitting.

Ans: Same as question 5.

13) Explain briefly about the libraries used in Data Science

Ans:
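Pandas: For loading, cleaning, and transforming tabular data.
NumPy: For numerical arrays and fast math operations.
Matplotlib: For making charts and graphs.
Seaborn: For statistical charts with clean default styling (built on Matplotlib).
Scikit-learn: For classical machine learning tasks (classification, regression, clustering).
TensorFlow/Keras: For building and training neural networks (deep learning).
SciPy: For scientific computing (optimization, integration, statistics).
Statsmodels: For statistical analysis.
Plotly: For interactive charts and graphs.
NLTK/SpaCy: For working with text data (natural language processing).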

14) Write a short note about Data cleaning.

Ans:
Data Cleaning is the process of identifying and correcting errors in a dataset to improve its
quality. The goal is to ensure the data is accurate, complete, and ready for analysis.

Key Steps:

1. Remove duplicates.

2. Handle missing values.
3. Correct errors.
4. Standardize formats.
5. Handle outliers (remove or cap them where justified).
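
As a minimal sketch of the first few steps, the function below cleans a list of customer records using only the standard library. The field names, the choice of email as the duplicate key, and the "unknown" placeholder are assumptions made for illustration; in practice the same steps are usually done with Pandas.

```python
def clean_records(records, default_phone="unknown"):
    """Standardize formats, drop duplicates, and fill missing phone numbers."""
    cleaned, seen_emails = [], set()
    for rec in records:
        email = rec["email"].strip().lower()        # standardize format
        if email in seen_emails:                    # remove duplicates
            continue
        seen_emails.add(email)
        phone = rec.get("phone") or default_phone   # handle missing values
        cleaned.append({
            "name": rec["name"].strip().title(),    # standardize format
            "email": email,
            "phone": phone,
        })
    return cleaned

# Hypothetical raw data: one duplicate customer and one missing phone number.
raw = [
    {"name": " alice ", "email": "Alice@Example.com", "phone": "555-0100"},
    {"name": "Alice", "email": "alice@example.com ", "phone": None},
    {"name": "bob", "email": "bob@example.com", "phone": ""},
]

cleaned = clean_records(raw)
for row in cleaned:
    print(row)
```

Note that the duplicate is only detected after the email is standardized, which is one reason format fixes usually come before deduplication.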

15) Write a brief note about Data sampling

Ans:
Data Sampling is the process of selecting a smaller subset of data from a larger dataset for analysis. It
helps in making analysis more manageable, especially when dealing with large datasets.

Types:

1. Random Sampling: Every data point has an equal chance of being selected.
2. Stratified Sampling: The data is divided into groups (strata) and a sample is drawn from each group.
3. Systematic Sampling: Selecting every nth data point.
4. Convenience Sampling: Choosing easily available data.
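
The first three types can be sketched with the standard `random` module. The customer list, the `region` field, the fixed seeds, and the 20% fraction are hypothetical values chosen so the example is reproducible.

```python
import random

def random_sample(data, k, seed=0):
    """Simple random sampling: every item is equally likely to be picked."""
    return random.Random(seed).sample(data, k)

def systematic_sample(data, step, start=0):
    """Systematic sampling: take every `step`-th item beginning at `start`."""
    return data[start::step]

def stratified_sample(data, key, fraction, seed=0):
    """Stratified sampling: draw the same fraction from each group."""
    rng = random.Random(seed)
    groups = {}
    for item in data:
        groups.setdefault(key(item), []).append(item)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical dataset: 100 customers, 25 in the north and 75 in the south.
customers = [{"id": i, "region": "north" if i % 4 == 0 else "south"}
             for i in range(100)]

simple = random_sample(customers, 10)
every_tenth = systematic_sample(customers, step=10)
stratified = stratified_sample(customers, key=lambda c: c["region"], fraction=0.2)

print(len(simple), len(every_tenth), len(stratified))
```

Stratified sampling keeps the 1:3 north-to-south ratio of the full dataset, which simple random sampling only does on average.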

16) How can outlier values be determined?

Ans:

Z-Score → A value is flagged if its z-score is above 3 or below -3.

IQR Method → A value is flagged if it is below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

Box Plot → Dots outside whiskers.

ML Methods → Isolation Forest, DBSCAN.

Visual Check → Scatter plots, histograms.
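
The Z-score and IQR rules can be sketched with the standard `statistics` module. The data values are made up, the population standard deviation is used for the z-score, and quartiles use the "inclusive" method; other quartile conventions give slightly different fences. The z-score function assumes the values are not all identical.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs((v - mean) / stdev) > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 13, 12, 95]

print("Z-score outliers:", zscore_outliers(data))
print("IQR outliers:    ", iqr_outliers(data))
```

On very small samples the z-score rule can miss outliers, because a single extreme value inflates the standard deviation it is measured against; the IQR rule is less affected.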

17) Compare between data analytics and data science.

Ans:

Aspect     | Data Analytics              | Data Science
Focus      | "What happened?"            | "What will happen?"
Data       | Structured data             | Both structured and unstructured data
Tools      | Excel, SQL                  | Python, R
Complexity | Lower                       | Higher
Jobs       | Data Analyst                | Data Scientist
Methods    | Reports, charts, SQL        | AI, ML, coding
Outcome    | Finds patterns in past data | Predicts future trends

18) Explain briefly about Eigenvectors and Eigenvalues.

Ans:

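Eigenvectors are special nonzero vectors that stay on the same line when a matrix is applied to them: they are only stretched, shrunk, or flipped, not rotated.

Eigenvalues are the numbers that tell us how much each eigenvector is scaled. For a matrix A, an eigenvector v and its eigenvalue λ satisfy Av = λv.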
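
A small check in plain Python illustrates the idea. The 2×2 matrix and the vectors are chosen for illustration only, and `mat_vec` is a helper written for this sketch; a library such as NumPy would normally compute eigenvalues directly.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

v = [1, 1]    # eigenvector: A stretches it by 3
u = [1, -1]   # eigenvector: A leaves it unchanged (eigenvalue 1)
w = [1, 0]    # not an eigenvector: A changes its direction

Av = mat_vec(A, v)  # [3, 3], which is 3 * v
Au = mat_vec(A, u)  # [1, -1], which is 1 * u
Aw = mat_vec(A, w)  # [2, 1], not a multiple of w

print(Av, Au, Aw)
```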

19) Interpret what you understand by Imbalanced Data.

Ans:
Imbalanced data means one class appears much more often than the others.

Example:

● Fraud detection: 99 out of 100 transactions are normal, only 1 is fraud.
● Medical tests: 95 out of 100 people are healthy, only 5 are sick.

Advantages

● Matches real-world data (e.g., fraud, diseases).
● Helps find rare but important cases.
● The common class has plenty of examples to learn from.

Disadvantages

● Model may ignore rare cases.
● Harder to train good models.
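
The fraud example shows why a model can ignore the rare class and still look good. In this sketch the "model" is a hypothetical one that always predicts the majority class; the labels follow the 99-to-1 split from the example.

```python
# 1 fraud case and 99 normal transactions, as in the example above.
actual = ["fraud"] * 1 + ["normal"] * 99

# A hypothetical model that always predicts the majority class.
predicted = ["normal"] * 100

pairs = list(zip(actual, predicted))

# Accuracy: share of all predictions that are correct.
accuracy = sum(a == p for a, p in pairs) / len(pairs)

# Recall for fraud: share of actual fraud cases the model caught.
fraud_caught = sum(a == "fraud" and p == "fraud" for a, p in pairs)
recall = fraud_caught / actual.count("fraud")

print(f"accuracy = {accuracy:.2f}, fraud recall = {recall:.2f}")
```

Accuracy is 99% even though not a single fraud case is detected, so metrics such as recall or precision on the rare class are usually reported alongside it.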
20) Compare expected value and mean value.

Ans:

Aspect      | Expected Value                                      | Mean Value
Meaning     | Theoretical (probability-weighted) average          | Actual average of observed data
Typical use | Future predictions                                  | Past data analysis
Example     | Average roll of a fair die = 3.5                    | Rolling a die 10 times and averaging the results
Purpose     | Used to predict long-term average outcomes          | Used to summarize a given dataset
Depends on  | Probabilities of the values                         | Sum of the values divided by their count
Stability   | Stays the same for a given probability distribution | Changes with different data samples
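
The die example can be sketched as follows. A fair six-sided die is assumed, and the seed is an arbitrary choice that only makes the ten simulated rolls repeatable.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Expected value: each face weighted by its probability (1/6 for a fair die).
expected_value = sum(Fraction(1, 6) * face for face in faces)

# Mean value: the plain average of rolls that were actually observed.
rng = random.Random(42)  # arbitrary seed for repeatability
rolls = [rng.choice(faces) for _ in range(10)]
sample_mean = sum(rolls) / len(rolls)

print("Expected value:", float(expected_value))
print("Rolls:", rolls)
print("Sample mean:", sample_mean)
```

The expected value is always 3.5, while the sample mean depends on which ten rolls happened to come up; with more rolls it tends to move closer to 3.5.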

21) Define bias-variance trade-off.

Ans:

Bias-Variance Trade-off

It is about finding the right balance in a model:

● Bias (Too simple) → Model makes systematic mistakes because it doesn't learn enough from the data (underfitting).
● Variance (Too complex) → Model learns the noise in the training data and makes mistakes on new data (overfitting).

Goal:

Find a balance where the model is not too simple or too complex.
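
For squared-error loss the balance can be written as a formula. The notation here is an assumption introduced for this note: the data follow $y = f(x) + \varepsilon$ with noise of mean zero and variance $\sigma^2$, and $\hat{f}$ is the model learned from a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making the model more complex usually lowers the bias term and raises the variance term, so the total error is smallest somewhere in between.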

22) Define the confusion matrix.

Ans:

Confusion Matrix

A confusion matrix helps check how well a model predicts things. It compares actual vs.
predicted results.
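Table Example:

            | Predicted: Yes | Predicted: No
Actual: Yes | Correct (TP)   | Wrong (FN)
Actual: No  | Wrong (FP)     | Correct (TN)

Simple Meaning:

● True Positive (TP) → Model predicted Yes and the actual value is Yes (right)
● True Negative (TN) → Model predicted No and the actual value is No (right)
● False Positive (FP) → Model predicted Yes but the actual value is No (wrong)
● False Negative (FN) → Model predicted No but the actual value is Yes (wrong)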
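
As a minimal sketch, the four cells can be counted directly from lists of actual and predicted labels. The labels below are made up, and `confusion_counts` is a helper written for this illustration rather than a library function.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, FN, and TN for a two-class problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted Yes, actually Yes
        elif p == positive:
            fp += 1   # predicted Yes, actually No
        elif a == positive:
            fn += 1   # predicted No, actually Yes
        else:
            tn += 1   # predicted No, actually No
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical labels for eight cases.
actual    = ["yes", "yes", "yes", "no", "no",  "no", "no", "yes"]
predicted = ["yes", "no",  "yes", "no", "yes", "no", "no", "yes"]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)

print(counts)    # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
print(accuracy)  # 0.75
```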

23) List the major drawbacks in Linear model.

Ans:

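● Assumes a straight-line (linear) relationship between inputs and output
● Limited flexibility
● Can't capture variable interactions unless interaction terms are added
● Overfits with too many features
● Categorical inputs must be encoded as numbers first
● Struggles with highly correlated inputs (multicollinearity)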

24) Develop RMSE and MSE in a linear regression model.

Ans:
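Under the standard definitions, MSE (Mean Squared Error) is the average of the squared differences between actual and predicted values, and RMSE (Root Mean Squared Error) is its square root, which puts the error back in the units of the target:

MSE = (1/n) × Σ (actual − predicted)²
RMSE = √MSE

A minimal sketch, assuming a one-variable least-squares line and made-up data points; the function names are chosen for this illustration only:

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(actual, predicted):
    """Mean Squared Error: average of squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
predictions = [slope * x + intercept for x in xs]

print("MSE: ", round(mse(ys, predictions), 4))
print("RMSE:", round(rmse(ys, predictions), 4))
```

Because the errors are squared before averaging, both measures penalize large errors more heavily than small ones.
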
25) Compare between correlation and covariance

Ans:

Aspect         | Correlation                                                        | Covariance
Meaning        | Shows how strongly two variables are related                       | Shows how two variables change together
Range          | Between -1 and +1                                                  | Any value (positive or negative)
Units          | Independent of units                                               | Depends on units
Interpretation | +1 (Strong positive), 0 (No linear relation), -1 (Strong negative) | Positive (Move together), Negative (Move opposite)
Uses           | Comparing relationships                                            | Checking variable movement
Scale          | Standardized                                                       | Not standardized
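
The "Units" row can be checked with a short sketch. The height and weight values are made up, and the sample (n − 1) form of covariance is assumed; correlation is covariance divided by the product of the two standard deviations.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements; the same heights in meters and in centimeters.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weights_kg = [50, 58, 63, 72, 80]
heights_cm = [h * 100 for h in heights_m]

print("Covariance (m):   ", covariance(heights_m, weights_kg))
print("Covariance (cm):  ", covariance(heights_cm, weights_kg))
print("Correlation (m):  ", correlation(heights_m, weights_kg))
print("Correlation (cm): ", correlation(heights_cm, weights_kg))
```

Switching from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged, which is what "standardized" means in the table.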
