Data Science

The document provides an overview of data science concepts, including definitions, processes, and comparisons of various techniques and tools. It covers topics such as data cleaning, sampling, supervised and unsupervised learning, and the bias-variance trade-off. Additionally, it discusses the importance of statistical significance and provides examples of applications in data science.

Uploaded by

Partho Dey

19.

Statistical significance is
A. The science of collecting, organizing, and applying numerical facts
B. Measure of the probability that a certain hypothesis is incorrect given certain observations
C. One of the defining aspects of a data warehouse, which is specially built around all the
existing applications of the operational data
D. None of these
1) Explain Data Science

Ans:
Data Science is using data to find useful information. It combines math, coding, and machine
learning to solve problems.

Steps:

1. Collect – Get data.
2. Clean – Fix mistakes.
3. Analyze – Find patterns.
4. Model – Make predictions.
5. Use – Apply results.

In short: Collect → Clean → Analyze → Model → Use

Examples:

●​ Fraud detection
●​ Movie suggestions (Netflix)
●​ Health predictions

2) Compare box plot and histogram.

Ans:

Aspect      | Box Plot                         | Histogram
Purpose     | Summarizes data                  | Shows data distribution and frequency
Data Type   | Continuous                       | Continuous and discrete
Granularity | Low                              | High
Use Case    | Summary and comparison           | Detailed distribution analysis
Displays    | Median, quartiles, IQR, outliers | Frequency in bins
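
As a rough sketch of the difference, the snippet below computes what a box plot displays (quartiles, IQR, outliers) and what a histogram displays (counts per bin) for the same data. The sample values and the bin width of 5 are assumptions chosen only for illustration.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 3, 4, 5, 5, 5, 6, 7, 8, 9, 30]

# Box-plot view: quartiles, IQR, and points beyond the 1.5 x IQR fences.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

# Histogram view: how many values fall in each fixed-width bin.
bin_width = 5  # assumed bin width
histogram = {}
for v in values:
    bin_start = (v // bin_width) * bin_width
    histogram[bin_start] = histogram.get(bin_start, 0) + 1

print("median:", median, "IQR:", iqr, "outliers:", outliers)
print("bin counts:", histogram)
```

The box-plot summary reduces the data to a few numbers, while the bin counts keep more detail about where values pile up.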


3) Explain briefly about Data science tools.
Ans:

Category             | Tools
Programming Language | Python, R
Data Manipulation    | Pandas, SQL
Databases            | SQL, NoSQL
Cloud                | Google Cloud, AWS
Data Storage         | MySQL, MongoDB
IDEs                 | Jupyter Notebooks, VS Code

4) Interpret applications of classification with example


Ans:
Email Spam Detection: Classifying emails as spam or not spam.
Customer Segmentation: Assigning customers to predefined segments based on purchasing behavior.
Speech Recognition: Converting spoken language into text.
Image Recognition: Detecting objects in images.
Medical Diagnosis: Classifying diseases as present or absent (e.g., cancer detection).

5) List down the conditions for Overfitting and Underfitting.


Ans:
Overfitting Conditions:
Model too complex for the data
Small or insufficient training data
Too much training
Too many parameters
Lack of regularization
Too many features

Underfitting Conditions:
Too simple a model
Insufficient training
Too much regularization
Too few features
Low model complexity relative to data complexity
Data preprocessing issues

6) Summarize the reason why Python is used for data cleaning in Data Science
Ans:
Python is used for data cleaning because:

Libraries: Pandas and NumPy make cleaning tasks easy.
Flexibility: Works with different data formats.
Efficiency: Handles large datasets quickly.
Community Support: Lots of resources and help from the community.
Automation: Can automate cleaning tasks.

7) Define data analytics.


Ans:
Data analytics is the process of using tools and techniques to analyze data and find patterns.
It involves four activities: collecting, cleaning, analyzing, and describing data.

Advantages: Improved Efficiency, Risk Management, Enhanced Operational Performance,
Better Financial Management

8) Illustrate supervised and unsupervised learning.

Ans:
Supervised Learning:
Supervised learning is a machine learning technique that trains a model on labeled data
(inputs paired with the correct outputs).
Goal: The model learns to predict the output for new, unseen data.
Example: Email Spam Classification

Unsupervised Learning:
Unsupervised learning is a machine learning technique that trains a model on unlabeled data
(just inputs, no outputs), so no human-provided answers guide it.
Goal: The model finds hidden patterns or groups in the data.
Example: Customer Segmentation
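
The contrast can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production method: the supervised part uses a 1-nearest-neighbor rule, the unsupervised part uses a two-group k-means on one feature, and the data, feature meanings, and function names (`predict`, `two_means`) are all hypothetical.

```python
# Supervised: labeled examples (input, correct output) drive the prediction.
# Hypothetical feature: number of suspicious words in an email.
labeled = [(0, "not spam"), (1, "not spam"), (7, "spam"), (9, "spam")]

def predict(x):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    nearest = min(labeled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Unsupervised: only inputs are given; the algorithm forms the groups itself.
# Hypothetical feature: monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]

def two_means(points, rounds=10):
    """Split 1-D points into two groups around two centroids (k-means, k=2)."""
    low, high = min(points), max(points)  # simple deterministic starting centroids
    for _ in range(rounds):
        group_low = [p for p in points if abs(p - low) <= abs(p - high)]
        group_high = [p for p in points if abs(p - low) > abs(p - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high

print(predict(8))        # uses the labels it was given
print(two_means(spend))  # discovers two customer groups without labels
```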

9) Write briefly about Data mining concept.


Ans:

Data mining is the process of sorting through large datasets to discover hidden patterns,
trends, and relationships that can help solve business problems.

Data Collection: Gather data from various sources.
Analysis: Use techniques such as machine learning algorithms to find patterns.
Goal: Extract valuable knowledge.

Common Techniques in Data Mining: Classification, Clustering, Association

10) Organize Data cleaning and sampling with an example


Ans:

Data Cleaning

Data cleaning is the process of fixing or removing incorrect, incomplete, or duplicate data to
improve its quality.

Example:

● Removing duplicate customer records.
● Filling missing phone numbers.
● Correcting invalid email formats.

Sampling

Sampling is the process of selecting a smaller, representative subset of data from a larger
dataset for analysis.

Example:
●​ From a dataset of 100,000 customers, you randomly select a sample of 1,000 customers
to analyze sales patterns.
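
A minimal sketch of both ideas, assuming a small list of hypothetical customer records (the field names, the "unknown" placeholder, and the simple "@" check are illustrative choices, not a standard). Invalid emails are only flagged here, since correcting them would need the real address.

```python
import random

# Hypothetical customer records; field names are illustrative.
customers = [
    {"id": 1, "email": "ana@example.com", "phone": "555-0101"},
    {"id": 1, "email": "ana@example.com", "phone": "555-0101"},  # duplicate
    {"id": 2, "email": "bob-at-example.com", "phone": None},     # bad email, no phone
    {"id": 3, "email": "cy@example.com", "phone": None},         # no phone
]

def clean(records):
    """Drop duplicate ids, fill missing phones, and flag invalid emails."""
    seen_ids = set()
    cleaned = []
    for record in records:
        if record["id"] in seen_ids:
            continue  # remove duplicate customer records
        seen_ids.add(record["id"])
        fixed = dict(record)
        if fixed["phone"] is None:
            fixed["phone"] = "unknown"  # fill missing phone numbers
        if "@" not in fixed["email"]:
            fixed["email"] = None  # flag invalid email for later correction
        cleaned.append(fixed)
    return cleaned

cleaned = clean(customers)

# Sampling: pick 1,000 of 100,000 customer ids at random, without replacement.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample_ids = rng.sample(range(100_000), k=1_000)

print(len(cleaned), "clean records;", len(sample_ids), "sampled ids")
```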

11) Explain briefly about Data Science.


Ans: Same as question 1.

12) List down the conditions for Overfitting and Underfitting.

Ans: Same as question 5.

13) Explain briefly about the libraries used in Data Science


Ans:
Pandas: For loading, cleaning, and manipulating tabular data.
NumPy: For math and fast work with numeric arrays.
Matplotlib: For making charts and graphs.
Seaborn: For attractive, easy-to-read statistical charts.
Scikit-learn: For machine learning tasks.
TensorFlow/Keras: For building and training neural networks.
SciPy: For scientific computing.
Statsmodels: For statistical analysis.
Plotly: For interactive charts and graphs.
NLTK/SpaCy: For working with text data.

14) Write a short note about Data cleaning.


Ans:
Data Cleaning is the process of identifying and correcting errors in a dataset to improve its
quality. The goal is to ensure the data is accurate, complete, and ready for analysis.

Key Steps:

1. Remove duplicates.
2. Handle missing values.
3. Correct errors.
4. Standardize formats.
5. Remove outliers.

15) Write a brief note about Data sampling

Ans:
Data Sampling is the process of selecting a smaller subset of data from a larger dataset for
analysis. It helps make analysis more manageable, especially when dealing with large datasets.

Types:

1. Random Sampling: Random selection of data points.
2. Stratified Sampling: Dividing the data into groups (strata) and sampling from each group.
3. Systematic Sampling: Selecting every nth data point.
4. Convenience Sampling: Choosing easily available data.
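
The four types can be sketched side by side. The dataset, the 80/20 group split, the 10% sampling rate, and the seed below are assumptions made only to keep the example small and repeatable.

```python
import random

rng = random.Random(0)  # fixed seed, assumed only for repeatability

# Hypothetical dataset: (record id, group label) pairs, 80 in group A and 20 in group B.
data = [(i, "A" if i < 80 else "B") for i in range(100)]

# Random sampling: every record has the same chance of selection.
random_sample = rng.sample(data, k=10)

# Stratified sampling: take 10% from each group so both groups are represented.
stratified_sample = []
for group in ("A", "B"):
    members = [row for row in data if row[1] == group]
    stratified_sample += rng.sample(members, k=len(members) // 10)

# Systematic sampling: every nth record (here n = 10).
systematic_sample = data[::10]

# Convenience sampling: whatever is easiest to reach (here, the first 10 rows).
convenience_sample = data[:10]

print(len(random_sample), len(stratified_sample), len(systematic_sample), len(convenience_sample))
```

Note that the convenience sample contains only group A records, which shows why it can misrepresent the full dataset.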

16) How can outlier values be determined?

Ans:

Z-Score → Flag values whose z-score is above 3 or below -3.

IQR Method → Flag values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.

Box Plot → Dots outside the whiskers.

ML Methods → Isolation Forest, DBSCAN.

Visual Check → Scatter plots, histograms.
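
The first two rules can be sketched with the standard library. The measurements are hypothetical, and the sketch assumes the population standard deviation and the "inclusive" quartile method; other conventions shift the thresholds slightly.

```python
import statistics

# Hypothetical measurements with one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 13, 12, 100]

# Z-score rule: flag points more than 3 standard deviations from the mean.
mean = statistics.mean(data)
stdev = statistics.pstdev(data)  # population standard deviation (an assumption)
z_outliers = [x for x in data if abs((x - mean) / stdev) > 3]

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

Because an extreme value inflates the mean and standard deviation, the z-score rule can miss outliers in small samples; the IQR rule is less affected.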

17) Compare between data analytics and data science.

Ans:

Aspect     | Data Analytics               | Data Science
Focus      | "What happened?"             | "What will happen?"
Data       | Structured data              | Both structured and unstructured data
Tools      | Excel, SQL                   | Python, R
Complexity | Lower                        | Higher
Jobs       | Data Analyst                 | Data Scientist
Methods    | Reports, charts, SQL         | AI, ML, coding
Outcome    | Finds patterns in past data  | Predicts future trends


18) Explain briefly about Eigenvectors and Eigenvalues.

Ans:

Eigenvectors are special nonzero vectors that stay on the same line when a matrix is applied
to them; only their length changes (a negative eigenvalue also flips them).

Eigenvalues are the numbers that tell us how much each eigenvector is stretched or shrunk.
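
In symbols, a vector v is an eigenvector of a matrix A with eigenvalue λ when Av = λv. The check below uses a small 2×2 matrix picked for illustration; the helper `mat_vec` is a hypothetical name, not a library function.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

# Example matrix chosen for illustration.
A = [[2, 1],
     [1, 2]]

v1 = [1, 1]    # eigenvector: A stretches it by 3
v2 = [1, -1]   # eigenvector: A leaves it unchanged (eigenvalue 1)
w = [1, 0]     # not an eigenvector: A changes its direction

print(mat_vec(A, v1))  # [3, 3]  = 3 * v1
print(mat_vec(A, v2))  # [1, -1] = 1 * v2
print(mat_vec(A, w))   # [2, 1]  is not a multiple of w
```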

19) Interpret what you understand by Imbalanced Data.

Ans:
Imbalanced data means one class appears much more often than the other.

Example:

● Fraud detection: 99 out of 100 transactions are normal, only 1 is fraud.
● Medical tests: 95 out of 100 people are healthy, only 5 are sick.

Advantages

● Matches real-world data (e.g., fraud, diseases).
● Helps find rare but important cases.
● Faster training with more common data.

Disadvantages

● Model may ignore rare cases.
● Harder to train good models.

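
The "model may ignore rare cases" problem can be shown with the fraud numbers from the example. The "model" here is a deliberately naive assumption: it always predicts the majority class.

```python
# 99 normal transactions (0) and 1 fraud (1), matching the fraud example.
actual = [0] * 99 + [1]

# A naive "model" that always predicts the majority class (normal).
predicted = [0] * 100

# Accuracy: share of all predictions that are correct.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Recall for fraud: share of actual fraud cases the model caught.
fraud_caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = fraud_caught / sum(actual)

print("accuracy:", accuracy)    # 0.99
print("fraud recall:", recall)  # 0.0
```

High accuracy hides the fact that no fraud was detected, which is why accuracy alone is a poor metric on imbalanced data.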
20) Compare expected value and mean value.
Ans:

Aspect     | Expected Value                               | Mean Value
Nature     | Predicted average                            | Actual average
Use        | Future predictions                           | Past data analysis
Example    | Average roll of a die = 3.5                  | Rolling a die 10 times and averaging results
Purpose    | Used to predict long-term average outcomes   | Used to summarize a given dataset
Depends on | Probability of values                        | Total sum of values
Stability  | Stays the same for a given probability       | Changes with different data samples
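
The die example can be worked through directly. The expected value comes from the probabilities alone, while the sample mean depends on which rolls happen to occur; the seed is an assumption used only to make the run repeatable.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Expected value: each face weighted by its probability (1/6 for a fair die).
expected_value = sum(face * Fraction(1, 6) for face in faces)
print("expected value:", expected_value)  # 7/2, i.e. 3.5

# Mean value: average of 10 actual rolls; it changes from sample to sample.
rng = random.Random(1)  # assumed seed
rolls = [rng.choice(faces) for _ in range(10)]
sample_mean = sum(rolls) / len(rolls)
print("rolls:", rolls, "sample mean:", sample_mean)
```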

21) Define bias-variance trade-off.


Ans:

Bias-Variance Trade-off

It is about finding the right balance in a model:

● Bias (Too simple) → Model underfits: it doesn’t learn enough, so it makes mistakes even on training data.
● Variance (Too complex) → Model overfits: it learns noise in the training data, so it makes mistakes on new data.

Goal:

Find a balance where the model is not too simple or too complex.
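
The balance can be stated as a formula. Under the usual assumptions (squared-error loss, data generated as y = f(x) + ε with noise variance σ², and a fitted model written here as \hat{f}), the expected prediction error splits into three parts:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

Making a model more complex usually lowers the bias term and raises the variance term, so the goal is the complexity that minimizes their sum.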

22) Define the confusion matrix.

Ans:

Confusion Matrix

A confusion matrix helps check how well a model predicts things. It compares actual vs.
predicted results.
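Table Example:

            | Predicted: Yes | Predicted: No
Actual: Yes | Correct (TP)   | Wrong (FN)
Actual: No  | Wrong (FP)     | Correct (TN)

Simple Meaning:

● True Positive (TP) → Model predicts Yes and the actual value is Yes (right)
● True Negative (TN) → Model predicts No and the actual value is No (right)
● False Positive (FP) → Model predicts Yes but the actual value is No (wrong)
● False Negative (FN) → Model predicts No but the actual value is Yes (wrong)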
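
The four cells can be counted directly from actual and predicted labels. The label lists below are hypothetical, with 1 standing for "Yes" and 0 for "No".

```python
# Hypothetical labels: 1 = "Yes", 0 = "No".
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # predicted Yes, actually Yes
fn = sum(a == 1 and p == 0 for a, p in pairs)  # predicted No, actually Yes
fp = sum(a == 0 and p == 1 for a, p in pairs)  # predicted Yes, actually No
tn = sum(a == 0 and p == 0 for a, p in pairs)  # predicted No, actually No

accuracy = (tp + tn) / len(pairs)

print("TP:", tp, "FN:", fn)  # TP: 3 FN: 1
print("FP:", fp, "TN:", tn)  # FP: 1 TN: 3
print("accuracy:", accuracy)  # 0.75
```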

23) List the major drawbacks in Linear model.

Ans:

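● Needs a straight-line (linear) pattern
● Limited flexibility
● Can’t capture variable interactions unless interaction terms are added
● Overfits with too many features
● Can’t use categorical inputs directly (they must be encoded first)
● Struggles with correlated inputs (multicollinearity)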

24) Develop RMSE and MSE in a linear regression model.

Ans:
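
A minimal sketch, assuming the standard definitions: MSE (mean squared error) is the average of the squared differences between actual and predicted values, and RMSE (root mean squared error) is its square root, which puts the error back in the units of the target. The data points are hypothetical, and the line is fitted by ordinary least squares.

```python
import math

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Fit y = intercept + slope * x by ordinary least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# MSE: mean of squared residuals. RMSE: square root of MSE.
predictions = [intercept + slope * x for x in xs]
mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / n
rmse = math.sqrt(mse)

print("slope:", slope, "intercept:", intercept)
print("MSE:", mse, "RMSE:", rmse)
```

Lower values of either metric mean the line fits the data more closely; RMSE is often easier to interpret because it shares the units of y.
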
25) Compare between correlation and covariance

Ans:

Aspect         | Correlation                                                  | Covariance
Meaning        | Shows how strongly two variables are related                 | Shows how two variables change together
Range          | Between -1 and +1                                            | Any value (positive or negative)
Units          | Independent of units                                         | Depends on units
Interpretation | +1 (Strong positive), 0 (No relation), -1 (Strong negative)  | Positive (Move together), Negative (Move opposite)
Uses           | Comparing relationships                                      | Checking variable movement
Scale          | Standardized                                                 | Not standardized
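
The "depends on units" row can be checked numerically. The sketch below uses hypothetical height and weight values and the sample (n - 1) form of covariance; converting height from metres to centimetres scales the covariance by 100 but leaves the correlation unchanged.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements.
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weight_kg = [50, 58, 63, 72, 80]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

print("covariance (m): ", covariance(height_m, weight_kg))
print("covariance (cm):", covariance(height_cm, weight_kg))
print("correlation (m): ", correlation(height_m, weight_kg))
print("correlation (cm):", correlation(height_cm, weight_kg))
```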
