Data Science Questions - 10 & 15 Marks

1. What is Data Preprocessing? Explain the steps involved.


Definition:
Data preprocessing is the initial stage in the data analysis pipeline where raw data is cleaned and transformed into a format suitable for analysis and modeling.

Steps Involved:
- Data Cleaning: Handle missing values, remove duplicates, correct errors.
- Data Transformation: Normalize or scale numerical data; encode categorical variables.
- Feature Engineering: Create new features, combine or split existing ones.
- Data Reduction: Use techniques like PCA, LDA, or feature selection to reduce dimensionality.
- Data Integration: Combine data from multiple sources.
- Data Discretization: Convert continuous data into categorical bins.

Importance:
Enhances data quality, reduces noise, and boosts model accuracy.
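
Example (a minimal sketch of these steps with pandas and scikit-learn; the DataFrame and column names are illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative raw data with a missing value and a duplicate row
df = pd.DataFrame({"age": [25, 30, None, 30], "city": ["NY", "LA", "NY", "LA"]})

df = df.drop_duplicates()                                  # cleaning: remove duplicates
df["age"] = df["age"].fillna(df["age"].mean())             # cleaning: impute missing values
df[["age"]] = StandardScaler().fit_transform(df[["age"]])  # transformation: scale numeric data
df = pd.get_dummies(df, columns=["city"])                  # transformation: encode categoricals
print(df)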

2. Define Data Cleaning and Discuss Its Tasks.


Definition:
Data cleaning is the process of correcting or removing inaccurate records from a dataset.

Tasks:
- Missing Values: Imputation (mean, median), deletion, or forward/backward filling.
- Outliers: Detect with z-score, boxplot; remove or transform.
- Noise Handling: Use smoothing techniques or binning.
- Normalization/Scaling: StandardScaler or MinMaxScaler to bring values to a common scale.
- Type Conversion & Deduplication: Convert data types; remove duplicate rows.

Purpose:
Improves data integrity and model reliability.
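
Example (a small sketch of the missing-value tasks with pandas; the series is hypothetical):

import pandas as pd

s = pd.Series([1.0, None, 3.0, None, 5.0])  # hypothetical series with gaps
print(s.fillna(s.mean()))  # mean imputation
print(s.ffill())           # forward filling
print(s.bfill())           # backward filling
print(s.dropna())          # deletion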

3. Techniques Used for Handling Outliers


Outliers are extreme values that differ significantly from the rest.

Detection Methods:
- Z-score: Values with z > 3 or z < -3 are considered outliers.
- IQR Method: Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
- Boxplots: Visual detection.
- Machine Learning Methods: Isolation Forest, One-Class SVM.

Handling Methods:
- Removal: If clearly erroneous.
- Transformation: Log, square root, or winsorization.
- Imputation: Replace with mean/median.
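
Example (a minimal sketch of the IQR rule above with NumPy; the data is a toy set with one deliberate outlier):

import numpy as np

data = np.array([12, 14, 15, 15, 16, 18, 95])  # 95 is a deliberate outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(data[(data < lower) | (data > upper)])  # detection
print(np.clip(data, lower, upper))            # winsorization-style capping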

4. Differences Between Accuracy, Precision, Recall, and F1-Score


Metric | Formula | Focus
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness
Precision | TP / (TP + FP) | Quality of positive predictions
Recall | TP / (TP + FN) | Coverage of actual positives
F1-score | 2 * (Precision * Recall) / (Precision + Recall) | Balance between precision and recall

When to Use:
- Accuracy: For balanced datasets.
- Precision: When false positives are costly (e.g., spam detection).
- Recall: When false negatives are critical (e.g., disease diagnosis).
- F1-score: When you need a balance (imbalanced data).
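
Example (computing all four metrics with scikit-learn on toy labels):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # toy ground-truth labels
y_pred = [1, 0, 0, 1, 1, 1]  # toy predictions
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall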

5. ROC Curve and AUC in Binary Classification


ROC Curve:
Plots True Positive Rate (TPR) vs False Positive Rate (FPR) at different thresholds.

AUC (Area Under Curve):


- Ranges from 0 to 1; 0.5 corresponds to random guessing.
- Higher AUC = better model.

Advantages Over Accuracy:


- Works well with imbalanced datasets.
- Evaluates performance across all thresholds.
- Highlights trade-off between sensitivity and specificity.
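
Example (plotting an ROC curve and computing AUC with scikit-learn; the labels and scores are toy values):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_true = [0, 0, 1, 1]             # toy labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # toy predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()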

6. What is Cross-Validation? Types and Pros/Cons


Definition:
Cross-validation divides the data into parts, training and testing the model multiple times to get an average performance estimate.

Types:
- K-Fold: Divides data into k parts; trains on k-1, tests on 1.
- Stratified K-Fold: Preserves class distribution.
- Leave-One-Out (LOOCV): One sample for testing, rest for training.
- Repeated K-Fold: Repeats k-fold multiple times for reliability.

Advantages:
- Reduces overfitting.
- Provides robust performance estimate.

Disadvantages:
- Computationally expensive.
- May not suit small datasets.
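
Example (5-fold stratified cross-validation with scikit-learn on the built-in iris dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5)  # preserves class distribution in each fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())  # average performance and its spread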

1. Central Tendency and Dispersion Measures with Example


Central Tendency:
- Mean: Average value.
- Median: Middle value.
- Mode: Most frequent value.

Dispersion:
- Range: Max - Min.
- Variance: Average of squared differences from the mean.
- Standard Deviation: Square root of variance.

Example:
import numpy as np

scores = [45, 50, 55, 60, 65, 70, 75]
mean = np.mean(scores)      # average value
median = np.median(scores)  # middle value
std_dev = np.std(scores)    # population standard deviation (ddof=0)
print(mean, median, std_dev)

2. Hypothesis Testing with Example


Definition:
A method for making inferences about population parameters based on sample data.

Steps:
1. Formulate H0 and H1 (null and alternative).
2. Choose a significance level (commonly α = 0.05).
3. Select test (e.g., t-test).
4. Calculate test statistic.
5. Compare with critical value or p-value.
6. Interpret result.

Example: Testing whether a new drug lowers BP more than the old one using a two-sample t-test.
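
A sketch of this test with SciPy (the blood-pressure reductions below are hypothetical):

from scipy import stats

# Hypothetical reductions in blood pressure (mmHg) for each drug
new_drug = [12, 15, 14, 16, 13, 15]
old_drug = [10, 11, 12, 10, 11, 12]

t_stat, p_value = stats.ttest_ind(new_drug, old_drug)
print(t_stat, p_value)
if p_value < 0.05:
    print("Reject H0: mean reduction differs between drugs")
else:
    print("Fail to reject H0")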

3. Matplotlib Plots with Code


import matplotlib.pyplot as plt

# Line plot
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Line Plot")
plt.show()

# Bar plot
plt.bar(['A', 'B', 'C'], [10, 20, 15])
plt.title("Bar Plot")
plt.show()

# Histogram
plt.hist([1,1,2,3,3,3,4,5])
plt.title("Histogram")
plt.show()

# Scatter plot
plt.scatter([1,2,3], [4,5,6])
plt.title("Scatter Plot")
plt.show()

4. Seaborn Plots with Code


import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
df = sns.load_dataset("tips")

# Scatterplot
sns.scatterplot(x="total_bill", y="tip", data=df)
plt.title("Scatterplot")
plt.show()

# Heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True)  # correlate numeric columns only
plt.title("Heatmap")
plt.show()

# Boxplot
sns.boxplot(x="day", y="total_bill", data=df)
plt.title("Boxplot")
plt.show()

# Violin plot
sns.violinplot(x="day", y="total_bill", data=df)
plt.title("Violin Plot")
plt.show()

5. Visualize and Remove Outliers Using Box Plot and Z-Score


import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.array([1, 2, 3, 4, 5, 100]) # Outlier = 100


z_scores = stats.zscore(data)
# Threshold of 2 rather than 3: with only n = 6 points, |z| cannot exceed sqrt(n - 1) ≈ 2.24
outliers = data[np.abs(z_scores) > 2]
print("Outliers:", outliers)

# Boxplot
plt.boxplot(data)
plt.title("Box Plot")
plt.show()

# Remove outliers
cleaned = data[np.abs(z_scores) <= 2]
print("Cleaned data:", cleaned)

6. Multiple Linear Regression Algorithm and Assumptions


Algorithm:
Fit a linear equation y = β0 + β1x1 + β2x2 + ... + βnxn + ε.
Use Ordinary Least Squares (OLS) to minimize residual sum of squares.

Assumptions:
- Linearity
- Independence of errors
- Homoscedasticity (equal variance)
- Normal distribution of errors
- No multicollinearity
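
Example (fitting an OLS model with scikit-learn; the data is a toy set built to roughly follow y = 3 + 2x1 + 0.5x2):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 10], [2, 8], [3, 12], [4, 9], [5, 11]])  # two features
y = np.array([10.0, 11.5, 15.0, 15.5, 18.5])

model = LinearRegression().fit(X, y)  # OLS: minimizes residual sum of squares
print(model.intercept_, model.coef_)  # estimates of β0 and (β1, β2)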

7. Decision Tree with Example


Definition:
A supervised ML algorithm that splits data based on feature conditions.

Example:
Predicting if a customer buys a car based on income and age.

from sklearn.tree import DecisionTreeClassifier

# Toy training data: [income in $1000s, age] -> buys a car (1) or not (0)
X_train = [[30, 25], [60, 40], [80, 50], [20, 30]]
y_train = [0, 1, 1, 0]

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.predict([[70, 45]]))  # prediction for a new customer

Advantages: easy to interpret, models non-linear relationships.

8. Random Forest with Example


Definition:
An ensemble method that trains many decision trees on random subsets of the data and features, then aggregates their predictions (majority vote for classification).

Example:

from sklearn.ensemble import RandomForestClassifier

# X_train, y_train as in the decision tree example above
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

Advantages: better generalization than a single tree; robust to noise and outliers.

9. Model Selection and Techniques


Definition:
Model selection is the process of choosing the best model and hyperparameter settings for a given task.

Techniques:
- Cross-Validation
- Grid Search
- Random Search
- Bayesian Optimization
- AIC/BIC Scores
- Validation Curves

Goal:
Ensure the model generalizes well to unseen data.
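
Example (grid search over random forest hyperparameters with cross-validation; the parameter grid is illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)  # cross-validates every parameter combination
print(search.best_params_, search.best_score_)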
