Complete Data Science Questions
Steps Involved:
- Data Cleaning: Handle missing values, remove duplicates, correct errors.
- Data Transformation: Normalize or scale numerical data; encode categorical variables.
- Feature Engineering: Create new features, combine or split existing ones.
- Data Reduction: Use techniques like PCA, LDA, or feature selection to reduce dimensionality.
- Data Integration: Combine data from multiple sources.
- Data Discretization: Convert continuous data into categorical bins.
Importance:
Enhances data quality, reduces noise, and boosts model accuracy.
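A minimal sketch of the transformation and encoding steps with pandas and scikit-learn; the toy DataFrame and its column names (income, city) are illustrative, not from the source:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [35000, 52000, 61000, 48000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Scale the numeric column to zero mean and unit variance
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# One-hot encode the categorical column
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)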
Tasks:
- Missing Values: Imputation (mean, median), deletion, or forward/backward filling.
- Outliers: Detect with z-score, boxplot; remove or transform.
- Noise Handling: Use smoothing techniques or binning.
- Normalization/Scaling: StandardScaler or MinMaxScaler to bring values to a common scale.
- Type Conversion & Deduplication: Convert data types; remove duplicate rows.
Purpose:
Improves data integrity and model reliability.
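A minimal sketch of these cleaning tasks with pandas; the toy DataFrame (columns age, salary) is illustrative:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 30, 30],
    "salary": ["50000", "60000", "55000", "55000"],
})

df["age"] = df["age"].fillna(df["age"].mean())  # mean imputation
df["salary"] = df["salary"].astype(int)         # type conversion
df = df.drop_duplicates()                       # remove duplicate rows
print(df)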
Detection Methods:
- Z-score: Values with z > 3 or z < -3 are considered outliers.
- IQR Method: Values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR.
- Boxplots: Visual detection.
- Machine Learning Methods: Isolation Forest, One-Class SVM.
Handling Methods:
- Removal: If clearly erroneous.
- Transformation: Log, square root, or winsorization.
- Imputation: Replace with mean/median.
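A minimal sketch of IQR-based detection followed by winsorization via clipping; the data values are illustrative:
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95])  # 95 looks like an outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Outliers:", data[(data < lower) | (data > upper)])
print("Winsorized:", np.clip(data, lower, upper))  # cap extreme values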
When to Use:
- Accuracy: For balanced datasets.
- Precision: When false positives are costly (e.g., spam detection).
- Recall: When false negatives are critical (e.g., disease diagnosis).
- F1-score: When you need a balance between precision and recall (useful for imbalanced data).
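A minimal sketch computing all four metrics with scikit-learn on illustrative labels:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))  # penalizes false positives
print("Recall:", recall_score(y_true, y_pred))        # penalizes false negatives
print("F1:", f1_score(y_true, y_pred))                # harmonic mean of the two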
Types:
- K-Fold: Divides data into k parts; trains on k-1, tests on 1.
- Stratified K-Fold: Preserves class distribution.
- Leave-One-Out (LOOCV): One sample for testing, rest for training.
- Repeated K-Fold: Repeats k-fold multiple times for reliability.
Advantages:
- Reduces overfitting.
- Provides robust performance estimate.
Disadvantages:
- Computationally expensive.
- Performance estimates can be unstable on very small datasets.
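A minimal sketch of stratified k-fold scoring with scikit-learn on the bundled iris dataset; the choice of model is illustrative:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())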
Dispersion:
- Range: Max - Min.
- Variance: Average of squared differences from the mean.
- Standard Deviation: Square root of variance.
Example:
import numpy as np

scores = [45, 50, 55, 60, 65, 70, 75]
mean = np.mean(scores)                         # central tendency
median = np.median(scores)
value_range = np.max(scores) - np.min(scores)  # dispersion: range
variance = np.var(scores)                      # dispersion: population variance
std_dev = np.std(scores)                       # dispersion: population standard deviation
print(mean, median, value_range, variance, std_dev)
Steps:
1. Formulate H0 and H1 (the null and alternative hypotheses).
2. Choose a significance level (commonly α = 0.05).
3. Select test (e.g., t-test).
4. Calculate test statistic.
5. Compare with critical value or p-value.
6. Interpret result.
Example: Testing whether a new drug lowers blood pressure more than the old one using a two-sample t-test.
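A minimal sketch of that two-sample t-test with SciPy; the blood-pressure reductions below are made-up illustrative values:
from scipy import stats

new_drug = [12, 15, 14, 13, 16, 15]  # BP reduction with the new drug
old_drug = [10, 11, 9, 12, 10, 11]   # BP reduction with the old drug

t_stat, p_value = stats.ttest_ind(new_drug, old_drug)  # two-sided by default
print("t =", t_stat, "p =", p_value)
if p_value < 0.05:
    print("Reject H0: the means differ significantly.")
else:
    print("Fail to reject H0.")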
import matplotlib.pyplot as plt

# Line plot
plt.plot([1, 2, 3], [4, 5, 6])
plt.title("Line Plot")
plt.show()
# Bar plot
plt.bar(['A', 'B', 'C'], [10, 20, 15])
plt.title("Bar Plot")
plt.show()
# Histogram
plt.hist([1,1,2,3,3,3,4,5])
plt.title("Histogram")
plt.show()
# Scatter plot
plt.scatter([1,2,3], [4,5,6])
plt.title("Scatter Plot")
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
df = sns.load_dataset("tips")
# Scatterplot
sns.scatterplot(x="total_bill", y="tip", data=df)
plt.title("Scatterplot")
plt.show()
# Heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True)  # correlate numeric columns only
plt.title("Heatmap")
plt.show()
# Boxplot
sns.boxplot(x="day", y="total_bill", data=df)
plt.title("Boxplot")
plt.show()
# Violin plot
sns.violinplot(x="day", y="total_bill", data=df)
plt.title("Violin Plot")
plt.show()
import numpy as np
import matplotlib.pyplot as plt

data = np.array([45, 50, 55, 60, 65, 70, 200])  # 200 is an obvious outlier

# Boxplot
plt.boxplot(data)
plt.title("Box Plot")
plt.show()

# Remove outliers: keep values within 2 standard deviations of the mean
z_scores = (data - np.mean(data)) / np.std(data)
cleaned = data[np.abs(z_scores) <= 2]
print("Cleaned data:", cleaned)
Assumptions:
- Linearity
- Independence of errors
- Homoscedasticity (equal variance)
- Normal distribution of errors
- No multicollinearity
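A minimal sketch of checking the normal-errors assumption from residuals; the data and the use of a Shapiro-Wilk test are illustrative choices, not prescribed by the source:
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 13.9, 16.1])

slope, intercept = np.polyfit(x, y, 1)   # fit a simple linear model
residuals = y - (slope * x + intercept)

stat, p = stats.shapiro(residuals)       # Shapiro-Wilk normality test
print("Shapiro-Wilk p-value:", p)        # large p: no evidence against normality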
Example:
Predicting whether a customer buys a car based on income and age (a binary outcome, which calls for logistic regression).
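A minimal sketch with scikit-learn, assuming a tiny made-up dataset of income (in thousands), age, and purchase outcome:
from sklearn.linear_model import LogisticRegression

X = [[30, 25], [45, 32], [60, 40], [80, 50], [35, 28], [70, 45]]  # [income, age]
y = [0, 0, 1, 1, 0, 1]  # 1 = bought a car

model = LogisticRegression().fit(X, y)
print(model.predict([[55, 38]]))         # predicted class
print(model.predict_proba([[55, 38]]))   # class probabilities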
Techniques:
- Cross-Validation
- Grid Search
- Random Search
- Bayesian Optimization
- AIC/BIC Scores
- Validation Curves
Goal:
Ensure the model generalizes well to unseen data.
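A minimal sketch combining grid search with cross-validation on the bundled iris dataset; the model and parameter grid are illustrative:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold CV for each candidate
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)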