Data Science
Statistical significance is
A. The science of collecting, organizing, and applying numerical facts
B. Measure of the probability that a certain hypothesis is incorrect given certain observations
C. One of the defining aspects of a data warehouse, which is specially built around all the
existing applications of the operational data
D. None of these
1) Explain Data Science
Ans:
Data Science is using data to find useful information. It combines math, coding, and machine
learning to solve problems.
Steps:
Examples:
● Fraud detection
● Movie suggestions (Netflix)
● Health predictions
Category Tools
Underfitting Conditions:
Insufficient training
Too much regularization
6) Summarize the reason why Python is used for data cleaning in Data Science
Ans:
Python is used for data cleaning because:
Ans:
Supervised Learning:
Supervised learning is a machine learning technique that uses labeled data to train algorithms
to predict outcomes.
You train the model with labeled data (input and the correct output).
Goal: The model learns to predict the output for new, unseen data.
Example: Email Spam Classification
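The spam example can be sketched in a few lines of Python. This is only a toy illustration, assuming a made-up set of four labeled emails and a simple word-counting rule (the names TRAINING_EMAILS, train, and predict are hypothetical); real spam filters use far more data and stronger models.

```python
from collections import Counter

# Hypothetical labeled training data: (email text, correct label).
TRAINING_EMAILS = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached for review", "ham"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        word_counts[label].update(text.lower().split())
    return word_counts


def predict(word_counts, text):
    """Label new text by which class has seen its words more often."""
    words = text.lower().split()
    spam_score = sum(word_counts["spam"][w] for w in words)
    ham_score = sum(word_counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"


model = train(TRAINING_EMAILS)
print(predict(model, "claim your free prize"))    # unseen email -> spam
print(predict(model, "monday project meeting"))   # unseen email -> ham
```

The model only sees inputs paired with correct outputs during training, then predicts labels for emails it has never seen.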
Unsupervised Learning:
Unsupervised learning is a machine learning technique that analyzes unlabeled data and finds
structure in it without human-provided answers.
You train the model with unlabeled data (just inputs, no outputs).
Goal: The model finds hidden patterns or groups in the data.
Example: Customer Segmentation
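Customer segmentation can be sketched with a very small k-means style grouping. This is a simplified illustration, assuming made-up yearly spend values and two hand-picked starting centers (kmeans_1d is a hypothetical helper, not a library function).

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its group's mean."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical yearly spend per customer; note there are no labels.
spend = [120, 150, 130, 900, 950, 1000]
centers, groups = kmeans_1d(spend, centers=[100, 500])
print(groups)   # [[120, 150, 130], [900, 950, 1000]]
```

No one tells the algorithm which customers are "low" or "high" spenders; the two groups come from the data alone.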
Data mining is the process of sorting through large datasets to discover hidden patterns, trends,
and relationships that can help solve business problems through data analysis.
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, incomplete, or duplicate data to
improve its quality.
Example:
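A minimal Python sketch of these fixes, assuming a few made-up records (raw_records and clean are hypothetical names). It tidies names, drops rows with missing or invalid ages, and removes duplicates.

```python
# Hypothetical raw records with a duplicate, a missing value, and messy text.
raw_records = [
    {"name": " Alice ", "age": "30"},
    {"name": "Bob", "age": ""},          # incomplete: age missing
    {"name": "alice", "age": "30"},      # duplicate of the first row after tidying
    {"name": "Carol", "age": "abc"},     # incorrect: age is not a number
    {"name": "Dave", "age": "45"},
]


def clean(records):
    """Tidy names, drop rows with missing or invalid ages, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        name = row["name"].strip().title()
        age_text = row["age"].strip()
        if not age_text.isdigit():       # skips both empty and non-numeric ages
            continue
        key = (name, int(age_text))
        if key in seen:                  # skip rows already kept
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": int(age_text)})
    return cleaned


cleaned_records = clean(raw_records)
print(cleaned_records)
# [{'name': 'Alice', 'age': 30}, {'name': 'Dave', 'age': 45}]
```

Dropping bad rows is only one option; in practice missing values are often filled in instead, depending on the analysis.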
Sampling
Sampling is the process of selecting a smaller, representative subset of data from a larger
dataset for analysis.
Example:
● From a dataset of 100,000 customers, you randomly select a sample of 1,000 customers
to analyze sales patterns.
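The customer example above can be sketched with Python's standard random module. The customer IDs here are made up, and the seed value 42 is an arbitrary choice so that the run is repeatable.

```python
import random

# Hypothetical customer IDs standing in for a 100,000-row dataset.
customer_ids = list(range(1, 100_001))

rng = random.Random(42)                      # fixed seed for repeatable results
sample = rng.sample(customer_ids, k=1_000)   # simple random sample without replacement

print(len(sample), len(set(sample)))         # 1000 1000 (no customer picked twice)
```

Because every customer has the same chance of being picked, the sample can stand in for the full dataset in the analysis.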
Key Steps:
Ans:
Data Sampling is the process of selecting a smaller subset of data from a larger dataset for
analysis. It helps make analysis more manageable, especially when dealing with large datasets.
Types:
Ans:
Eigenvectors are special nonzero vectors that stay on the same line when a matrix is applied to
them: only their length changes (a negative eigenvalue also flips them to point the opposite way).
Eigenvalues are the numbers that tell us how much each eigenvector is stretched or shrunk.
Example:
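A small worked check in Python, assuming the 2x2 matrix A below (chosen only for illustration). Multiplying A by an eigenvector gives back the same vector scaled by its eigenvalue, while an ordinary vector changes direction.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


A = [[2, 1],
     [1, 2]]

v1 = [1, 1]     # eigenvector with eigenvalue 3
v2 = [1, -1]    # eigenvector with eigenvalue 1
w = [1, 0]      # not an eigenvector of A

print(mat_vec(A, v1))   # [3, 3]   = 3 * v1
print(mat_vec(A, v2))   # [1, -1]  = 1 * v2
print(mat_vec(A, w))    # [2, 1]   is not a multiple of w
```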
Advantages
Disadvantages
Expected value: Example - the average roll of a die = 3.5. Stays the same for a given probability.
Sample mean: Example - rolling a die 10 times and averaging the results. Changes with different data samples.
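The die example can be checked with a short simulation. This is a sketch assuming a fair six-sided die; sample_mean is a hypothetical helper, and the random generator is left unseeded on purpose so the averages differ between runs.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Theoretical average of a fair die: each face weighted by its probability 1/6.
expected_value = sum(Fraction(face, 6) for face in faces)
print(float(expected_value))   # 3.5, the same on every run


def sample_mean(rolls, rng):
    """Average of `rolls` simulated die rolls."""
    return sum(rng.choice(faces) for _ in range(rolls)) / rolls


rng = random.Random()          # unseeded, so every run gives different samples
means = [sample_mean(10, rng) for _ in range(3)]
print(means)                   # three averages of 10 rolls each; they vary from run to run
```

The first number is fixed by the probabilities; the others depend on which rolls happened to come up.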
Bias-Variance Trade-off
● Bias (Too simple) → Model makes mistakes because it doesn’t learn enough.
● Variance (Too complex) → Model learns too much, and makes mistakes on new data.
Goal:
Find a balance where the model is not too simple or too complex.
Ans:
Confusion Matrix
A confusion matrix shows how well a classification model predicts. It compares actual vs.
predicted results by counting the correct and incorrect predictions for each class.
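As a rough sketch, the four counts of a Yes/No confusion matrix can be computed like this. The labels below are made up, and the names TP, FP, FN, and TN (true/false positives and negatives) are the usual shorthand rather than terms taken from these notes.

```python
def confusion_matrix(actual, predicted, positive="Yes"):
    """Count the four outcomes of a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical labels for six cases.
actual    = ["Yes", "Yes", "No", "No",  "Yes", "No"]
predicted = ["Yes", "No",  "No", "Yes", "Yes", "No"]

cm = confusion_matrix(actual, predicted)
print(cm)   # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}

accuracy = (cm["TP"] + cm["TN"]) / len(actual)   # share of correct predictions
print(accuracy)
```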
Table Example:
Predicted: Yes Predicted: No
Simple Meaning:
25) Compare between correlation and covariance
Ans:
Correlation: Shows how strongly two variables are related.
Covariance: Shows how two variables change together.
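A short Python sketch of the difference, assuming made-up study-hours and score data and the usual sample formulas (n - 1 divisor for covariance, Pearson's formula for correlation). Rescaling one variable changes the covariance but leaves the correlation unchanged.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means (n - 1 divisor)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


hours = [1, 2, 3, 4, 5]                    # hypothetical data
score = [2, 4, 6, 8, 10]
score_scaled = [s * 100 for s in score]    # same relationship, different units

print(covariance(hours, score), covariance(hours, score_scaled))    # 5.0 500.0
print(correlation(hours, score), correlation(hours, score_scaled))  # both 1.0
```

Covariance depends on the units of the variables, while correlation always falls between -1 and 1, which makes it easier to compare across datasets.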