Data Science

The document outlines key concepts in data science, including definitions of data science, big data, and types of data (structured, semi-structured, unstructured). It also covers the data science life cycle, the role of Python libraries like NumPy, Pandas, and Matplotlib, and various machine learning techniques such as supervised learning, clustering, and regression. Additionally, it discusses the importance of data in decision-making and model evaluation metrics.

2-Mark Questions

1. What is data science?


Data science is the field that combines statistical analysis, computer science, and domain
knowledge to extract insights and knowledge from structured and unstructured data. It
involves processes like data collection, cleaning, analysis, and modeling.

2. Define big data.


Big data refers to extremely large and complex datasets that are difficult to process and
analyze using traditional data processing tools. It is characterized by the 3 Vs: Volume (large
amounts of data), Velocity (fast data generation and processing), and Variety (different data
types and sources).

3. Mention two uses of data analytics in business.


Customer Behavior Analysis – Helps businesses understand customer preferences and
improve marketing strategies.
Operational Efficiency – Identifies inefficiencies and optimizes processes to reduce costs
and improve performance.

4. What is structured data?


Structured data is highly organized data that is stored in a fixed format, such as rows and
columns in databases or spreadsheets, making it easy to enter, query, and analyze.

5. Give an example of unstructured data.


An example of unstructured data is a social media post, such as a tweet or a Facebook
status update, which may include text, images, and videos without a predefined format.

6. What is the purpose of the Jupyter Notebook?


Jupyter Notebook is an open-source tool used for writing and running code interactively. It
helps in data analysis, visualization, and documentation by combining code, text, and output
in one document.

7. What does NumPy stand for?

NumPy stands for Numerical Python. It is a Python library used for numerical computations,
especially working with arrays and matrices.

8. What is a DataFrame in pandas?


A DataFrame in pandas is a two-dimensional, size-mutable, and labeled data structure that
can store data of different types (e.g., integers, strings, and floats) in rows and columns,
similar to a table or spreadsheet.
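
A minimal sketch (made-up values, assuming pandas is installed) of building and inspecting a DataFrame:

    import pandas as pd

    # Build a small DataFrame from a dictionary; columns can hold different types
    df = pd.DataFrame({
        "name": ["Asha", "Ravi", "Meena"],
        "age": [25, 31, 28],
        "score": [88.5, 92.0, 79.5],
    })

    print(df.head())           # first rows of the table
    print(df.dtypes)           # data type of each column
    print(df["score"].mean())  # column-wise computation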

9. Define correlation.
Correlation is a statistical measure that describes the strength and direction of the relationship
between two variables. A positive correlation means both variables increase together, while a
negative correlation means one variable increases as the other decreases.

10. What is a heat map?


A heat map is a data visualization tool that uses color gradients to represent the values of a
matrix or data. It is commonly used to display the intensity of data points, with varying colors
indicating different levels of values or frequencies.
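
A small sketch, assuming Seaborn and Matplotlib are installed (random illustrative data), of drawing a heat map of a correlation matrix:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Illustrative numeric data
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

    # Color-encode the correlation matrix; annot prints the value in each cell
    sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
    plt.title("Correlation heat map")
    plt.show()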
11. What is supervised learning?
Supervised learning is a type of machine learning where the model is trained on labeled data,
meaning the input data is paired with the correct output. The goal is for the model to learn the
relationship between inputs and outputs to make accurate predictions on new, unseen data.

12. Define classification.


Classification is a type of supervised learning in which the goal is to assign data points into
predefined categories or classes based on input features. Examples include classifying emails
as spam or not spam, or identifying an image as a cat or a dog.

13. What is linear regression?


Linear regression is a statistical method used to model the relationship between a dependent
variable and one or more independent variables by fitting a linear equation to the observed
data. It aims to predict the value of the dependent variable based on the independent variable.
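
A minimal sketch with scikit-learn (made-up numbers) that fits a simple linear regression and predicts a new value:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Made-up data: y is roughly 3*x + 2 with a little noise
    X = np.array([[1], [2], [3], [4], [5]], dtype=float)
    y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)  # estimated slope and intercept
    print(model.predict([[6.0]]))         # prediction for a new input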

14. Name two supervised learning algorithms.


 Linear Regression
 Support Vector Machine (SVM)

15. What is clustering?


Clustering is an unsupervised machine learning technique that groups data points into clusters
based on their similarities. The goal is to organize data into meaningful groups where data
points in the same group are more similar to each other than to those in other groups.

16. Mention one use of k-means clustering.


One use of k-means clustering is customer segmentation in marketing, where customers are
grouped based on similar purchasing behaviors, allowing businesses to target specific groups
with tailored marketing strategies.
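
A brief sketch, assuming scikit-learn is available (hypothetical customer features), of grouping customers with k-means:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical features per customer: [annual_spend, visits_per_month]
    customers = np.array([
        [200, 2], [220, 3], [800, 10], [760, 12], [400, 5], [420, 6],
    ])

    # Group customers into 3 segments; n_init controls random restarts
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)           # cluster index assigned to each customer
    print(kmeans.cluster_centers_)  # centroid of each segment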

17. What is sentiment analysis?


Sentiment analysis is the process of using natural language processing (NLP) and machine
learning to identify and extract subjective information from text, determining whether the
sentiment expressed is positive, negative, or neutral. It is often used to analyze opinions in
social media, reviews, or customer feedback.

18. Define NLP.


NLP (Natural Language Processing) is a field of artificial intelligence that focuses on the
interaction between computers and human language. It involves tasks like text analysis,
speech recognition, and language generation to enable machines to understand, interpret, and
respond to human language.

19. What is tokenization?


Tokenization is the process of splitting text into smaller units called tokens, which can be
words, phrases, or characters. It is a key step in natural language processing (NLP) that helps
in analyzing and processing text data efficiently.
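
A short sketch using NLTK (assuming the library and its tokenizer data are installed) that tokenizes a sentence into words:

    import nltk
    from nltk.tokenize import word_tokenize

    # Download the tokenizer models once (newer NLTK versions may also need 'punkt_tab')
    nltk.download("punkt", quiet=True)

    text = "Tokenization splits text into smaller units called tokens."
    tokens = word_tokenize(text)
    print(tokens)  # ['Tokenization', 'splits', 'text', ...]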

20. What is the role of data in decision-making?


Data plays a crucial role in decision-making by providing objective insights, identifying
trends, and supporting evidence-based choices. It helps businesses and individuals make
informed decisions, reduce uncertainties, and optimize strategies for better outcomes.
21. What is model overfitting?
Model overfitting occurs when a machine learning model learns the noise and details in the
training data too well, making it perform exceptionally well on the training set but poorly on
new, unseen data. It indicates that the model has become too complex and has learned
patterns that do not generalize to other data.

22. What is the purpose of cross-validation?


The purpose of cross-validation is to assess the performance and generalizability of a
machine learning model by dividing the data into multiple subsets, training the model on
some subsets, and testing it on others. This helps to prevent overfitting and provides a more
accurate evaluation of the model’s ability to perform on unseen data.
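
A minimal sketch with scikit-learn, using the built-in Iris dataset as stand-in data, of 5-fold cross-validation:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    # 5-fold CV: train on 4 folds, test on the held-out fold, repeat 5 times
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores)         # accuracy on each fold
    print(scores.mean())  # averaged estimate of generalization performance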

23. Mention one difference between training and testing data.


Training data is used to train a machine learning model, allowing it to learn patterns and
relationships, while testing data is used to evaluate the model's performance and
generalizability on unseen data.

24. What is a confusion matrix?


A confusion matrix is a table used to evaluate the performance of a classification model. It
shows the number of correct and incorrect predictions, with values for True Positives, True
Negatives, False Positives, and False Negatives, helping to assess accuracy, precision, recall,
and other performance metrics.

25. Define accuracy in model evaluation.


Accuracy is a metric used to evaluate a model's performance, calculated as the ratio of correct
predictions to the total number of predictions. It indicates the overall correctness of the
model, expressed as:
Accuracy = Correct Predictions / Total Predictions
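
A short sketch with scikit-learn (hypothetical labels) computing a confusion matrix and accuracy:

    from sklearn.metrics import accuracy_score, confusion_matrix

    # Hypothetical true labels and model predictions (1 = positive, 0 = negative)
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
    print(accuracy_score(y_true, y_pred))    # correct predictions / total predictions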

26. What is the use of Matplotlib?


Matplotlib is a Python library used for creating static, interactive, and animated
visualizations, such as plots, charts, and graphs. It helps in visualizing data to identify
patterns, trends, and relationships in a more understandable and insightful way.

27. What is data wrangling?


Data wrangling is the process of cleaning, transforming, and organizing raw data into a
structured and usable format. It involves handling missing values, correcting errors, and
converting data types to prepare the data for analysis or modeling.

28. What is a scatter plot?


A scatter plot is a type of data visualization that displays individual data points on a two-
dimensional graph, with one variable plotted on the x-axis and the other on the y-axis. It is
used to show relationships or correlations between two variables.
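
A minimal Matplotlib sketch (made-up numbers) of a scatter plot relating two variables:

    import matplotlib.pyplot as plt

    # Hypothetical data: advertising spend vs. sales
    spend = [10, 20, 30, 40, 50, 60]
    sales = [25, 37, 48, 55, 70, 78]

    plt.scatter(spend, sales)  # one point per (x, y) pair
    plt.xlabel("Advertising spend")
    plt.ylabel("Sales")
    plt.title("Relationship between spend and sales")
    plt.show()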

29. Define hypothesis testing.


Hypothesis testing is a statistical method used to determine whether there is enough evidence
to reject a null hypothesis, based on sample data. It involves comparing the observed data to
what we would expect under the null hypothesis to make inferences about a population.
30. What is a p-value?
A p-value is a statistical measure that helps determine the significance of the results in
hypothesis testing. It represents the probability of obtaining results at least as extreme as the
observed results, assuming the null hypothesis is true. A lower p-value indicates stronger
evidence against the null hypothesis.
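
A small sketch, assuming SciPy is installed (illustrative numbers), of a two-sample t-test and its p-value:

    from scipy import stats

    # Hypothetical measurements for two versions of a web page
    group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
    group_b = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]

    # Null hypothesis: the two groups have the same mean
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(t_stat, p_value)  # a small p-value (e.g. < 0.05) is evidence against the null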

31. What is logistic regression?


Logistic regression is a statistical method used for binary classification tasks. It models the
relationship between a dependent variable and one or more independent variables by
predicting the probability that a given input belongs to a particular class, using the logistic
function to output values between 0 and 1.

32. What is feature selection?


Feature selection is the process of selecting a subset of relevant features (variables) from a
larger set of data to improve model performance. It helps in reducing the complexity of the
model, minimizing overfitting, and improving accuracy by focusing on the most important
features.
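
A brief sketch with scikit-learn, using the Iris dataset as stand-in data, that keeps only the two most informative features:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # Keep the 2 features with the highest ANOVA F-score against the target
    selector = SelectKBest(score_func=f_classif, k=2)
    X_selected = selector.fit_transform(X, y)

    print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
    print(selector.get_support())           # boolean mask of the kept features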

5-Mark Questions

1. Explain the data science life cycle.

The Data Science Life Cycle is a structured approach to solving data-driven problems. It
involves multiple stages, each contributing to the analysis and understanding of data to derive
meaningful insights. Below are the key stages:

1. Problem Definition
The first step is to clearly define the problem or objective. This stage involves understanding
the business requirements and translating them into a data science problem. The goal is to
determine what questions need to be answered or what predictions need to be made.
2. Data Collection
In this phase, data is gathered from various sources, such as databases, external APIs,
surveys, or sensor data. The data can be structured or unstructured and may come in different
formats. It's crucial to ensure that the collected data aligns with the defined problem.
3. Data Cleaning and Preprocessing
Raw data often contains noise, missing values, or errors. Data cleaning involves identifying
and correcting issues such as missing data, duplicates, or inconsistencies. Preprocessing
includes normalizing, transforming, and encoding data to make it suitable for analysis and
modeling.
4. Data Exploration and Analysis
This stage involves exploring the data using descriptive statistics and visualizations (e.g.,
histograms, scatter plots) to understand its structure and identify patterns or trends.
Exploratory Data Analysis (EDA) is crucial for identifying important features, correlations,
or anomalies in the data.
5. Modeling and Algorithm Selection
Once the data is clean and well-understood, data scientists apply various statistical and
machine learning models. These could include regression, classification, clustering, or other
techniques, depending on the problem. This stage involves training models on the data and
selecting the best algorithm based on evaluation metrics.
6. Model Evaluation
After building the model, it's tested using new data (validation or test set) to assess its
performance. Common evaluation metrics include accuracy, precision, recall, F1-score, or
AUC, depending on the type of problem (classification, regression, etc.).
7. Model Deployment
Once a model has been validated and is performing well, it is deployed to a production
environment where it can make predictions on new, real-time data. Deployment may involve
integrating the model into an application or API for continuous use.
8. Monitoring and Maintenance
After deployment, the model’s performance is continuously monitored to ensure it remains
accurate over time. Data changes, concept drift, or new patterns might require model
retraining or fine-tuning. Regular updates and maintenance are essential for maintaining
model relevance.

2. Describe the types of data: structured, semi-structured, and unstructured. (5 marks)

Data can be categorized into three main types based on its format and organization:
structured, semi-structured, and unstructured. Here's a breakdown of each type:

1. Structured Data

Structured data is highly organized and can be easily stored in a predefined format, such as
rows and columns in databases or spreadsheets. This type of data follows a strict schema,
which means it has clear data types, such as integers, dates, or strings, and is typically stored
in relational databases (e.g., SQL databases).

 Examples: Customer information (name, age, address), sales transactions, and inventory
data.
 Characteristics:
o Well-organized into tables or spreadsheets.
o Easy to query and analyze using SQL or other query languages.
o Can be processed by traditional data processing systems.

2. Semi-structured Data

Semi-structured data doesn’t conform to a strict schema like structured data, but it still
contains some level of organization. It often uses tags or markers to separate data elements,
making it more flexible than structured data. Semi-structured data is commonly found in
formats like XML, JSON, or CSV files, where there is some structure but not as rigidly
defined as in structured data.

 Examples: JSON files, XML documents, email (with metadata and body content), log files,
and NoSQL databases.
 Characteristics:
o Contains a hierarchical or nested structure but doesn’t follow a fixed table format.
o Can store data with varying attributes and sizes.
o Requires more advanced processing and parsing methods than structured data, but is more
flexible and scalable.

3. Unstructured Data

Unstructured data is the most flexible and complex type of data. It does not follow a
predefined data model or format, making it difficult to organize or analyze using traditional
data processing tools. Unstructured data can contain a wide variety of formats, such as text,
images, audio, and video. It often requires specialized tools (e.g., natural language
processing, image recognition) to extract meaningful information.

 Examples: Social media posts, video and audio files, emails, images, and web pages.
 Characteristics:
o Lacks a clear structure or organization.
o Difficult to search and analyze using traditional database tools.
o Requires advanced techniques such as machine learning, text mining, and image processing
to extract insights.

Summary of Differences:

Data Type       | Structure                               | Examples                                          | Tools for Analysis
Structured      | Rigid, predefined schema (tables/rows)  | SQL databases, spreadsheets, and inventory data   | SQL, relational databases
Semi-structured | Flexible structure with markers/tags    | JSON, XML, email logs, NoSQL databases            | NoSQL databases, custom parsers
Unstructured    | No predefined structure                 | Text documents, audio files, images, social media | NLP, image recognition, machine learning

3. Explain the role of Python libraries in data science. (5 marks)

Python has become one of the most popular programming languages in data science,
primarily because of the powerful libraries it offers for data manipulation, analysis,
visualization, and machine learning. These libraries simplify and speed up the data science
process, making it easier to perform tasks such as data cleaning, exploration, and modeling.
Here’s an explanation of some key Python libraries and their roles in data science:

1. NumPy (Numerical Python)


Role: NumPy is a fundamental library for numerical computing in Python. It provides
support for arrays, matrices, and large multi-dimensional data structures, enabling efficient
manipulation and computation on numerical data.

 Key Functions:
o Provides support for large, multi-dimensional arrays and matrices.
o Offers a wide range of mathematical functions to perform operations on these arrays.
o Enables fast computation, which is crucial when working with large datasets.
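
A minimal sketch (assuming NumPy is installed) of the kind of array operations described above:

    import numpy as np

    # Arrays support fast, element-wise (vectorized) arithmetic
    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = np.array([10.0, 20.0, 30.0, 40.0])

    print(a + b)              # element-wise addition
    print(a.mean(), a.std())  # built-in numerical summaries

    m = np.arange(6).reshape(2, 3)  # a 2x3 matrix
    print(m @ m.T)                  # matrix multiplication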

2. Pandas

Role: Pandas is a data manipulation and analysis library that simplifies data wrangling,
cleaning, and transformation. It provides data structures like Series (1-dimensional) and
DataFrame (2-dimensional), making it easy to handle and analyze structured data.

 Key Functions:
o Efficient handling of missing data, merging, and reshaping datasets.
o Provides tools for data aggregation, grouping, and summarization.
o Allows reading and writing data to/from various file formats (CSV, Excel, SQL, etc.).

3. Matplotlib

Role: Matplotlib is a data visualization library used to create static, animated, and interactive
plots and graphs. It is essential for visualizing data patterns, distributions, and relationships to
aid in decision-making and storytelling.

 Key Functions:
o Supports various types of plots, such as line plots, bar charts, histograms, scatter plots, etc.
o Customizable visualizations for presentations or publications.
o Enables easy integration with other libraries (e.g., Pandas, NumPy) for enhanced plotting.

4. Seaborn

Role: Seaborn is built on top of Matplotlib and provides a higher-level interface for creating
more attractive and informative statistical graphics. It simplifies the creation of complex
visualizations like heatmaps, time series plots, and regression plots.

 Key Functions:
o Offers a variety of advanced plotting functions (e.g., violin plots, box plots).
o Automatically handles aspects like color palettes and themes for better aesthetics.
o Works seamlessly with Pandas DataFrames to visualize data clearly and insightfully.

5. Scikit-learn

Role: Scikit-learn is a widely used library for machine learning in Python. It provides simple
and efficient tools for data mining and data analysis, with built-in algorithms for
classification, regression, clustering, and model evaluation.

 Key Functions:
o Includes popular machine learning algorithms like decision trees, SVM, k-means, and linear
regression.
o Provides functions for model evaluation and validation (e.g., cross-validation, metrics like
accuracy and F1-score).
o Offers preprocessing tools like feature scaling, encoding categorical variables, and
dimensionality reduction.

6. TensorFlow and Keras

Role: TensorFlow is a deep learning framework developed by Google, and Keras is an open-
source neural network library that runs on top of TensorFlow. These libraries are used for
building, training, and deploying deep learning models.

 Key Functions:
o Provides tools for building and training neural networks, including deep learning models for
image recognition, natural language processing, and more.
o Supports GPU acceleration for faster computation.
o Simplifies the creation of complex models with high-level APIs in Keras.

7. Statsmodels

Role: Statsmodels is a library for statistical modeling and hypothesis testing in Python. It
provides classes and functions for conducting statistical tests, estimating statistical models,
and performing regression analysis.

 Key Functions:
o Allows users to perform OLS (Ordinary Least Squares) regression, time-series analysis, and
ANOVA.
o Provides a wide range of statistical tests, such as t-tests, chi-squared tests, and more.
o Enables the estimation of models for time-series forecasting, linear regression, and
generalized linear models.

8. SciPy

Role: SciPy is a library for scientific and technical computing, building on NumPy. It
provides additional functionality for optimization, integration, interpolation, eigenvalue
problems, and other advanced mathematical tasks.

 Key Functions:
o Includes modules for optimization, integration, and solving differential equations.
o Provides advanced mathematical tools like linear algebra, probability distributions, and
statistical tests.
o Works well alongside NumPy for scientific computing.

9. NLTK (Natural Language Toolkit)

Role: NLTK is a powerful library for working with human language data, including text
mining, sentiment analysis, and natural language processing (NLP) tasks.
 Key Functions:
o Provides tools for tokenizing, stemming, and lemmatizing text.
o Offers algorithms for text classification, part-of-speech tagging, and sentiment analysis.
o Includes a collection of corpora and lexical resources for language processing.

4. What are the steps involved in data cleaning? (5 marks)

Data cleaning is an essential part of the data preprocessing process. It involves identifying
and correcting errors or inconsistencies in the data to ensure its quality and reliability for
analysis or modeling. The steps involved in data cleaning are as follows:

1. Handling Missing Data

 Identify Missing Values: The first step is to detect missing values in the dataset, which could
be represented as NaN, null, or blank cells.
 Impute or Remove Missing Data: Depending on the situation, missing data can be handled
in different ways:
o Imputation: Fill missing values with statistical measures like mean, median, mode, or other
relevant imputation techniques.
o Removal: Remove rows or columns with missing values, especially when the proportion of
missing data is high or imputation is not feasible.

2. Removing Duplicates

 Detect Duplicates: Identify and remove duplicate rows or records that appear more than once
in the dataset, as they can introduce bias into the analysis.
 Remove or Merge: Depending on the context, you can either remove the duplicates or
aggregate the data if needed (e.g., summing or averaging duplicate records).

3. Handling Outliers

 Identify Outliers: Outliers are data points that deviate significantly from the rest of the data
and can distort analysis or modeling results. Methods like boxplots, Z-scores, or IQR
(Interquartile Range) can be used to detect outliers.
 Handle Outliers: Once identified, outliers can be:
o Removed: If they are errors or irrelevant.
o Transformed: By applying transformations (e.g., log or square root) to reduce their effect.
o Capped: Using techniques like winsorization to limit extreme values to a threshold.

4. Standardizing and Normalizing Data

 Standardization: Scale the data to have a mean of 0 and a standard deviation of 1 (z-score
normalization). This is especially important for algorithms that depend on the scale of data
(e.g., linear regression, SVM).
 Normalization: Rescale the data to fit within a specific range, typically 0 to 1. This is useful
for algorithms that require a bounded input space, like neural networks.

5. Correcting Data Inconsistencies


 Fixing Formatting Errors: Ensure consistency in data formatting (e.g., consistent date
formats, standardized text values).
 Standardize Categorical Data: Convert categorical variables to consistent formats (e.g.,
"Yes" and "No" could be standardized to 1 and 0).
 Data Typing: Ensure that data is in the correct type (e.g., strings should be converted to
dates, or numeric values should be correctly typed as integers or floats).

6. Encoding Categorical Data

 Convert Categorical Variables: Many machine learning algorithms require numerical input.
Categorical variables can be encoded into a numeric format using methods like:
o Label Encoding: Assign a unique number to each category.
o One-Hot Encoding: Create binary columns for each category, indicating the presence of
each category.

7. Handling Inconsistent Data Types

 Check Data Types: Ensure that each column has the correct data type (e.g., numeric, string,
date). For example, if a date column is stored as text, it should be converted to a datetime
type for proper analysis.
 Conversion: Convert columns to the appropriate data type to ensure correct computations
and analyses.

8. Dealing with Irrelevant or Redundant Features

 Remove Unnecessary Features: Drop columns that do not contribute useful information to
the analysis, such as IDs or other irrelevant data.
 Feature Selection: Identify and keep only the most important features that are relevant to the
analysis or model to reduce complexity and improve model performance.

9. Data Transformation and Feature Engineering

 Create New Features: Derive new features from existing data (e.g., extracting year or month
from a date field or creating a ratio from two numerical columns).
 Binning: Group numerical values into discrete bins (e.g., age ranges like 0-18, 19-35, etc.) to
make analysis easier.
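
A compact sketch with pandas (hypothetical column names and values) combining several of these cleaning steps:

    import numpy as np
    import pandas as pd

    # Hypothetical raw data with missing values, a duplicate row, and text dates
    raw = pd.DataFrame({
        "customer": ["A", "B", "B", "C"],
        "age": [25, np.nan, np.nan, 40],
        "signup": ["2021-01-05", "2021-02-10", "2021-02-10", "not a date"],
        "subscribed": ["Yes", "No", "No", "Yes"],
    })

    df = raw.drop_duplicates().copy()                             # remove duplicate rows
    df["age"] = df["age"].fillna(df["age"].median())              # impute missing ages
    df["signup"] = pd.to_datetime(df["signup"], errors="coerce")  # fix data types
    df["subscribed"] = df["subscribed"].map({"Yes": 1, "No": 0})  # encode categories
    print(df)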

5. Discuss any three data visualization techniques. (5 marks)

Data visualization is the process of representing data in graphical or pictorial form to help
understand complex data patterns, trends, and relationships. Various visualization techniques
are used depending on the type of data and the insights you wish to extract. Below are three
common data visualization techniques:

1. Bar Chart
 Description: A bar chart is a visual representation of categorical data, where each category is
represented by a bar. The length of the bar corresponds to the value of the data. Bar charts
can be horizontal or vertical.
 Use Cases: Bar charts are typically used for comparing different categories or groups. They
are ideal when you have discrete categories and want to compare their sizes or frequencies.
 Example: A bar chart can be used to show the sales performance of different products over a
given period. Each bar would represent a different product, and the length of the bar would
indicate the sales volume.
 Advantages:
o Easy to read and interpret.
o Suitable for comparisons among categories.
 Disadvantages:
o Not ideal for displaying trends over time or relationships between variables.

2. Line Plot (Line Chart)

 Description: A line plot displays data points on a two-dimensional graph, connected by
straight lines. It is used to represent trends over a continuous range, often over time.
 Use Cases: Line charts are particularly useful for showing time-series data, where the data
points are ordered by time. They help visualize trends, fluctuations, and changes over a
specific period.
 Example: A line chart can be used to track the stock price of a company over several months.
The x-axis would represent time (e.g., months), and the y-axis would represent the stock
price.
 Advantages:
o Excellent for showing trends and changes over time.
o Allows easy comparison of multiple trends.
 Disadvantages:
o Not suitable for discrete or categorical data.
o Can become cluttered when too many lines are plotted.

3. Heatmap

 Description: A heatmap is a data visualization technique that uses color to represent the
intensity of values in a two-dimensional space. Each cell in a heatmap corresponds to a value,
and the color represents the magnitude of the value, with different color intensities indicating
different levels of data.
 Use Cases: Heatmaps are commonly used in correlation matrices, website user behavior
analysis, and geographic data visualizations. They are helpful when you need to show
patterns or correlations between variables, especially when dealing with large datasets.
 Example: A heatmap can be used to show the correlation between various factors, such as
the relationship between temperature and sales in different regions. The cells would be
colored to indicate the strength of the correlation.
 Advantages:
o Good for displaying complex, multi-variable data.
o Allows for quick visual identification of patterns, trends, and outliers.
 Disadvantages:
o May not be suitable for small datasets.
o Color choices can sometimes make it difficult to differentiate values, especially in large
datasets with small differences.
Summary of Data Visualization Techniques:

Visualization Type | Description                                                    | Use Case                                                 | Advantages
Bar Chart          | Displays categorical data with rectangular bars                | Comparing different categories (e.g., sales by product)  | Easy to read, ideal for comparison
Line Plot          | Displays trends over time with data points connected by lines  | Time-series data (e.g., stock prices over time)          | Great for trends and continuous data
Heatmap            | Uses color gradients to show data intensity in a 2D matrix     | Showing correlations or patterns in complex datasets     | Highlights patterns, good for large datasets

6. Describe the differences between supervised and unsupervised learning. (5 marks)

Supervised and unsupervised learning are two of the primary types of machine learning. They
differ in how the models are trained and the nature of the data used. Below are the key
differences between supervised and unsupervised learning:

1. Definition:

 Supervised Learning: In supervised learning, the model is trained on a labeled dataset,
where the input data is paired with the correct output (target labels). The goal is for the model
to learn a mapping from inputs to outputs so that it can predict the labels for new, unseen
data.
 Unsupervised Learning: In unsupervised learning, the model is trained on data that has no
labeled output. The goal is to find hidden patterns or intrinsic structures in the data, such as
grouping similar data points or reducing the dimensionality of the dataset.

2. Data Requirements:

 Supervised Learning: Requires a labeled dataset, meaning that each input in the training
data is associated with a corresponding output label. For example, a dataset of emails where
each email is labeled as "spam" or "not spam."
 Unsupervised Learning: Does not require labeled data. The model works with input data
that has no associated output or labels. For example, a dataset of customer purchase behaviors
without any pre-assigned labels or categories.
3. Goal:

 Supervised Learning: The goal is to predict the output for new data based on the patterns
learned from the labeled training data. Supervised learning focuses on making predictions
and classifications.
 Unsupervised Learning: The goal is to explore the structure of the data by grouping or
clustering similar data points, or by reducing the data's dimensions. It focuses on discovering
patterns, associations, and relationships within the data.

4. Output:

 Supervised Learning: The output is typically a prediction or a classification based on the
labels from the training data. For example:
o Classification: Identifying whether an email is spam or not.
o Regression: Predicting house prices based on features like size and location.
 Unsupervised Learning: The output is typically a grouping or transformation of the data.
Examples include:
o Clustering: Grouping similar customers based on purchase behavior.
o Dimensionality Reduction: Reducing the number of features in a dataset while retaining
essential patterns (e.g., PCA).

5. Examples of Algorithms:

 Supervised Learning:
o Classification Algorithms: Logistic Regression, Support Vector Machines (SVM), Decision
Trees, Naive Bayes.
o Regression Algorithms: Linear Regression, Ridge Regression, Lasso.
 Unsupervised Learning:
o Clustering Algorithms: K-means, Hierarchical Clustering, DBSCAN.
o Dimensionality Reduction: Principal Component Analysis (PCA), t-SNE, Autoencoders.

6. Evaluation:

 Supervised Learning: Model performance can be evaluated using metrics like accuracy,
precision, recall, F1-score (for classification), or mean squared error (MSE) (for
regression), which compare predicted outputs to actual labels in the test data.
 Unsupervised Learning: Evaluating the model is more challenging since there are no
predefined labels. Metrics like silhouette score, Davies-Bouldin index, or within-cluster
sum of squares are used for clustering tasks. For dimensionality reduction, reconstruction
error or explained variance might be used.

Summary of Differences:
Feature       | Supervised Learning                                          | Unsupervised Learning
Training Data | Labeled data (input-output pairs)                            | Unlabeled data (only inputs)
Goal          | Predict outputs based on input data                          | Discover patterns or structure in the data
Output        | Predictions or classifications                               | Groupings (clusters) or data transformations
Algorithms    | Logistic Regression, SVM, Decision Trees, Linear Regression  | K-means, DBSCAN, PCA, Hierarchical Clustering
Evaluation    | Based on accuracy, precision, recall, and error metrics      | Based on clustering quality or explained variance
7. Classification vs. Regression (5 marks)

Classification and regression are two key types of supervised learning problems, both aiming
to predict an output based on input data. However, they differ in the type of output they
predict and the problems they solve. Below is a brief explanation of each and their
differences:

Classification:

 Definition: Classification is a type of supervised learning problem where the goal is to
predict a discrete label or category for an input. The output is categorical, meaning the model
predicts one of a finite set of possible classes or labels.
 Example: Predicting whether an email is "spam" or "not spam", or predicting whether a
tumor is "malignant" or "benign".
 Output: The output is a class label or category, such as "positive" or "negative", "spam" or
"not spam".
 Algorithms Used: Common classification algorithms include:
o Logistic Regression
o Decision Trees
o Support Vector Machines (SVM)
o Naive Bayes
o K-Nearest Neighbors (KNN)

Regression:

 Definition: Regression is a type of supervised learning problem where the goal is to predict a
continuous numerical value for an input. The output is continuous, meaning the model
predicts a real-valued number.
 Example: Predicting the price of a house based on features like size, location, and number of
rooms, or predicting a person's weight based on their height and age.
 Output: The output is a continuous value, such as the price of a house, temperature, or age.
 Algorithms Used: Common regression algorithms include:
o Linear Regression
o Ridge/Lasso Regression
o Decision Trees for Regression
o Support Vector Regression (SVR)
o Random Forest Regression

Key Differences:

Feature            | Classification                                                               | Regression
Output             | Discrete class labels (categorical)                                          | Continuous numerical values
Example Problems   | Email classification (spam or not), disease diagnosis (malignant or benign) | Predicting house prices, predicting temperature
Evaluation Metrics | Accuracy, Precision, Recall, F1-score                                        | Mean Squared Error (MSE), Mean Absolute Error (MAE), R-squared
Algorithms         | Logistic Regression, Decision Trees, KNN, SVM                                | Linear Regression, Decision Trees for Regression, Random Forest Regression

9. What is Natural Language Processing (NLP)? Give examples. (5 marks)

Definition of Natural Language Processing (NLP):

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and
computational linguistics that focuses on enabling machines to understand, interpret, and
generate human language. The primary goal of NLP is to bridge the gap between human
communication and computer understanding, allowing computers to process and analyze
large amounts of natural language data.

NLP combines computational techniques from linguistics, machine learning, and computer
science to process and analyze human languages in a way that is both meaningful and useful.
NLP allows machines to handle text or speech data and perform tasks such as translation,
sentiment analysis, and language generation.

Key Tasks in NLP:

1. Tokenization: The process of splitting a text into smaller units, such as words, sentences, or
phrases. For example, turning the sentence "I love learning NLP" into tokens: ["I", "love",
"learning", "NLP"].
2. Part-of-Speech Tagging: Identifying the parts of speech (such as noun, verb, adjective) for
each word in a sentence. For example, in the sentence "The dog runs fast," the words "dog"
and "runs" would be tagged as a noun and verb, respectively.
3. Named Entity Recognition (NER): Identifying proper names or specific entities like people,
organizations, dates, and locations in text. For example, in the sentence "Apple is
headquartered in Cupertino," "Apple" is recognized as an organization and "Cupertino" as a
location.
4. Sentiment Analysis: Determining the sentiment or emotion expressed in a piece of text,
whether it's positive, negative, or neutral. For example, analyzing the sentence "I love this
movie!" would classify the sentiment as positive.
5. Machine Translation: Translating text or speech from one language to another. For
example, translating the sentence "Bonjour tout le monde" (French) into "Hello everyone"
(English).
6. Text Summarization: Creating a concise summary of a larger body of text. For example,
summarizing a long article into a few sentences that capture the main ideas.
7. Speech Recognition: Converting spoken language into written text. For example, voice
assistants like Siri and Google Assistant use speech recognition to convert voice commands
into text.

Examples of NLP Applications:

1. Chatbots and Virtual Assistants:


o Example: Siri, Alexa, and Google Assistant use NLP to understand spoken commands,
process them, and respond with appropriate actions or information.
o Function: These virtual assistants are capable of performing tasks such as setting reminders,
playing music, or providing weather forecasts by understanding natural language commands.
2. Machine Translation:
o Example: Google Translate uses NLP to translate text or speech from one language to
another.
o Function: Users can input a sentence in English, and Google Translate will output the
translation in multiple languages, such as French or Spanish.
3. Sentiment Analysis:
o Example: Companies use NLP tools to analyze customer feedback, reviews, and social
media posts to gauge public sentiment about products or services.
o Function: NLP algorithms classify text as expressing positive, negative, or neutral
sentiments, helping businesses make data-driven decisions.
4. Text Classification:
o Example: Email filters use NLP to classify emails as spam or not spam based on the content
of the email.
o Function: By analyzing the words and phrases in the email, NLP models classify whether the
message is legitimate or a promotional/spam message.
5. Named Entity Recognition (NER):
o Example: In news articles, NLP systems can identify names of people, locations, dates, and
organizations. For example, "Elon Musk is the CEO of Tesla" would recognize "Elon Musk"
as a person and "Tesla" as an organization.
o Function: This helps in information extraction and understanding the key elements within
text.
6. Text Summarization:
o Example: News aggregators like Google News use NLP to automatically generate
summaries of articles and news stories.
o Function: The system can create a summary that captures the main points of an article,
making it easier for readers to grasp the key content quickly.
7. Speech-to-Text (Transcription):
o Example: Automatic transcription services like Otter.ai convert spoken words into text.
o Function: This is useful for transcribing meetings, lectures, or podcasts into text format,
making the content easier to search and analyze.
10. Explain the steps involved in building a machine learning model. (5 marks)

1. Problem Definition

You first need to understand and define what problem you are trying to solve. Is it to predict
something (like house prices) or classify something (like whether an email is spam)?

2. Data Collection

Collect the data that will help solve your problem. This data can come from many sources,
like databases, websites, or files.

3. Data Preprocessing

Clean the data by fixing any errors or missing values. You also need to make sure it's in a
format that the machine can understand. This step may include converting text to numbers or
normalizing values.

4. Data Splitting

Split your data into two parts:

 Training data: Used to teach the model.


 Testing data: Used to check how well the model performs after it has learned.

5. Model Selection

Choose a machine learning algorithm (like a decision tree, linear regression, etc.) that is
suitable for the problem you're solving.

6. Model Training

Use the training data to "train" the model. This is where the model learns patterns and
relationships in the data.

7. Model Evaluation

Check how well the model is doing by using the testing data. Measure its performance with
metrics like accuracy, precision, or error rate.

8. Hyperparameter Tuning

Adjust the settings (called hyperparameters) of the model to make it perform better. You can
test different values for these settings to find the best one.

9. Model Testing

Test the model on new, unseen data (often called a "test set") to make sure it's not just
memorizing the training data but can generalize to new situations.
10. Model Deployment

Once the model works well, you can deploy it so it can start making predictions on new, real-
world data (e.g., recommending products on an e-commerce website).

11. Model Monitoring

After deployment, keep an eye on the model's performance. If it starts to perform poorly, you
might need to retrain it or make adjustments.
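
A minimal end-to-end sketch with scikit-learn, using the built-in Iris dataset as stand-in data, that covers splitting, training, and evaluating a model:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Collect/prepare the data and split it into training and testing sets
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Choose an algorithm and train it on the training data
    model = DecisionTreeClassifier(max_depth=3, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate on data the model has never seen
    y_pred = model.predict(X_test)
    print("Test accuracy:", accuracy_score(y_test, y_pred))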

10 Marks Questions

1. Elaborate on the applications of data science in business decision-making. (10 marks)

Data science plays a powerful role in helping businesses make better, faster, and smarter
decisions. By analyzing large volumes of data, businesses can gain insights that were
previously hidden and use them to improve operations, serve customers better, and increase
profits.

Here are some key applications of data science in business decision-making:

1. Customer Insights and Personalization

 Use: Businesses analyze customer behavior, preferences, and feedback to understand what
customers want.
 Example: E-commerce companies like Amazon use data science to recommend products
based on your browsing and purchase history.
 Benefit: Increases customer satisfaction and sales through personalized experiences.

2. Marketing and Advertising

 Use: Data science helps in targeting the right audience with the right message at the right
time.
 Example: Facebook and Google use data science to show personalized ads to users based on
their interests and online activity.
 Benefit: Reduces marketing costs and improves campaign effectiveness.

3. Sales Forecasting

 Use: Companies use historical sales data to predict future demand.


 Example: A retail store can predict how many items will sell next month and adjust
inventory accordingly.
 Benefit: Helps in better planning, stocking, and avoiding overproduction or shortages.
4. Risk Management and Fraud Detection

 Use: Data science identifies unusual patterns or behavior that might indicate fraud or risk.
 Example: Banks use data science to detect suspicious transactions and prevent credit card
fraud.
 Benefit: Reduces financial losses and increases trust in services.

5. Product Development

 Use: Businesses analyze customer feedback and market trends to design or improve products.
 Example: Netflix uses viewing data to decide what types of shows or movies to produce
next.
 Benefit: Ensures products meet customer needs and succeed in the market.

6. Operational Efficiency

 Use: Data science helps businesses streamline processes and reduce costs.
 Example: Manufacturing companies use data to monitor machine performance and predict
maintenance needs.
 Benefit: Reduces downtime and increases productivity.

7. Pricing Strategy

 Use: Companies use pricing data, customer behavior, and market trends to decide the best
price for products.
 Example: Airlines and ride-sharing services like Uber change prices based on demand using
data science models.
 Benefit: Maximizes revenue without losing customers.

8. Human Resources and Talent Management

 Use: Data science helps HR teams to find the best candidates, predict employee turnover, and
improve retention.
 Example: Companies analyze employee performance data to plan promotions and training.
 Benefit: Builds stronger teams and reduces hiring costs.

9. Supply Chain and Logistics

 Use: Businesses use data to optimize delivery routes, manage inventory, and reduce shipping
time.
 Example: Companies like Amazon and FedEx use data science for fast, efficient deliveries.
 Benefit: Improves service and reduces operational costs.

10. Strategic Business Planning

 Use: Executives use data-driven insights to make long-term decisions about entering new
markets, launching products, or changing direction.
 Example: A retail brand might use market analysis to decide whether to expand into a new
country.
 Benefit: Increases chances of success and reduces risks.
2. Describe in detail the process of exploratory data analysis using Python. (10 marks)

3. Discuss the role of statistical techniques in data analytics with examples. (10 marks)

Statistical techniques are the foundation of data analytics. They help transform raw data
into meaningful insights by summarizing, analyzing, and interpreting it. These methods allow
businesses, researchers, and data scientists to understand patterns, relationships, trends, and
variability in data.

Key Roles of Statistical Techniques in Data Analytics:

1. Data Summarization

Role:
Statistics help summarize large data sets using measures like:

 Mean (average)
 Median (middle value)
 Mode (most frequent value)
 Standard deviation (spread or variability of data)

Example:
An e-commerce company uses average purchase value to understand customer spending
habits.

2. Data Distribution and Probability

Role:
Understanding how data is spread using distributions (e.g., normal distribution) helps predict
future outcomes.

Example:
In quality control, factories use probability distributions to model product defect rates and
plan inspections.

3. Hypothesis Testing

Role:
Used to make decisions or inferences about a population from a sample. Helps test
assumptions or claims using p-values and confidence intervals.

Example:
A company wants to test if a new marketing campaign increases sales. Hypothesis testing can
determine if observed sales growth is statistically significant.
4. Regression Analysis

Role:
Used to understand relationships between variables. Helps predict one variable based on
others.

Example:
A real estate company uses linear regression to predict house prices based on size, location,
and number of bedrooms.
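
A short sketch, assuming statsmodels is installed (made-up house data), of fitting an ordinary least squares regression:

    import numpy as np
    import statsmodels.api as sm

    # Made-up data: price depends roughly on size (sq. m) and number of bedrooms
    size = np.array([50, 65, 80, 95, 110, 125], dtype=float)
    bedrooms = np.array([1, 2, 2, 3, 3, 4], dtype=float)
    price = np.array([110, 150, 175, 210, 240, 275], dtype=float)

    X = sm.add_constant(np.column_stack([size, bedrooms]))  # add an intercept term
    model = sm.OLS(price, X).fit()
    print(model.params)   # intercept and coefficients
    print(model.pvalues)  # significance of each predictor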

5. Classification and Grouping

Role:
Statistics support decision trees and logistic regression, which classify data into categories.

Example:
A bank uses logistic regression to predict whether a customer is likely to default on a loan
(yes or no).

6. Correlation Analysis

Role:
Helps measure how strongly two variables are related (positive or negative correlation).

Example:
A health researcher checks if there’s a correlation between exercise hours and weight loss.

7. Outlier Detection

Role:
Outliers (extreme values) can distort analysis. Statistical methods like z-scores or IQR are
used to detect and handle them.

Example:
A credit card company flags transactions that are far from a customer’s normal spending
pattern.

8. Sampling Techniques

Role:
Instead of analyzing the entire population, data analysts use random samples to make
estimates.

Example:
Survey companies sample a portion of the population to predict election results.
9. Time Series Analysis

Role:
Used to analyze data collected over time (e.g., monthly sales), helping identify trends and
seasonal patterns.

Example:
A retailer uses time series analysis to forecast sales during holidays.

10. Bayesian Analysis

Role:
Incorporates prior knowledge into current data analysis. Useful when data is limited.

Example:
In spam detection, Bayesian methods calculate the probability that an email is spam based on
certain keywords.

4. Compare and contrast supervised and unsupervised learning with suitable examples. (10 marks)

Aspect            | Supervised Learning                                                  | Unsupervised Learning
Definition        | The model is trained on labeled data (input with known output).      | The model is trained on unlabeled data (no predefined output).
Goal              | To predict outcomes or classify data based on past learning.         | To find patterns, groups, or structure in the data.
Output Known?     | Yes, outputs (target values) are known during training.              | No, the model must find structure on its own.
Examples of Tasks | Classification (spam or not spam), Regression (predict price)        | Clustering, Association (e.g., market basket analysis)
Algorithms Used   | Linear Regression, Logistic Regression, Decision Trees               | K-Means Clustering, Hierarchical Clustering, PCA
Example           | Predicting whether a customer will buy or not based on age, income   | Grouping customers into segments based on behavior
Human Involvement | More, since labels are required for training.                        | Less, as it explores the data without predefined labels.
Evaluation        | Accuracy, precision, recall, etc.                                    | No direct accuracy metric; evaluated using silhouette score, etc.

✅ Example of Supervised Learning:

Loan Approval Prediction


A bank uses customer data (income, age, job type) along with known outcomes (loan
approved or not) to train a model that predicts loan approval for new applicants.

✅ Example of Unsupervised Learning:

Customer Segmentation
An e-commerce company groups its users into clusters based on purchase history and
browsing behavior, without knowing their exact categories — useful for personalized
marketing.

5. Discuss the process of sentiment analysis using NLP techniques. (10 marks)

Sentiment Analysis Using NLP Techniques

Definition:
Sentiment Analysis (also called Opinion Mining) is a Natural Language Processing (NLP)
technique used to determine whether a piece of text expresses a positive, negative, or
neutral sentiment.

It is widely used in marketing, customer service, product reviews, and social media
analysis.

Process of Sentiment Analysis:


1. Data Collection

 Gather text data from sources like social media, product reviews, surveys, etc.

2. Text Preprocessing

Clean and prepare the text using NLP techniques:

 Tokenization – Split text into words or sentences.


 Lowercasing – Convert all text to lowercase.
 Removing stopwords – Eliminate common words like “the”, “is”, “and”.
 Stemming/Lemmatization – Reduce words to their root form (e.g., "running" → "run").
 Punctuation & noise removal – Clean unwanted symbols or numbers.

3. Feature Extraction

Convert text into numerical format:

 Bag of Words (BoW)


 TF-IDF (Term Frequency-Inverse Document Frequency)
 Word Embeddings (like Word2Vec, GloVe)

4. Model Building

Train a machine learning model using labeled data:

 Algorithms used: Logistic Regression, Naive Bayes, SVM, or deep learning (LSTM,
BERT).
 The model learns to classify the sentiment based on features extracted.

5. Prediction and Evaluation

 Use the model to classify new text.


 Evaluate performance using metrics like accuracy, precision, recall, and F1-score.
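
A condensed sketch of this pipeline with scikit-learn (a tiny, made-up labeled set; a real model would need far more data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Tiny, made-up training set: 1 = positive, 0 = negative
    texts = ["I love this product", "Great service and fast delivery",
             "Terrible experience", "I hate the new update"]
    labels = [1, 1, 0, 0]

    # TF-IDF handles tokenization and weighting; logistic regression classifies
    model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
    model.fit(texts, labels)

    print(model.predict(["Fast delivery, I love it"]))     # expected: positive (1)
    print(model.predict(["Terrible service, I hate it"]))  # expected: negative (0)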

Example Use Case:

A company analyzes tweets about its brand. Based on the results:

 Positive feedback is used for marketing.


 Negative feedback helps improve products/services.

6. Explain how to evaluate machine learning models using the confusion matrix, precision,
recall, and F1 score. (10 marks)

Evaluating Machine Learning Models


To evaluate the performance of classification models (e.g., spam detection, churn prediction),
we use metrics like Confusion Matrix, Precision, Recall, and F1 Score.

1. Confusion Matrix

It is a 2x2 table for binary classification problems:

            | Predicted: Yes      | Predicted: No
Actual: Yes | True Positive (TP)  | False Negative (FN)
Actual: No  | False Positive (FP) | True Negative (TN)

It helps to visualize the performance of the model.

2. Precision

Precision = TP / (TP + FP)

 It answers: Of all the predicted positives, how many are actually positive?
 High precision = fewer false positives.
 Important in applications like fraud detection.

3. Recall

Recall = TP / (TP + FN)

 It answers: Of all actual positives, how many did we correctly identify?


 High recall = fewer false negatives.
 Crucial in cases like disease detection.

4. F1 Score

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

 It balances both precision and recall.


 Useful when you need a single metric for imbalanced datasets.
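
A short sketch with scikit-learn (hypothetical spam-filter labels) computing these metrics:

    from sklearn.metrics import (confusion_matrix, precision_score,
                                 recall_score, f1_score)

    # Hypothetical labels: 1 = spam, 0 = not spam
    y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)
    print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall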

Example Scenario:

For an email spam filter:


 TP: Spam correctly identified
 FP: Good email marked as spam
 FN: Spam not detected
 TN: Good email correctly passed

7. Describe a case study or real-world application where data science led to improved decision-
making or business outcomes. (10 marks)

Case Study: Netflix's Recommendation System

Background:
Netflix, a global leader in streaming media, relies heavily on data science to enhance user
experience and drive business outcomes. One of the most impactful applications of data
science at Netflix is its recommendation system, which personalizes content suggestions for
users.

Problem:

Netflix has millions of users worldwide, each with unique tastes and viewing patterns. The
challenge was to deliver relevant content to each user, keeping them engaged and reducing
churn (users leaving the platform). Without a proper recommendation system, users would
struggle to find content they enjoy, leading to lower user retention.

Data Science Solution:

1. Data Collection:
o Netflix collects vast amounts of data on user behavior, including:
 What movies or shows users watch
 How much time they spend watching specific content
 User ratings (thumbs up, thumbs down)
 User interactions (pause, rewind, skip)
 Search queries
2. Data Processing and Feature Engineering:
o The data is cleaned and transformed to create meaningful features, such as:
 Viewing history
 Genres or categories of watched content
 Time spent watching and engagement metrics
3. Recommendation Algorithms:
o Netflix uses a combination of Collaborative Filtering and Content-Based Filtering:
 Collaborative Filtering: This technique identifies patterns in user behavior and recommends
items based on what similar users have watched. For example, "Users who watched X also
watched Y."
 Content-Based Filtering: This method recommends content similar to what a user has
watched in the past, using attributes like genre, actors, or directors.
o Hybrid Models: Netflix combines these techniques to improve accuracy and overcome
limitations of each approach.
4. Machine Learning Models:
o Netflix uses machine learning algorithms, such as matrix factorization and deep learning,
to fine-tune its recommendations and adapt to changes in user preferences over time.

Business Outcomes:

1. Increased User Engagement:


o By offering personalized content recommendations, Netflix kept users engaged longer,
increasing the time spent on the platform.
o In fact, Netflix claims that over 80% of the content watched on its platform comes from
recommendations.
2. Improved Retention:
o The recommendation engine reduced churn by keeping users engaged with relevant content.
o Personalization enhanced the user experience, leading to higher subscription renewal rates.
3. Better Content Acquisition:
o Netflix used data-driven insights to acquire content that aligns with user preferences, making
its library more attractive.
o It also led to the creation of Netflix Originals, as data showed increasing demand for specific
genres or types of shows.
4. Optimized Marketing:
o Netflix tailored marketing campaigns by analyzing viewing patterns, enabling targeted ads
and personalized messaging, further boosting engagement.
