
Machine Learning with Python

Application considerations:

Assumptions: we predict the future based on the past

Data availability: you cannot build a model without data

Modelling technique: depends on the size of the dataset (big/small)

Result: reliability and interpretation

Supervised & Unsupervised learning
The main difference lies in the output:

- A supervised model uses labeled data to help predict outcomes


- An unsupervised model does not

Supervised
Supervised learning is a machine learning approach that’s defined by its use of labeled datasets.

These datasets are designed to train or ‘supervise’ algorithms into classifying data or predicting
outcomes accurately.

- The model can measure its accuracy and learn over time
- Two types of supervised problems: classification and regression
Classification problems use an algorithm to assign test data into specific categories, such as separating
apples from oranges.

Regression models use an algorithm to predict numerical values based on different data points, such as
sales revenue projections for a given business.

Classification models : ▪ Logistic Regression ▪ K-Nearest Neighbours ▪ Support Vector Machines ▪ Kernel
SVM ▪ Naïve Bayes ▪ Decision Tree classifier ▪ Random Forests classifier ▪ Neural Network classifier

Regression models: • Simple Linear Regression • Multiple Linear Regression • Polynomial Regression •
Support Vector Regression • Decision Tree • Random Forests • Neural Network

Unsupervised
Unsupervised learning uses machine learning algorithms to analyse and cluster unlabeled data sets.
These algorithms discover hidden patterns in data without the need for human intervention (hence,
they are “unsupervised”).

Unsupervised learning models are used for three main tasks:

• Clustering: grouping unlabeled data based on their similarities or differences, e.g. K-means clustering algorithms. Frequently used for market segmentation, image compression, etc.

• Association: uses different rules to find relationships between variables in a given dataset. These
methods are frequently used for market basket analysis and recommendation engines, along the lines of
“Customers Who Bought This Item Also Bought” recommendations.

• Dimensionality reduction: a learning technique used when the number of features (or instances) in a
given dataset is too high. It reduces the input data to a manageable size while also preserving the data
integrity. Often, this technique is used in the preprocessing data stage, such as when autoencoders
remove noise from visual data to improve picture quality.
Defining your problem and variables

Continuous variable: any value within an interval, real numbers ℝ

Discrete variable: a finite or countable number of possible outcomes, natural (counting) numbers

Output numerical: regression VS Output categorical: classification

The Modelling Process


A general data mining process
Raw data → (selection, preprocessing, transformation) → Data

Data → (data mining) → Patterns

Patterns → (evaluation, interpretation) → Knowledge

A general modelling process


Given Data

Model Training (estimating parameters)

Tuning parameters

Model evaluation

Model selection / Model comparison

Modelling process = training, validation and test


Descriptive analysis
Visualisation
Boxplots

Clustering
Supervised Vs Unsupervised
Predictive Model: 𝑦 = 𝑓(𝑋1, 𝑋2, …, 𝑋𝑝)

• The majority of practical machine learning uses supervised learning.

• Supervised learning is where you have input data (X) and output data (y) and you use an algorithm to
learn the mapping function 𝑓 from the input to the output.

• Classification when 𝑦 is categorical; Regression when 𝑦 is continuous

• E.g. linear regression, random forest, neural network, support vector machine…

▪ Unsupervised learning is where you only have input data (X) and NO corresponding output data.

▪ The goal for unsupervised learning is to model the underlying structure/pattern in the data.

▪ Clustering to discover the inherent groupings, e.g. customer segmentation

▪ Association to discover rules that describe large portions of your data, e.g. people that buy A also tend
to buy B.
▪ K-means, Apriori…

Distance functions and clustering


Clustering identifies groups of similar items within the dataset. We use distance measures, which quantify how far apart observations are, to create the clusters.

The Euclidean distance, for example, combines the distances along each dimension's axis into a single straight-line distance, while the Manhattan distance sums the distances along each axis.

▪ Clustering methods can only be applied to numerical variables.

▪ For categorical variables, convert them to numeric data or use similarity through overlap (how many values overlap between two observations) → not recommended
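As an illustration, a minimal Python sketch of the two distances mentioned above; the vectors a and b are made-up values, not data from these notes:

```python
# Two common distance measures between numeric observations.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of distances along each axis

print(euclidean, manhattan)
```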

K-means clustering algorithm


1. It starts initially by randomly selecting 𝑘 different observations to act as centroids for the 𝑘 clusters
we want to find.

2. Next, assign each observation to the cluster whose centroid is closest to the observation in terms of the Euclidean distance.

3. Then, re-calculate the centroid of each cluster by taking the multidimensional mean of all observations currently in that cluster.

4. Repeat Step 2 and Step 3 until the maximum number of iterations is reached, or the clusters no longer change.

We cannot know the number of clusters 𝑘 a priori ➔ try different values

• In practice, it is computationally fast, but may suffer from the 'curse of dimensionality' when the number of dimensions is high.

• The algorithm does not guarantee convergence to the global optimum.

• The result may depend on the initial clusters. These can either be chosen at random or based on dissimilarity to achieve faster convergence.

• Variants of the k-means algorithm: k-medians algorithm, algorithm based on Manhattan distance,…

How to measure the clustering quality? The elbow method

It uses the inertia, or within-cluster sum-of-squares criterion: the sum of squared distances of each observation to its closest centroid.
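A minimal sketch of the elbow method with scikit-learn's KMeans; the feature matrix X here is random placeholder data, not data from these notes:

```python
# Elbow method: fit k-means for several values of k and record the inertia
# (within-cluster sum of squares), then look for the "elbow" in the curve.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)   # placeholder data for illustration

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Plot k against inertia; the point where the decrease flattens is the "elbow".
```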
Regression

Simple linear regression


A univariate linear regression model finds the line, in the two-dimensional space created by the single independent variable and the dependent variable, that is closest to all data points.

Varying the values of 𝛽0 (intercept) and 𝛽1 (slope) yields different distances of the observations to the line.

The goal of fitting a linear regression model is to estimate the two parameters, the slope 𝛽1 and the intercept 𝛽0, that minimize the overall distance.

Since we estimate 𝛽0 and 𝛽1, we write the estimated regression line as ŷ = β̂0 + β̂1𝑥.

The values ŷ𝑖 of the response variable returned by the estimated regression line are called the predicted (or fitted) values.

In general, the ŷ𝑖 differ from the actual observed values of the response (𝑦𝑖), and the difference between the two is called the residual:

𝑒𝑖 = 𝑦𝑖 − ŷ𝑖

We identify as the (estimated) linear relationship between 𝑦 and 𝑥 the particular line that best fits the data. The approach usually adopted to find the "best-fit" line is called the least squares method.

The idea of least squares: among all possible lines that pass through the points in the scatterplot, the best one is the line that minimizes the sum of squared residuals (which represents a measure of the overall prediction error):

RSS = Σ𝑖 𝑒𝑖² = Σ𝑖 (𝑦𝑖 − ŷ𝑖)²
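A minimal sketch of fitting a simple linear regression by least squares with statsmodels; the arrays x and y are made-up values for illustration:

```python
# Least squares fit of y = beta0 + beta1 * x with statsmodels OLS.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # illustrative response

X = sm.add_constant(x)        # adds the column of 1s for the intercept beta0
model = sm.OLS(y, X).fit()    # minimises the sum of squared residuals

print(model.params)           # estimated intercept and slope
print(model.resid)            # residuals e_i = y_i - y_i_hat
```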
Hypothesis testing
The null hypothesis 𝐻0: 𝛽1 = 0 VS the alternative hypothesis H1: 𝛽1 ≠ 0

• Calculate some statistics and corresponding p-value

• P-values are used in hypothesis testing to help decide whether to reject the null hypothesis.

• The smaller the p-value, the more likely you are to reject the null hypothesis.

• The most common threshold is p < 0.05, which means that data this extreme would be expected to occur less than 5% of the time under the null hypothesis.

• If p < 0.05, we reject the Null hypothesis at the 5% significance level

t and P>|t|:

They tell us something about whether or not the relationship between the predictor and the response is
significant.

A hypothesis test: the null hypothesis 𝑯𝟎: 𝜷𝟏 = 𝟎 VS the alternative hypothesis 𝐇𝟏: 𝜷𝟏 ≠ 𝟎
▪ To test the null hypothesis, we need to determine whether our estimate β̂1 is sufficiently far from zero that we can be confident that 𝛽1 is non-zero.

▪ To test this hypothesis, linear regression performs a t-test, and the outputs of this test are the t-statistic (t) and the p-value (P>|t|).

▪ If the t-statistic is very large, the alternative hypothesis is likely to be true. A basic rule to reject the null
hypothesis in favour of the alternative hypothesis is when the p-value is smaller than 0.05.

This would mean that the probability of observing such an extreme value for β̂1 is less than 5%, given that the null hypothesis is true. Hence, it is very unlikely that the null hypothesis is true, and we can say that the predictor has a significant influence at the 5% significance level.

95% Confidence Interval [0.025 and 0.975]:

Strictly speaking a 95% confidence interval means that if we were to take 100 different samples
(datasets) and compute a 95% confidence interval from each sample, then approximately 95 of the 100
confidence intervals will contain the ‘true’ value.
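Continuing the statsmodels sketch from the least squares section above (so `model` is the fitted OLS result, an assumption rather than part of the original notes), the t-statistics, p-values and 95% confidence intervals can be read off directly:

```python
# Hypothesis tests and confidence intervals from the fitted OLS result.
print(model.summary())             # full table with t, P>|t| and [0.025, 0.975]
print(model.tvalues)               # t-statistics
print(model.pvalues)               # p-values for H0: beta_j = 0
print(model.conf_int(alpha=0.05))  # 95% confidence intervals
```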

The R-squared (R²) is a measure of goodness-of-fit and shows us the explanatory power of our model:

R² = (TSS − RSS) / TSS = 1 − RSS / TSS

RSS is the residual sum of squares, RSS = Σ𝑖 (𝑦𝑖 − ŷ𝑖)², the variability that is left unexplained by the model.

TSS is the total sum of squares, TSS = Σ𝑖 (𝑦𝑖 − ȳ)², the variability in the response before the model is built.

The R^2 indicates how much variance in the response variable (Y) is explained by the model.

• The R^2 is a number between 0 and 1. The closer to 0 (1) means that the model does not explain
(explains) a lot of the variability in the response.

• A low value for R² → the relationship might not be linear.

• In simple linear regression, the R^2 equals the squared correlation between predictor and response
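A small sketch computing R² directly from RSS and TSS; y and y_hat are assumed to be arrays of observed and fitted values (statsmodels also exposes this as model.rsquared):

```python
# R^2 = 1 - RSS / TSS, assuming y (observed) and y_hat (fitted) NumPy arrays.
import numpy as np

rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
r_squared = 1 - rss / tss
```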

Holdout data
Training data:

• Used to train the model (estimate its parameters)

• The other part of the data is set aside and not used to build the model

Test data (external, holdout data):

• Evaluate the performance of the model

• Compare with other models

• Model validation
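A minimal sketch of a train/test (holdout) split with scikit-learn; X and y are assumed to be the feature matrix and the response, and the 70/30 split is illustrative:

```python
# Hold out 30% of the data for testing; train only on the remaining 70%.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
# Fit the model on (X_train, y_train); evaluate and compare models on (X_test, y_test).
```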
Multiple Linear Regression
Multiple linear regression enables you to predict a continuous response using two or more continuous or discrete predictors.

Recall the simple linear regression model: 𝑌 = 𝛽0 + 𝛽1𝑋 + 𝜖

The basic equation that defines the multiple linear regression (MLR) model is

𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝 + 𝜖

• The structural part of the model 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝

• The error part of the model 𝜖

• The coefficients 𝛽𝑗 (j = 1, …, p) are called the partial slopes of the response variable with respect to the j-th predictor 𝑥𝑗.

Estimation of 𝛽𝑗 ’s is still performed using least squares and all the quantities we introduced in simple
linear regression model are used in multiple regression in a similar way.

Specifically, the principle for finding the optimal coefficient estimates remains the same as in simple linear regression, namely minimising the residual sum of squares (RSS):

RSS = Σ𝑖 (𝑦𝑖 − ŷ𝑖)²

Once we have the coefficient estimates, we can make a prediction for a single instance 𝒙𝒊 = (𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑝) using the equation

ŷ𝑖 = β̂0 + β̂1𝑥𝑖1 + β̂2𝑥𝑖2 + ⋯ + β̂𝑝𝑥𝑖𝑝

In multiple linear regression, the coefficient β̂𝑗 represents the average effect on the response of a one-unit increase in 𝑥𝑗 while keeping all other predictors fixed.
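A minimal sketch of a multiple linear regression with statsmodels; the DataFrame df and the column names 'y', 'x1', 'x2' are made-up, illustrative values:

```python
# MLR by OLS: Y = beta0 + beta1*X1 + beta2*X2 + epsilon.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'x2': [2.0, 1.0, 4.0, 3.0, 6.0],
                   'y':  [5.1, 6.8, 11.2, 12.1, 17.0]})

X = sm.add_constant(df[['x1', 'x2']])   # intercept + two predictors
mlr = sm.OLS(df['y'], X).fit()          # least squares estimates of the beta_j
print(mlr.params)                       # beta0_hat, beta1_hat, beta2_hat

# Prediction for a single instance x_i = (x_i1, x_i2):
b0, b1, b2 = mlr.params
y_hat_i = b0 + b1 * 1.5 + b2 * 2.5      # illustrative values for x_i1 and x_i2
```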
MLR: OLS
SLR: box office sales depend on advertising spending

MLR: box office sales depend on economic growth and advertising spending: 𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + 𝜖

The MLR OLS solution chooses the plane that is as close as possible to the data points, i.e. it minimises the squared vertical distances between each observation and the plane.

Interpretation
The coefficients are different compared to SLR.

The coefficient of advertising is saying that, with one unit increase in spending, the visitor count would
increase by 1.54 on average, while keeping economic growth constant.

The OLS procedure and the corresponding interpretation can easily be extended to 3, 4, or 𝑝 predictors.

Assumptions
𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝 + 𝜖

Assumption 1: linear regression assumes that there is a linear relationship between predictor and
response.
To test whether there is a linear relationship between a predictor and the response, you can make a scatterplot of each predictor against the response. If the relationship is linear, the outcome variable will increase or decrease linearly with the predictor. Another option is to make a residual plot, which depicts the residuals against the fitted values. If the relationship is linear, the points should be randomly scattered around the horizontal line.

Assumption 2: Multiple linear regression assumes that there is NO multicollinearity in the data.

This means that the predictors cannot be too closely related to each other or, in other words, that the correlation between the predictors is not too high. The easiest, and most effective, way to deal with multicollinearity is to delete highly correlated variables.

Two ways to check for multicollinearity:

1. calculate the Pearson correlation matrix among all predictors. This correlation matrix shows you the
correlation between all predictors and a rule of thumb is that the correlation between two predictors
should be smaller than 0.80.

2. A second way is to calculate the Variance Inflation Factor (VIF). The VIF for the j-th predictor in a
linear regression model with p predictors is defined as

𝑉𝐼𝐹𝑗 = 1 / (1 − 𝑅𝑗^2)

where 𝑅𝑗^2 is the R-squared value for the (auxiliary) regression of the j-th predictor versus all the
remaining ones.

The VIF provides a measure of the reduction in the precision of a coefficient estimate. VIF is a number
greater than or equal to 1.

• When it is equal to 1, it means that the predictor is not affected by any collinearity problem.

• The larger it is, the stronger the association of that predictor with all the other ones in the model.

• If the VIF of a predictor is larger than 10, we flag the predictor as affected by a severe collinearity problem.
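A minimal sketch of computing VIFs with statsmodels, assuming X is a DataFrame containing only the predictor columns (e.g. df[['x1', 'x2']] from the earlier sketch):

```python
# VIF_j = 1 / (1 - R_j^2), computed for each predictor (constant excluded).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)                 # design matrix with intercept
vifs = pd.Series(
    [variance_inflation_factor(X_const.values, i)
     for i in range(1, X_const.shape[1])],   # skip the constant column
    index=X.columns)
print(vifs)                                  # flag predictors with VIF > 10
```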

Assumption 3: Multivariate normality

Multiple regression assumes that the residuals are normally distributed.

• Tests for Normality

- Q-Q-plots or simple histograms.


- Goodness-of-fit test, e.g. a Kolmogorov-Smirnov test on the residuals. The null hypothesis of the K-S test is that the data are normally distributed.
• If there is no multivariate normality, then a transformation of the variables (e.g., a logarithmic
transformation) is advised.
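A minimal sketch of these normality checks, assuming model is a fitted statsmodels OLS result (e.g. from the earlier sketch):

```python
# Q-Q plot and Kolmogorov-Smirnov test on the residuals of a fitted OLS model.
import scipy.stats as stats
import statsmodels.api as sm

sm.qqplot(model.resid, line='45', fit=True)   # points should follow the line

# K-S test against a normal distribution with the residuals' mean and std;
# a small p-value leads to rejecting normality of the residuals.
ks_stat, ks_pvalue = stats.kstest(
    model.resid, 'norm', args=(model.resid.mean(), model.resid.std()))
print(ks_stat, ks_pvalue)
```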

Assumption 4: residuals are independent

• In other words, there is no autocorrelation between the error terms. Independent means that the
error term of a certain observation i does not say anything about the error term of observation j.

• Autocorrelation can be tested with a scatterplot or with an autocorrelation test, e.g. the Durbin-Watson test. The null hypothesis for this test is that there is no linear autocorrelation between the error terms.
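A minimal sketch of the Durbin-Watson statistic, again assuming model is a fitted statsmodels OLS result:

```python
# Durbin-Watson statistic on the residuals; values near 2 suggest
# no first-order autocorrelation between the error terms.
from statsmodels.stats.stattools import durbin_watson

print(durbin_watson(model.resid))
```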

Assumption 5: homoscedasticity

Constant variance of the error terms across all observations.

• Can be identified with a residual plot (Residual vs Fitted)

• If the error terms are homoscedastic, we see a chaotic scatterplot of the error terms with no real relationship: the spread of the error terms does not change, i.e. the variance is constant.

• If there is no homoscedasticity (i.e. heteroscedasticity), we see a pattern in the scatterplot. Often this pattern has a funnel shape, which represents a changing spread of the error terms.
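A minimal sketch of the residual-vs-fitted plot used for this check, assuming model is a fitted statsmodels OLS result:

```python
# Residuals vs fitted values: a constant spread with no pattern suggests
# homoscedasticity; a funnel shape suggests heteroscedasticity.
import matplotlib.pyplot as plt

plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color='grey', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```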
Why feature selection?
Many Machine Learning problems involve thousands or even millions of features for each training
instance. Not only does this make training extremely slow, it can also make it much harder to find a
good solution. This problem is often referred to as the curse of dimensionality.

• Sparsity of data occurs when moving to higher dimensions, which leads to weak statistical significance.
– need more data

• The number of model parameters to be estimated and the training time increase as the number of features increases. – computational burden

Garbage in, garbage out


Facing a large dataset, we want to reduce the dimensionality of the training dataset and come up with a good set of features to train on. This process, called feature engineering, involves:

• Feature selection: selecting the most useful features to train on among existing features.

• Feature extraction: combining existing features to produce a more useful one.

• Creating new features by gathering new data

Feature Selection Techniques


Three types of FS techniques

1. Filter methods

Find a single measure that relates each independent variable to the dependent variable and can be used to score the importance of that variable, e.g. correlation. The outcome can be used for ranking. Computationally fast and model-free.

2. Wrapper methods
Generate a large number of models with various subsets of independent variables and check which subset gives the best performance. Advantages: various combinations of predictors are tested and interactions are taken into account. Drawback: model-specific and computationally expensive.

3. Embedded methods

Some algorithms have built-in feature selection thanks to their way of modelling, e.g. Random
Forest and Lasso regression.

Embedded method - standardized coefficients


𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝 + 𝜖

Standardized (regression) coefficients (beta coefficients or beta weights):

𝛽𝑖* = β̂𝑖 · 𝑠𝑖 / 𝑠𝑌

where 𝑠𝑖 and 𝑠𝑌 are the (estimated) standard deviations of 𝑋𝑖 and 𝑌, respectively. The 𝛽𝑖* refer to how many standard deviations the dependent variable will change per standard deviation increase in the predictor variable. The coefficients 𝛽𝑖* are independent of the involved variables' units of measurement (unitless).
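A minimal sketch of obtaining standardized (beta) coefficients by regressing standardized variables, assuming df is a DataFrame whose response column is named 'y' (a hypothetical name) and whose remaining columns are the predictors:

```python
# Standardize all variables, then the OLS coefficients are the beta weights
# beta_i* = beta_i_hat * s_i / s_Y.
import statsmodels.api as sm

z = (df - df.mean()) / df.std()                 # z-score every column
Xz = sm.add_constant(z.drop(columns='y'))
beta_star = sm.OLS(z['y'], Xz).fit().params     # standardized coefficients
print(beta_star)
```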

Filter Method – Correlation Coefficient


The most widely known is the correlation coefficient, more precisely the Pearson correlation coefficient 𝑟:

𝑟 = Σ𝑖 (𝑥𝑖 − x̄)(𝑦𝑖 − ȳ) / √( Σ𝑖 (𝑥𝑖 − x̄)² · Σ𝑖 (𝑦𝑖 − ȳ)² )

𝑟 ∈ [−1, 1]: 1 (−1) for perfect positive (negative) correlation

It captures the strength of the linear relationship between 𝑥 and 𝑦

Filter Method – Rank correlation coefficient


The Spearman rank correlation coefficient 𝑅:

𝑅 = 1 − 6 Σ𝑖 𝑑𝑖² / (𝑛(𝑛² − 1))

where 𝑑𝑖 = 𝑅(𝑥𝑖) − 𝑅(𝑦𝑖) is the difference between the ranks 𝑅(𝑥𝑖) and 𝑅(𝑦𝑖) of each observation.

• Non-parametric

• Tests for monotonic (not necessarily linear) relationships

• Suitable for both ordinal and continuous variables
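A minimal sketch of both filter scores with SciPy; the arrays x and y are made-up, illustrative values:

```python
# Pearson r (linear relationship) and Spearman R (rank-based, monotonic).
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2])

r, p_r = pearsonr(x, y)
rho, p_rho = spearmanr(x, y)
print(r, rho)
```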

Correlation vs Causation
Correlation does not necessarily imply causation. One variable's influence might be overshadowed by the others.
