Machine Learning Study Sheets
• Application:
• Assumptions: predict the future based on the past
• Result: reliability / interpretation
Supervised
Supervised learning is a machine learning approach that’s defined by its use of labeled datasets.
These datasets are designed to train or ‘supervise’ algorithms into classifying data or predicting
outcomes accurately.
- The model can measure its accuracy and learn over time
- Two types of supervised problems: classification and regression
Classification problems use an algorithm to assign test data into specific categories, such as separating
apples from oranges.
Regression models use an algorithm to predict numerical values based on different data points, such as
sales revenue projections for a given business.
Classification models: ▪ Logistic Regression ▪ K-Nearest Neighbours ▪ Support Vector Machines ▪ Kernel
SVM ▪ Naïve Bayes ▪ Decision Tree classifier ▪ Random Forests classifier ▪ Neural Network classifier
Regression models: • Simple Linear Regression • Multiple Linear Regression • Polynomial Regression •
Support Vector Regression • Decision Tree • Random Forests • Neural Network
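A minimal scikit-learn sketch of the two supervised problem types; the datasets and model choices here are illustrative, not prescribed by the notes:

```python
# Minimal sketch, assuming scikit-learn is available; toy datasets and models are illustrative.
from sklearn.datasets import load_iris, make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Classification: assign observations to categories (here: iris species).
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: predict a numerical value (here: a synthetic continuous target).
Xr, yr = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))
```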
Unsupervised
Unsupervised learning uses machine learning algorithms to analyse and cluster unlabeled data sets.
These algorithms discover hidden patterns in data without the need for human intervention (hence,
they are “unsupervised”).
• Clustering: groups unlabeled data based on their similarities or differences, e.g. k-means clustering.
Frequently used for market segmentation, image compression, etc.
• Association: uses different rules to find relationships between variables in a given dataset. These
methods are frequently used for market basket analysis and recommendation engines, along the lines of
“Customers Who Bought This Item Also Bought” recommendations.
• Dimensionality reduction: a technique used when the number of features (or instances) in a given
dataset is too high. It reduces the input data to a manageable size while preserving the data integrity.
It is often used in the data preprocessing stage, for example when autoencoders remove noise from
visual data to improve picture quality.
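As a simple illustration of the dimensionality-reduction idea above (using PCA here rather than an autoencoder, for brevity), a hedged scikit-learn sketch:

```python
# Minimal sketch, assuming scikit-learn; PCA stands in for an autoencoder as a simpler example.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 64 pixel features per image
pca = PCA(n_components=10).fit(X)        # keep only 10 components
X_reduced = pca.transform(X)             # reduced representation of the same data
print(X.shape, "->", X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())
```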
Key steps: defining your problem and variables, knowledge, tuning parameters, model evaluation.
Clustering
Supervised Vs Unsupervised
Predictive Model: 𝑦 = 𝑓(𝑋1, 𝑋2, … , 𝑋𝑝)
• Supervised learning is where you have input data (X) and output data (y) and you use an algorithm to
learn the mapping function 𝑓 from the input to the output.
• E.g. linear regression, random forest, neural network, support vector machine…
▪ Unsupervised learning is where you only have input data (X) and NO corresponding output data.
▪ The goal for unsupervised learning is to model the underlying structure/pattern in the data.
▪ Association: discover rules that describe large portions of your data, e.g. people that buy A also tend
to buy B.
▪ Example algorithms: K-means, Apriori, …
This distance sums up the straight-line distances along each dimension's axis.
▪ For categorical variables, convert them to numeric data or use similarity through overlap (how many
values overlap between two observations) → not recommended
1. First, choose the number of clusters k and pick the initial centroids (random or dissimilarity-based,
see below).
2. Next, assign each observation to the cluster whose centroid is closest to the observation in terms of
the Euclidean distance.
3. Then, re-calculate the centroid of each cluster by taking the multidimensional mean of all
observations currently in the cluster.
4. Repeat Steps 2 and 3 until the maximum number of iterations has been reached or the clusters no
longer change.
• In practice, it is computationally fast, but it may suffer from the 'curse of dimensionality' when the
number of dimensions is high.
• The result may depend on the initial clusters. These can be chosen either at random or based on
dissimilarity to achieve faster convergence.
• Variants of the k-means algorithm: k-medians algorithm, algorithm based on Manhattan distance,…
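A minimal k-means sketch, assuming scikit-learn and synthetic two-cluster data; n_init re-runs the algorithm from several random initialisations, which relates to the initialisation point above:

```python
# Minimal sketch, assuming scikit-learn; data and the choice k=2 are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# n_init controls how many random initialisations are tried, since the result
# can depend on the initial centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:", km.cluster_centers_)
print("first labels:", km.labels_[:5])
```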
Simple Linear Regression
Varying the values of 𝛽0 and 𝛽1 gives different distances of the observations to the line.
The goal of fitting a linear regression model is to estimate the two parameters, the slope 𝛽1 and the
intercept 𝛽0, that minimise the overall distance.
In general, the fitted values 𝑦𝑖^ differ from the actual observed values of the response (𝑦𝑖), and the
differences between the two are called residuals:
𝑒𝑖 = 𝑦𝑖 − 𝑦𝑖^
We identify as the (estimated) linear relationship between 𝑦 and 𝑥 the particular line that fits the data
best. The approach usually adopted to find the "best-fit" line is called the least squares method.
The idea of least squares: among all possible lines that pass through the points in the scatterplot, the
best one is the line that minimises the sum of squared residuals (which represents a measure of the
overall prediction error):
𝑅𝑆𝑆 = sum(𝑦𝑖 − 𝑦𝑖^)^2
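A minimal NumPy sketch of this idea on synthetic data, using the closed-form least-squares estimates for the slope and intercept:

```python
# Minimal sketch with synthetic data: closed-form least-squares fit of y = b0 + b1*x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)     # true intercept 2.0, true slope 1.5

# Least-squares estimates of the slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print("slope:", b1, "intercept:", b0)
print("sum of squared residuals:", np.sum(residuals ** 2))
```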
Hypothesis testing
The null hypothesis 𝐻0: 𝛽1 = 0 VS the alternative hypothesis H1: 𝛽1 ≠ 0
• P-values are used in hypothesis testing to help decide whether to reject the null hypothesis.
• The smaller the p-value, the stronger the evidence against the null hypothesis.
• The most common threshold is p < 0.05, which means that data at least as extreme as what we
observed would occur less than 5% of the time under the null hypothesis.
t and P>|t|:
They tell us something about whether or not the relationship between the predictor and the response is
significant.
A hypothesis test: the null hypothesis 𝑯𝟎: 𝜷𝟏 = 𝟎 VS the alternative hypothesis 𝐇𝟏: 𝜷𝟏 ≠ 𝟎
▪ To test the null hypothesis, we need to determine whether our estimate 𝛽1^ is sufficiently far from
zero that we can be confident that the true 𝛽1 is non-zero.
▪ To test this hypothesis, linear regression performs a t-test, and the outputs of this test are the
t-statistic (t) and the p-value (P>|t|).
▪ If the t-statistic is very large (in absolute value), the alternative hypothesis is likely to be true. A basic
rule is to reject the null hypothesis in favour of the alternative hypothesis when the p-value is smaller
than 0.05.
This means that the probability of observing such an extreme value for 𝛽1^ is less than 5%, given that
the null hypothesis is true. Hence, it is very unlikely that the null hypothesis is true, and we can say that
the predictor has a significant influence at the 5% significance level.
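A minimal statsmodels sketch (synthetic data assumed) showing where the coefficient, t and P>|t| values come from:

```python
# Minimal sketch, assuming statsmodels; the summary table reports coef, t and P>|t| per predictor.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)

X = sm.add_constant(x)                 # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())                 # includes coef, t, P>|t| and confidence intervals
print("p-value for the slope:", model.pvalues[1])
```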
Strictly speaking a 95% confidence interval means that if we were to take 100 different samples
(datasets) and compute a 95% confidence interval from each sample, then approximately 95 of the 100
confidence intervals will contain the ‘true’ value.
The R-squared (R²) is a measure of goodness-of-fit and shows us the explanatory power of our model.
R² = (𝑇𝑆𝑆 − 𝑅𝑆𝑆) / 𝑇𝑆𝑆 = 1 − 𝑅𝑆𝑆 / 𝑇𝑆𝑆
RSS (also written SSR) is the residual sum of squares, 𝑅𝑆𝑆 = sum(𝑦𝑖 − 𝑦𝑖^)^2, i.e. the variability that is
unexplained by the model; TSS is the total sum of squares, 𝑇𝑆𝑆 = sum(𝑦𝑖 − mean(𝑦))^2, i.e. the total
variability in the response.
The R² indicates how much of the variance in the response variable (Y) is explained by the model.
• The R² is a number between 0 and 1. The closer it is to 0, the less of the variability in the response the
model explains; the closer to 1, the more it explains.
• In simple linear regression, the R² equals the squared correlation between predictor and response.
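A short NumPy sketch (same synthetic data as above) computing R² as 1 − RSS/TSS and checking that it equals the squared correlation in simple linear regression:

```python
# Minimal sketch: R^2 = 1 - RSS/TSS for a fitted simple linear regression (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)          # unexplained variability
tss = np.sum((y - y.mean()) ** 2)       # total variability
print("R^2:", 1 - rss / tss)
print("squared correlation:", np.corrcoef(x, y)[0, 1] ** 2)  # equal in SLR
```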
• Model validation: split the data into training data (used to fit the model) and holdout data (used to
evaluate it).
Multiple Linear Regression
Multiple linear regression enables you to predict a continuous response using two or more continuous
or discrete predictors.
The basic equation that defines the multiple linear regression (MLR) model is
𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝 + 𝜀
• The coefficients 𝛽𝑗 (𝑗 = 1, … , 𝑝) are called the partial slopes of the response variable with respect to
the j-th predictor 𝑥𝑗.
Estimation of the 𝛽𝑗's is still performed using least squares, and all the quantities we introduced for the
simple linear regression model are used in multiple regression in a similar way.
Specifically, the principle for finding the optimal coefficient estimates remains the same as in simple
linear regression, namely minimising the residual sum of squares (RSS).
Once we have the coefficient estimates, we can make a prediction for a single instance 𝒙𝒊 = (𝑥𝑖1,
𝑥𝑖2, … , 𝑥𝑖𝑝) using the equation
𝑦𝑖^ = 𝛽0^ + 𝛽1^𝑥𝑖1 + 𝛽2^𝑥𝑖2 + ⋯ + 𝛽𝑝^𝑥𝑖𝑝
In multiple linear regression, the coefficient 𝛽𝑗^ represents the average effect of a one-unit increase in
𝑥𝑗 while keeping all other predictors fixed.
MLR: OLS
SLR: box office sales depend on advertising spending.
MLR: box office sales depend on economic growth and advertising spending: 𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + 𝜀
The MLR OLS solution chooses the plane that is as close as possible to the data points, i.e. it minimises
the sum of squared vertical distances between each observation and the plane.
Interpretation
The coefficients are different compared to SLR.
The coefficient of advertising says that, with a one-unit increase in spending, the visitor count would
increase by 1.54 on average, while keeping economic growth constant.
The OLS procedure and corresponding interpretation can easily be extended to 3, 4, or 𝑝 predictors.
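A hedged sketch of a two-predictor OLS fit; the predictors below are synthetic stand-ins for economic growth and advertising spending, so the coefficients will not match the 1.54 quoted above:

```python
# Minimal sketch, assuming statsmodels; the two predictors and the response are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
growth = rng.normal(2, 1, 200)                   # stand-in for economic growth
advertising = rng.uniform(0, 50, 200)            # stand-in for advertising spending
sales = 10 + 3 * growth + 1.5 * advertising + rng.normal(0, 2, 200)

X = sm.add_constant(np.column_stack([growth, advertising]))
fit = sm.OLS(sales, X).fit()
# Each coefficient is the average effect of a one-unit increase in that
# predictor, holding the other predictor fixed.
print(fit.params)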
Assumptions
𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝 + 𝜀
Assumption 1: linear regression assumes that there is a linear relationship between predictor and
response.
To test whether there is a linear relationship between a predictor and the response, you can make a
scatterplot of the response against each predictor. If the relationship is linear, the response will increase
or decrease linearly with the predictor. Another option is to make a residual plot, which depicts the
residuals against the fitted values. If the relationship is linear, the points should be randomly scattered
around the horizontal line at zero.
Assumption 2: Multiple linear regression assumes that there is NO multicollinearity in the data.
This means that the predictors cannot be too closely related to each other or, in other words, that the
correlation between the predictors is not too high. The easiest, and most effective, way to deal with
multicollinearity is to delete highly correlated variables.
1. Calculate the Pearson correlation matrix among all predictors. This matrix shows the correlation
between every pair of predictors; a rule of thumb is that the correlation between two predictors should
be smaller than 0.80.
2. A second way is to calculate the Variance Inflation Factor (VIF). The VIF for the j-th predictor in a
linear regression model with p predictors is defined as
𝑉𝐼𝐹𝑗 = 1 / (1 − 𝑅𝑗^2)
where 𝑅𝑗^2 is the R-squared value for the (auxiliary) regression of the j-th predictor versus all the
remaining ones.
The VIF provides a measure of the reduction in the precision of a coefficient estimate. VIF is a number
greater than or equal to 1.
• When it is equal to 1, it means that the predictor is not affected by any collinearity problem.
• The larger it is, the stronger the association of that predictor with all the other ones in the model.
• If the VIF of a predictor is larger than 10, we flag the predictor as affected by a severe collinearity
problem.
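A minimal statsmodels/pandas sketch computing a VIF per predictor on a synthetic dataset in which two predictors are deliberately correlated:

```python
# Minimal sketch, assuming statsmodels and pandas; the data are synthetic.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = 0.9 * df["x1"] + rng.normal(scale=0.3, size=200)   # strongly correlated with x1
df["x3"] = rng.normal(size=200)

X = sm.add_constant(df)                       # VIFs are computed on the design matrix
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, "VIF:", variance_inflation_factor(X.values, i))
```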
• The error terms are assumed to be independent; in other words, there is no autocorrelation between
the error terms. Independent means that the error term of a certain observation i does not say anything
about the error term of observation j.
• Autocorrelation can be tested with a scatterplot or with an autocorrelation test, e.g. the
Durbin-Watson test. The null hypothesis for this test is that there is no linear autocorrelation between
the error terms.
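A short statsmodels sketch (synthetic data) computing the Durbin-Watson statistic from OLS residuals:

```python
# Minimal sketch, assuming statsmodels; Durbin-Watson statistic on OLS residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 1 + 2 * x + rng.normal(0, 1, 100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
# Values near 2 suggest no first-order autocorrelation in the residuals;
# values towards 0 or 4 suggest positive or negative autocorrelation.
print("Durbin-Watson:", durbin_watson(fit.resid))
```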
Assumption 5: homoscedasticity
• If the error terms are homoscedastic, we see a chaotic scatterplot of the error terms with no real
pattern. We do not notice a changing spread of the error terms; instead, the spread is constant and we
have a constant variance.
• If there is no homoscedasticity (i.e. heteroscedasticity), we see a pattern in the scatterplot. Often
this pattern has a funnel shape, which represents a changing spread of the error terms.
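A hedged matplotlib sketch of a residuals-vs-fitted plot; the synthetic data below are deliberately heteroscedastic, so a funnel shape should appear:

```python
# Minimal sketch, assuming matplotlib and statsmodels; a residuals-vs-fitted plot.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1 + 2 * x + rng.normal(0, 1 + 0.5 * x)       # error spread grows with x (heteroscedastic)

fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()                                        # a funnel shape indicates heteroscedasticity
```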
Why feature selection?
Many Machine Learning problems involve thousands or even millions of features for each training
instance. Not only does this make training extremely slow, it can also make it much harder to find a
good solution. This problem is often referred to as the curse of dimensionality.
• Sparsity of data occurs when moving to higher dimensions, which leads to weak statistical significance.
– need more data
• The number of model parameters to be estimated and the training time increase as the number of
features increases. – computational burden
• Feature selection: selecting the most useful features to train on among existing features.
1. Filter methods
Find a single measure that relates each independent variable to the dependent variable, which can
be used to score the importance of the independent variable, e.g. correlation. The outcome can be
used for ranking. Advantages: computationally fast and model-free.
2. Wrapper methods
Generate a vast number of models with various subsets of independent variables to check which
subset gives the best performance. Advantages: tests various combinations of predictors;
interactions are taken into account. Drawback: model-specific and computationally expensive.
3. Embedded methods
Some algorithms have built-in feature selection thanks to their way of modelling, e.g. Random
Forest and Lasso regression.
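A short scikit-learn/pandas sketch contrasting a filter method (correlation ranking) with an embedded method (Lasso) on a synthetic dataset where only two features matter; feature names and the alpha value are illustrative:

```python
# Minimal sketch, assuming scikit-learn and pandas; data and parameter choices are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 3 * X["x0"] - 2 * X["x1"] + rng.normal(size=200)   # only x0 and x1 matter

# Filter method: rank features by absolute correlation with the response.
print(X.corrwith(pd.Series(y)).abs().sort_values(ascending=False))

# Embedded method: Lasso shrinks irrelevant coefficients towards exactly zero.
lasso = Lasso(alpha=0.1).fit(X, y)
print(dict(zip(X.columns, lasso.coef_)))
```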
Standardised coefficients: 𝛽𝑖∗ = 𝛽𝑖 (𝑠𝑖 / 𝑠𝑌), where 𝑠𝑖 and 𝑠𝑌 are the (estimated) standard deviations of
𝑋𝑖 and 𝑌, respectively. The 𝛽𝑖∗ indicate how many standard deviations the dependent variable will
change per standard deviation increase in the predictor variable. The coefficients 𝛽𝑖∗ are independent
of the involved variables' units of measurement (unitless).
Spearman rank correlation: 𝑟𝑠 = 1 − 6 sum(𝑑𝑖^2) / (𝑛(𝑛^2 − 1)), where 𝑑𝑖 = 𝑅(𝑥𝑖) − 𝑅(𝑦𝑖) is the
difference between the ranks 𝑅(𝑥𝑖) and 𝑅(𝑦𝑖) of each observation.
• Non-parametric
• Tests for monotonic (not necessarily linear) relationships
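A small SciPy sketch (synthetic data) contrasting Pearson and Spearman correlation on a monotonic but non-linear relationship:

```python
# Minimal sketch, assuming SciPy; Pearson measures linear, Spearman monotonic association.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 200)
y = np.exp(x) + rng.normal(0, 1, 200)      # monotonic but clearly non-linear

print("Pearson r:   ", stats.pearsonr(x, y)[0])
print("Spearman rho:", stats.spearmanr(x, y)[0])   # closer to 1 for monotonic data
```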
Correlation vs Causation
Correlation does not necessarily imply causation. One variable's influence might be overshadowed by
the others.