Lecture 13 - Unsupervised Learning, PCA ICA


Unsupervised Learning
Agenda
1. Unsupervised learning
2. Clustering
a) K-means clustering
b) Hierarchical clustering
c) Mean shift clustering
d) DBSCAN
e) Clustering metrics
3. Anomaly Detection
a) One-class support vector machine
b) Isolation forest
4. Dimensionality reduction
a) Principal component analysis (PCA)
b) Independent component analysis (ICA)
Unsupervised Learning

Unsupervised learning:
- No labels
- Find hidden structure in the data
- Main algorithms: clustering, dimensionality reduction, anomaly detection

Supervised learning:
- Labels are given
- Learn a mapping from X to y, based on the training dataset
- Main algorithms: classification, regression
Training and Validation

Historical data → data cleaning and preparation → ML algorithms → patterns and rules → validation on real data

Inference

Historical data → data cleaning and preparation → exploitation of the learned algorithms and rules → real data grouping
Clustering
Main task: determine distinct groups (clusters), putting similar data into one group.
Clustering should satisfy the following conditions:
- points inside a cluster should be similar - high intra-cluster similarity;
- points from different clusters should be far away from each other - low inter-cluster similarity.

What is the natural grouping of the data points?

What is the best way to group the points?
What is the number of clusters? Can it be determined automatically?
How to avoid trivial clusters?
Should very big or very small clusters be allowed?
What metrics are used for clustering evaluation?
Clustering methods taxonomy
Distance between points

One way to find similar points is to define a distance function between points.
The following metrics are often used (a short sketch follows the list):

- Euclidean distance
- Manhattan distance
- Minkowski distance
- Jaccard similarity index (for categorical data)
- Cosine similarity
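
As an illustration, the distances above can be computed with NumPy/SciPy; the two points and the categorical sets below are made-up values for the example.

```python
import numpy as np
from scipy.spatial import distance

# Two illustrative points (made-up values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))     # sqrt(sum((a - b)^2))
print(distance.cityblock(a, b))     # Manhattan: sum(|a - b|)
print(distance.minkowski(a, b, p=3))
print(distance.cosine(a, b))        # 1 - cosine similarity

# Jaccard similarity index for two sets of categorical values
s1, s2 = {"red", "green", "blue"}, {"green", "blue", "yellow"}
print(len(s1 & s2) / len(s1 | s2))
```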
K-means
1. Define the number of clusters k.
2. Randomly initialize k centroids.
3. Calculate the Euclidean distance between the centroids and the data points.
4. Group the data points into clusters around the centroids.
5. Recalculate the centroids: let each centroid be the mean of the data points in the corresponding cluster.
6. Repeat steps 3-5 until convergence (see the sketch below).
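
A minimal sketch of running k-means with scikit-learn; the synthetic blobs and the parameter values are chosen only for illustration, and steps 2-6 happen inside fit.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)      # iterative assignment/update loop
print(kmeans.cluster_centers_)      # final centroids
print(kmeans.inertia_)              # sum of squared distances to centroids
```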
K-means problems
- the number of clusters must be known a priori
- the method is not robust to clusters of different size, density and shape
- the method is not robust to outliers
- it scales poorly to very large or high-dimensional datasets
Hierarchical clustering

Divisive (top-down):
1. Data is split into two parts.
2. The best splitting is determined.
3. Repeat steps 1-2 until a termination condition is met.

Agglomerative (bottom-up):
1. Each point is considered a separate cluster.
2. The closest points/clusters are merged.
3. Repeat steps 1-2 until a termination condition is met.

Termination conditions:
❑ a predefined number of clusters is reached;
❑ all points are merged into a single cluster, or each point forms a cluster;
❑ the distance between the closest clusters is bigger than a predefined threshold;
❑ the maximum distance between any partitions of a cluster is smaller than a predefined threshold.
Hierarchical clustering
Agglomerative clustering represents the hierarchy of clusters in tree form - a
dendrogram.

The root of the dendrogram is a single cluster that contains all data points.
Each leaf is a separate point.
This model can be easily interpreted and shows the similarity between
different data points.
Final clusters are obtained by pruning the dendrogram at some level; thus,
one can choose the number of clusters to use (see the sketch below).
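
A minimal sketch of agglomerative clustering with SciPy, assuming Ward linkage and a synthetic dataset; the dendrogram is pruned with fcluster to obtain a fixed number of clusters.

```python
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Agglomerative clustering: 'ward' merges the pair of clusters
# that gives the smallest increase in within-cluster variance
Z = linkage(X, method="ward")

# Prune the dendrogram to obtain exactly 3 clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z) draws the tree itself (requires matplotlib)
```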
Example

Multi-dimensional visualizations for clustering


DBSCAN: density-based spatial clustering
of applications with noise
Suitable for data with a large number of groups that intersect.

The previous algorithms were based on a distance function: points close in distance were put in the
same cluster. On scattered data, algorithms based only on distance tend to
perform badly and to treat every point as a cluster.

DBSCAN is robust to outliers and can also be considered an anomaly detection algorithm.

The algorithm can handle arbitrary cluster shapes and does not require the number of
clusters to be specified a priori.

Parameters:
ε – the maximum neighborhood radius
n – the minimum number of points to form a region
DBSCAN algorithm
1. For the current point, calculate the distance to all other points.
2. Mark all points that are in the ε-neighborhood of the
current point as neighbors.
3. If the number of neighbors is greater than or equal to n, merge the
points to form a cluster.
4. If the number of neighbors is smaller than n, mark the point as an
outlier.
5. Continue with the other points until all of them are marked as outliers or
belong to a cluster (see the sketch below).
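
A short sketch of DBSCAN with scikit-learn on synthetic "two moons" data; the values of eps and min_samples are illustrative and would need tuning on real data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two intertwined half-moons: a shape that k-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius, min_samples the minimum
# number of points required to form a dense region
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))   # cluster ids; -1 marks outliers (noise)
```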
Mean Shift Clustering
A non-parametric feature-space analysis
technique for locating the maxima of a
density function.
Alternative name: mode-seeking algorithm.

1. Define the kernel (window) size and place a window on the current point.

2. Calculate the centroid of all points located inside the window.
3. Move the window to the centroid.
4. Repeat steps 2-3 until convergence (a sketch follows below).

More information here.


Pros:
• Does not require the number of clusters a priori
• Can handle various cluster shapes
• Can handle various feature spaces
• Robust to outliers

Cons:
• The window size can significantly affect the performance: a wrong size can merge modes or create multiple false-positive modes
• Computationally expensive

OpenCV: cv.meanShift
Sklearn: sklearn.cluster.MeanShift
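
A minimal sketch using sklearn.cluster.MeanShift on synthetic blobs; the bandwidth (window size) is estimated from the data with estimate_bandwidth, and the quantile value is only an example.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# The bandwidth plays the role of the window (kernel) size;
# estimate_bandwidth gives a data-driven starting point
bw = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bw).fit(X)

print(ms.cluster_centers_)     # one centroid per detected mode
print(len(set(ms.labels_)))    # number of clusters found automatically
```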
How to choose the number of clusters?

"Do not forget that clusters are, in large part, in the
eye of the beholder."
(Estivill-Castro, 2002)

The number of clusters is chosen:

❑ based on metrics, so that the clusters explain the variance in the data;
❑ intuitively or based on prior knowledge (for example, from the
customer).
Clustering Metrics
Elbow rule
1. Group the data into 2, 3, ..., n clusters.
2. For each clustering, calculate the average
distance to the centroid within each cluster and average
the results (D).
3. Plot the graph D = f(n).
4. The optimal cluster number N is located
immediately after the last steep
decline, between D_{N-1} and D_N.

yellowbrick.cluster.elbow.KElbowVisualizer
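
A sketch of the elbow rule using KMeans.inertia_ (the sum of squared distances to the nearest centroid) instead of the average distance D from the slide; both produce the same kind of elbow plot, and KElbowVisualizer automates this.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)

# Within-cluster dispersion for k = 2..9
ks = range(2, 10)
scores = [KMeans(n_clusters=k, n_init=10, random_state=7).fit(X).inertia_
          for k in ks]

plt.plot(list(ks), scores, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("sum of squared distances to centroids")
plt.show()   # pick k just after the last steep decline
```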
Clustering Metrics
The silhouette value measures the consistency within clusters of data.

It shows how similar an object is to its own cluster (cohesion) compared to other
clusters (separation).

The silhouette ranges from −1 to +1, where a high
value indicates that the object is well matched to its
own cluster and poorly matched to neighboring
clusters.

Clustering metrics on synthetic data
Sklearn Silhouette analysis
sklearn.metrics.silhouette_score
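
A short sketch computing the mean silhouette score with scikit-learn for several candidate numbers of clusters; the synthetic data has four blobs, so k = 4 should score highest.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=3)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(X)
    # Mean silhouette over all samples: close to +1 means points sit
    # well inside their cluster, close to -1 means they are misassigned
    print(k, silhouette_score(X, labels))
```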
Anomaly Detection

An anomaly (also called an outlier or noise) is an
unexpected value or event that significantly
differs from normal behaviour.

Its appearance may indicate objective
internal changes or external interventions.

Examples of anomaly detection here.
Anomaly Detection

• Anomalies occur very rarely (anomaly
detection is a problem
with imbalanced classes);
• The properties of anomalies differ significantly
from those of normal instances.
Gaussian Mixture Models and
Expectation Maximization
Each separate cluster can be
represented as a Gaussian distribution,
so the data is a mixture of Gaussian
distributions.
GMM tries to group points that
come from the same distribution.

If the data is a realization of a mixture of three Gaussians with means (μ1, μ2,
μ3) and standard deviations (σ1, σ2, σ3), then GMM will identify for each
point a probability distribution over the different clusters (see the sketch below).
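
A minimal sketch of fitting a Gaussian mixture with scikit-learn (EM runs inside fit); the three blobs are synthetic and the number of components is assumed known here.

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=2)

# EM fits the means, covariances and weights of the three Gaussians
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=2)
gmm.fit(X)

hard = gmm.predict(X)          # most probable cluster per point
soft = gmm.predict_proba(X)    # per-point probability for each cluster
print(gmm.means_)
print(soft[:3].round(3))
```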
One-class SVM
A modification of the traditional SVM that
transforms the feature space so that the
observations lie as far as possible from the
origin. Two common approaches are
Schölkopf et al. and Tax & Duin.

As a result, "normal" data points lie on one side of the
learned boundary, and abnormal values (anomalies) lie on the
other side.

The algorithm can be used for novelty detection.

The training dataset should not contain any anomalies.

The kernel trick can be used for non-linear transformations of the feature space (see the sketch below).
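
A short sketch of one-class SVM novelty detection with scikit-learn; the training set contains only "normal" points, and the nu and gamma values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)                        # "normal" points only
X_test = np.vstack([0.3 * rng.randn(20, 2),
                    rng.uniform(-4, 4, size=(20, 2))])   # mixed with outliers

# nu bounds the fraction of training errors / support vectors;
# the RBF kernel is the non-linear feature-space transformation (kernel trick)
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

pred = oc_svm.predict(X_test)   # +1 = normal, -1 = anomaly
print(pred)
```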
Isolation Forest (iForest)
Based on the assumption that there is a small number of
anomalies and that they are significantly different from normal
observations.
That is why iForest deliberately isolates abnormal points.

The key idea is that abnormal points are easier to separate:
they end up at a smaller depth of the tree.

Implementation of iForest:
❑ The training set is used to build the isolation trees.
❑ Each point of the test set is run through the iForest and receives an
anomaly score.

It has linear complexity and is easily applicable to large
datasets. It usually performs better than the one-class SVM
approach (see the sketch below).
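
A minimal sketch of IsolationForest with scikit-learn on synthetic data; the contamination value (expected fraction of anomalies) is an assumption for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),      # normal observations
               rng.uniform(-6, 6, size=(10, 2))])    # a few anomalies

# contamination is the expected fraction of anomalies in the data
iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
iso.fit(X)

labels = iso.predict(X)          # +1 = normal, -1 = anomaly
scores = iso.score_samples(X)    # lower score = more anomalous
print((labels == -1).sum(), "points flagged as anomalies")
```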
What is dimensionality?

Dimensionality is the number of dimensions, features, or variables associated
with a sample of data.

In many cases, numerous dimensions correlate, e.g. two (or more) variables
explain the same phenomenon.

Furthermore, each additional feature adds bias to the data and may represent noise.

Thus, with many variables explaining the same phenomenon, we also increase
the bias in the data, which makes it more difficult for an algorithm to find relations.

In the end, the algorithm might end up mimicking the noise instead of extracting
the meaningful information.
Suppose we have a system that analyzes ECG data from a wearable device
and produces an alert when there are signs of a heart attack.

Much of the ECG signal is noise, and by reducing the
dimensionality we improve the algorithm's accuracy and its performance (time and
computation).
What is dimensionality?
An unsupervised transformation creates a new representation of
the data that is easier to understand (for a human) and analyze
(for an ML algorithm).

Dimensionality reduction takes a high-dimensional representation
of the data and summarizes the data with only its essential
characteristics, namely with fewer features.

Dimensionality reduction manages data sparsity while keeping
the useful information.

A common application is reduction to two dimensions for
visualization purposes.
Dimensionality reduction BRANCHES

Linear projection – projecting data from high- to low-dimensional space:
• Principal component analysis (PCA)
• Singular value decomposition (SVD)
• Linear discriminant analysis (LDA)
• Non-negative matrix factorization (NNMF)
• Dictionary learning
• Random trees embedding
• Independent component analysis (ICA)

Manifold learning – nonlinear dimensionality reduction:
• Isomap
• Multidimensional scaling (MDS)
• Locally linear embedding (LLE)
• t-distributed stochastic neighbour embedding (t-SNE)
Dimensionality reduction APPLICATIONS

• Pre-processing and feature engineering – isolate the most important
components and thus reduce the computation and training time.
• Noise reduction (filtering). Noise does not comprise a large portion of the
variation within the data, so we can remove noise from the signal by removing
the components with smaller variation and then restoring the data back to the
original data space.
• Generating plausible artificial datasets – the method divides the dataset
into components of information and thus gives us the opportunity to
investigate the effect of each component and to generate new dataset samples.
Principal Components Analysis (PCA)

AIM: to find a low-dimensional representation of the data, retaining
as much of the variation as possible.

It finds the features with high correlations, attempts to combine
the highly correlated features, and represents them with a smaller
number of linearly uncorrelated features.
When to use PCA
1. We need to reduce the dimensionality, but do not know which
variables to remove.
2. It is critical to have only independent variables.
3. You are able to tolerate the non-interpretability of the new variables.

If the answers are (1) Yes, (2) Yes or No, (3) Yes – use PCA.
If the answers are (1) Yes, (2) Yes or No, (3) No – PCA is not a good fit, since its components are hard to interpret.
PCA algorithm
1. Standardize the variables column-wise. The sample variance of a column is

$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})$

Having done this, we obtain the matrix Z of standardized features.

2. Compute the covariance matrix S, whose entries are the pairwise covariances

$\sigma(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$

Detailed explanations on the covariance matrix are here.
PCA algorithm
From the standardized data we obtain the covariance matrix S.

3. Calculate the eigenvectors and eigenvalues of S:

$S v = \lambda v$

where $\lambda$ is an eigenvalue and $v$ the corresponding eigenvector.

Covariance matrix × eigenvectors = eigenvalues × eigenvectors
(n × n) (n × k) = (n × k)
where k is the number of features we want to keep.
PCA algorithm

4. Reorient the data:

$Z^* = Z V$
(J × k) = (J × n) (n × k)
J rows and k principal features.

Depending on the sign convention of the eigenvectors produced by the numpy package, a component may need to be multiplied by the sign of its 0th value.

Thus, we obtain a feature matrix that keeps the maximum variance with a
minimal number of features. This matrix is then used as the feature matrix
for ML algorithms, or for visualizations (see the sketch below).
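
A minimal NumPy sketch of steps 1-4 above on a made-up correlated dataset; np.cov and np.linalg.eigh replace the hand-written covariance and eigendecomposition.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5) @ rng.randn(5, 5)       # toy data with correlated features

# 1. Standardize column-wise
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Covariance matrix (n_features x n_features)
S = np.cov(Z, rowvar=False)

# 3. Eigenvectors / eigenvalues, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(S)          # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep k components and reorient the data: Z* = Z V
k = 2
V = eigvecs[:, :k]
Z_star = Z @ V
print(Z_star.shape)                           # (100, 2)
print(eigvals[:k] / eigvals.sum())            # explained variance ratio
```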
PCA algorithm
1. The covariance matrix contains the relationships between the variables in
our feature matrix.

2. Eigenvectors represent directions, eigenvalues represent importance:
larger values correspond to more important directions.

3. We assume that large variability in a certain direction is associated
with the behavior of the target variable. In other words, small variability is usually
associated with noise, while large variability is typically a signal.
PCA in practice
PCA implementation for face recognition: EIGENFACES and a simpler version,
already in Colab here, dataset, explained here.

Scikit learn PCA

Other types of PCA with explanations and examples

Water treatment PCA manual, data are here, dataset description

Embeddings Projector
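
For comparison, a short sketch of the scikit-learn PCA API, here assuming standardization first and keeping enough components to explain 95% of the variance; the iris data is used only as a placeholder.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize, then keep enough components to explain 95% of the variance
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipe.fit_transform(X)

print(X_reduced.shape)
print(pipe.named_steps["pca"].explained_variance_ratio_)
```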
Independent component analysis
In real-world data, it is often the case that many independent signals
are mixed together in the observed features.
This group of problems is referred to as the 'cocktail party problem':

we need to separate the signals of the individual speakers with only the data from
the microphones available.
The problem is to separate overlapping (blended) signals.
ICA algorithm
Let n be the number of independent data sources.
The records from the microphones (the dataset) can be modeled as

$X = A s$

A is the n × n mixing matrix, i.e. the matrix of linear combinations that turns the
source signals into the microphone recordings (n people speaking and n microphones).
X is a random vector (of size n) representing the multivariate input measurements;
$x^{(i)}_j$ is the acoustic reading recorded by microphone j at time i.
s is a latent source vector (of size n), whose components are independently
distributed random variables;
$s^{(i)}_j$ is the sound of speaker j at time i.
Given X, the AIM of independent component analysis is:

- to estimate the mixing matrix A;
- to estimate the distribution of the sources $s_j$, j = 1, …, n.

$X = A s$

$s = A^{-1} X$

For convenience we define the unmixing matrix

$W = A^{-1}$

so that $s = W X$, or per sample $s^{(i)} = W x^{(i)}$.

If we denote the j-th row of W as $w_j^T$:

$W = \begin{bmatrix} - \, w_1^T \, - \\ - \, w_2^T \, - \\ \vdots \\ - \, w_n^T \, - \end{bmatrix}$

then the recovered source j is $s^{(i)}_j = w_j^T x^{(i)}$.
ICA ambiguity
1. If we have n sound sources, their order might be different, i.e. we are
not able to distinguish Z from Z·P, where P is a permutation matrix.

Thus, we are not able to distinguish W from P·W. However, this is not critical
for the majority of applications.

2. We are not able to recover the correct scale of the sources.

If the mixing matrix is 2A and the signal is 0.5s, the observed X is the same.
Again, this is not critical, as it only refers to volume scaling.
Stumble on Gaussian
We are able to recover the sources s from the data X only if s is non-Gaussian.
The density of the Gaussian distribution is rotationally symmetric.

For data with n = 2, suppose $s \sim \mathcal{N}(0, I)$, where I is the 2 × 2 identity matrix.

Then $X = As$ is Gaussian with zero mean and covariance $E[XX^T] = AA^T$.

Now take an arbitrary rotation (orthogonal) matrix R, so that $RR^T = R^TR = I$,
and suppose the mixing matrix A has been replaced by $A' = AR$.

The distribution of $X' = A's$ is also Gaussian, with covariance $E[X'X'^T] = ARR^TA^T = AA^T$.

Irrespective of whether the mixing matrix is A or $A' = AR$, we obtain data from the same distribution $\mathcal{N}(0, AA^T)$.

The arbitrary rotational component cannot be determined from the data, because the data is
rotationally symmetric.
Density and
Linear Transformation
Let s be a random variable with density $p_s$, and let X = As.

What is the density of X?
Recall that $W = A^{-1}$.

It seems logical to compute s = WX and then take $p_X(x) = p_s(Wx)$, but this expression is incomplete.

Example: s ~ Uniform[0, 1], so $p_s(s) = 1\{0 \le s \le 1\}$.

Let A = 2, then X = 2s. X is distributed uniformly on the interval [0, 2], so its density is $p_X(x) = 0.5 \cdot 1\{0 \le x \le 2\}$,
and W = 0.5. Thus, the correct way to calculate the density is:

$p_X(x) = p_s(Wx)\,|W|$
ICA algorithm
The joint distribution of the sources is the product of the individual densities:

$p(s) = \prod_{j=1}^{n} p_s(s_j)$

This representation keeps the assumption that the sources are independent.
Using the linear transformation result above, the density of the observations is

$p(x) = \prod_{j=1}^{n} p_s(w_j^T x)\,|W|$

Each density $p_s$ can be obtained by differentiating its cumulative distribution function:

$F(z_0) = P(z < z_0) = \int_{-\infty}^{z_0} p_z(z)\,dz$
ICA algorithm
Thus, to specify the density of each source we need to choose its distribution.
A Gaussian is not an option.

If there is no prior knowledge about the distribution, we take the sigmoid as the CDF:

$g(s) = \frac{1}{1 + e^{-s}}, \qquad p_s(s) = g'(s)$

If there is reliable information about the distribution shape, then you may use that
distribution instead of the sigmoid.

Note that the data need to be pre-processed to have zero mean.
ICA algorithm
The log-likelihood of the data given W is maximized:

$\ell(W) = \sum_{i=1}^{m}\left(\sum_{j=1}^{n} \log g'(w_j^T x^{(i)}) + \log|W|\right) \rightarrow \max_W$

Thus, we can move to gradient ascent:

$W := W + \alpha\left(\left[1 - 2g(Wx^{(i)})\right]x^{(i)T} + (W^T)^{-1}\right)$

As the algorithm converges, we recover the individual signals as $s^{(i)} = W x^{(i)}$.
Scikit learn ICA
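
A short sketch of the cocktail-party setup with sklearn.decomposition.FastICA (a different estimation procedure from the gradient ascent above, but solving the same unmixing problem); the two sources and the mixing matrix below are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Toy cocktail-party setup: two sources, two "microphones"
rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
s = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # non-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.7, 1.0]])                          # unknown mixing matrix
X = s @ A.T                                         # observed mixtures (rows = samples)

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)     # recovered sources (up to order and scale)
print(ica.mixing_)               # estimated mixing matrix A
```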


Bonus: UMAP vs t-SNE
• t-distributed Stochastic Neighbour Embedding (t-SNE)
• Uniform Manifold Approximation and Projection (UMAP)

Shared algorithm outline:
• Compute high-dimensional probabilities p.
• Compute low-dimensional probabilities q.
• Calculate the difference between the probabilities with a given cost function C(p, q).
• Minimize the cost function.

- SNE
- t-SNE
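
A minimal t-SNE sketch with scikit-learn on the 64-dimensional digits dataset; the perplexity value is just the default and only illustrative (UMAP offers a similar fit_transform API in the third-party umap-learn package).

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)       # 64-dimensional digit images

# t-SNE follows the outline above: p in high dimension, q in 2-D,
# and a KL-divergence cost between them is minimized
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)                          # (1797, 2)
```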
Bonus: UMAP vs t-SNE

https://pair-code.github.io/understanding-umap/
References
Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on pattern analysis and machine
intelligence, 17(8), 790-799.

Dangeti, P. (2017). Statistics for machine learning: Build supervised, unsupervised, and reinforcement learning
models using both Python and R. Birmingham, UK : Packt Publishing.
Ertel, W., Black, N., & Mast, F. (2017). Introduction to artificial intelligence. Cham, Switzerland : Springer.

Igual, L., & Seguí, S. (2017). Introduction to Data Science: A Python approach to concepts, techniques and
applications. Springer International Publishing : Imprint : Springer. *

Johnston, B., Jones, A., Kruger, C., & Safari, an O'Reilly Media Company. (2019). Applied unsupervised
learning with Python. Packt Publishing, 2019.

Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. 2008 Eighth IEEE International Conference on Data
Mining.
Liu, F. T., Ting, K. M., Zhou, Z-H. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge
Discovery from Data, 6(1)
References
Patel, A. A. (2019). Hands-On Unsupervised learning using Python: How to build applied machine learning
solutions from unlabeled data. O'Reilly Media. *


Pradhan M., Kumar U. (2019). Machine Learning using Python. Wiley India. *

Swamynathan, M. (2019). Mastering Machine Learning with Python in Six Steps: A Practical Implementation
Guide to Predictive Data Analytics Using Python. Berkeley, CA: Apress L.P.
References
Ayramo, S., Karkkainen, T. (2006). Introduction to partitioning-based clustering methods with a robust example.
Reports of the Department of Mathematical Information Technology Series C. Software and Computational
Engineering, 1, 1-34.

Comaniciu, D., Meer, P. (2002). Mean Shift: A robust approach toward feature space analysis. IEEE Transactions
on pattern analysis and machine intelligence, 24(5), 603-619.

Estivill-Castro, V. (2002). Why so many clustering algorithms — A Position Paper. SIGKDD Explorations, 4 (1),
65-75.
Ghassabeh, Y. A. (2015). A sufficient condition for the convergence of the mean shift algorithm with Gaussian kernel. Journal of
Multivariate Analysis, 135, 1-10.
Zimek, A., & Filzmoser, P. (2018). There and back again: Outlier detection between statistical reasoning and
data mining algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(6).
