and seven-day lookback periods are in the same cluster. That clearly identifies the cluster as 'historical volatility'. In addition, clustering may improve the stability of the important features, since their relative importance within a cluster won't cause some of them to be dropped, and it is less likely that the importance rank of a whole cluster of features will change drastically if we use a different random seed. As discussed in Man & Chan (2021), stability of features improves interpretability.

The rest of this article is organised as follows:

• Section 2 introduces the cMDA algorithm and its use of hierarchical clustering to compute importance scores at the cluster level;

• Section 3 compares the predictive performance of MDA vs cMDA using two synthetic datasets;

• Section 4 compares the predictive performance of MDA vs cMDA on two popular datasets, including a financial dataset that uses technical and fundamental indicators to predict the S&P 500 stock index excess returns;

• Finally, the algorithm is applied to our proprietary trading strategy returns dataset to see if it can identify interpretable clusters and improve the strategy's performance. We find high stability and interpretability of the selected clusters in these financial applications, which should make machine learning employing this technique appealing to investors.

2. cMDA using hierarchical clustering

Cluster-based feature selection consists of two steps: clustering features and ranking clusters. To begin clustering features, we define a distance matrix from the pair-wise correlations of the features, D(i,j) = √(½(1 − ρ(i,j))). As discussed in López de Prado (2020), the ideal distance matrix should be based on one of the information-theoretic metrics, but the correlation matrix is still the one most commonly used in finance. The selection of the distance matrix won't affect the subsequent procedures, though it may affect the predictive performance.

Next, a clustering method should be used to split the feature set into smaller sets according to the distance matrix. K-means and hierarchical algorithms are popular clustering methods. The K-means clustering algorithm fixes the number K of clusters, and observations are assigned to each cluster based on their distance to the centre point. By contrast, hierarchical clustering works in a 'bottom-up' manner. Starting from the bottom, every single feature is taken as a cluster. As we ascend to the next level, the two closest clusters are merged. At the end of the process, all the features are included in a single cluster. We then cut the hierarchical tree at the proper level to create an optimal set of clusters. The outputs of hierarchical clustering have more structure and are more informative than the unstructured set of flat clusters returned by the K-means algorithm.

In the following analysis, we use the hierarchical algorithm as the clustering method.

The number of clusters is determined by finding the number (from 2 to the number of samples minus 1) that maximises the 'clustering quality' q. The clustering quality is related to the silhouette coefficient (Rousseeuw, 1987), which represents how similar a sample is to samples in its own cluster compared with those in other clusters. For the data sample i, its silhouette coefficient is defined as s_i = (b_i − a_i) / max{a_i, b_i}, where a_i is the average distance between i and all other samples in the same cluster, and b_i is the average distance between i and all the samples in the nearest cluster of which i is not a member. Then, for a given partition, the measure of clustering quality q is defined as q = E[S] / Std[S], where E[S] and Std[S] are the mean and standard deviation of the silhouette coefficients for all samples in the training data.

After finding the optimal number of clusters based on maximising q and assigning the features to each cluster, the feature importance algorithm is performed on the clusters rather than on individual features. This means that during MDA feature selection, all the features in a cluster are permuted at the same time, as described in López de Prado (2020). Since this article focuses on how the clustering method can add value to model performance rather than comparison across different feature importance algorithms, we omit presenting the implementation of clustered LIME and SHAP and only discuss clustered MDA. If a feature is isolated in its own cluster, MDA and clustered MDA are the same. The feature importance is measured by the rank-based score proposed by Man & Chan (2021). As the importance score of a cluster is determined by the mean of the importance scores of the features contained in it, a large cluster won't necessarily be more important than a smaller cluster with fewer features.

features. As defined by López de Prado (2018), we have: 1) informative features that are used to determine the label; and 2) noisy features that bear no information on determining the labels and are drawn from standard normal distributions. According to the descriptions in scikit-learn.org, the informative features are drawn independently from a standard normal distribution. However, to introduce clusters into our data, we first randomly draw multiple 'centroids' and generate informative features around them from normal distributions centred around these centroids. We provide a detailed description of the algorithm for creating these clustered features, and how they map to the labels, in Appendix 2.

The dataset has 1,000 samples and 40 features, comprising 20 informative and 20 noisy features. These 20 informative features form three clusters with six or seven features in each cluster.

The selected features are analysed in Table 1. From Panel A, cMDA tends to keep all the informative features, but it also includes a small number of noisy features. In contrast, MDA chooses far fewer features but filters out all the noisy features. The downside is that it also drops a lot of informative features. We may figuratively say that cMDA has a higher recall but lower precision than MDA.

Denote 'I_m_n' as the m-th informative feature, which is assigned to the n-th synthetic cluster. For example, 'I_20_2' means the 20th informative feature, which belongs to the 2nd synthetic cluster. 'N_m' represents the m-th noisy feature. Panel B shows that all the features in the '0' synthetic classification cluster are put into the most important selected cluster. The informative features in the '1' and '2' synthetic clusters are not recovered by the algorithm, since each of them has two features in the same selected cluster and the rest of their features are grouped into another selected cluster. Panel C shows that on the synthetic regression data the algorithm selects all the informative features of their original clusters to form the top two most important selected clusters, but each cluster also includes one or two noisy features.

Table 1: Selected Features on Synthetic Datasets

Panel A: Number of informative features selected by cMDA and MDA

        Synthetic Classification                         Synthetic Regression
cMDA    All 20 informative features, 1 noisy feature     All 20 informative features, 5 noisy features
MDA     Only 11 informative features, 0 noisy features   Only 9 informative features, 0 noisy features

Panel B: Selected clusters in classification data
Panel C: Selected clusters in regression data

Table 2: Prediction Performance Comparison on Synthetic Datasets

        Synthetic Classification      Synthetic Regression
cMDA    0.975   0.998   0.973         545460.60   585.39   0.9626
MDA     0.960   0.996   0.957         436676.44   510.08   0.9700
Full    0.975   0.995   0.973         718870.06   662.79   0.9607

Given that the predictive performances of cMDA and MDA are close, cMDA should be favoured given the increase in interpretability and, as we shall see later, the stability of the selected features.
For most of the datasets in this paper, the data is split into training sets, validation sets and testing sets in the ratio 60:20:20 (some datasets are split differently, and this is noted in the text). The model is trained and the features are clustered in the training set. The clusters are ranked in the validation set, and then the features in the top clusters with above-average importance scores are selected. This would be just the top cluster in both the synthetic classification and regression examples. Using the selected features, the prediction performance is evaluated on the testing set. In Table 2, we compare the out-of-sample results based on the full feature set versus the selected feature subset. The cMDA approach outperforms the full set in both datasets. cMDA also outperforms MDA in the classification dataset but underperforms it in the regression dataset.

[Breast-cancer cluster table, partially recovered: the cluster 'Geometry error' (importance score 0.112) contains 'compactness error', 'concavity error', 'concave points error' and 'fractal dimension error'.]

¹ The data is taken from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).
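The cluster-level permutation at the heart of cMDA can be sketched in a few lines. This is a minimal pure-Python illustration of the idea, not the authors' implementation: `predict` stands for any fitted model's prediction function, `clusters` is a hypothetical mapping from a cluster name to the column indices it contains, and plain accuracy is used as the score instead of the rank-based score used in the article.

```python
import random
import statistics

def clustered_mda(predict, X, y, clusters, n_repeats=10, seed=0):
    """Clustered MDA: permute all features of a cluster together and
    report the mean drop in accuracy. Plain MDA is the special case
    where every cluster holds a single feature."""
    rng = random.Random(seed)
    n = len(X)
    # Baseline accuracy of the already-fitted model.
    base = sum(predict(row) == label for row, label in zip(X, y)) / n
    scores = {}
    for name, cols in clusters.items():
        drops = []
        for _ in range(n_repeats):
            perm = list(range(n))
            rng.shuffle(perm)
            X_perm = [list(row) for row in X]
            # Permute every feature in the cluster with the SAME row order,
            # so the dependence structure within the cluster is preserved.
            for j in cols:
                for i in range(n):
                    X_perm[i][j] = X[perm[i]][j]
            acc = sum(predict(row) == label for row, label in zip(X_perm, y)) / n
            drops.append(base - acc)
        # Importance of the cluster = mean accuracy drop over the repeats.
        scores[name] = statistics.mean(drops)
    return scores
```

As a toy check, with a model that only reads feature 0, permuting the cluster holding feature 0 destroys accuracy, while permuting a pure-noise cluster leaves the score untouched.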
www.technicalanalysts.com
As the scores of clusters with topics ‘Geometry summary’ and ‘Texture summary’ are greater than the average of the 17 features,
these two clusters are selected. While individual feature importance results give ‘worst concave points’, ‘worst perimeter’, ‘worst
radius’, ‘mean concavity’, ‘area error’ and ‘worst texture’ as the most important features, we can easily see here that geometry of
the tumour is the most important cluster, while texture is the second most important.
The rank-based 'instability' of cluster j is defined as the variance of its rank across runs: V_j = Var(r_1j, ..., r_nj).

If we apply this only to the top k clusters, the 'instability index' is defined as I = √((V_(1) + ... + V_(k)) / k),

where V_(k) is the variance of the kth-most important cluster. According to Figure 1, the instability index increases with k and
the most important cluster (Geometry summary) is ranked in 1st place for all 100 runs and the second most important cluster
(Texture summary) is ranked in 2nd place for 99 runs. Notably, the features selected from these two clusters are almost always
positioned in the top.
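The instability index is straightforward to compute from a matrix of cluster ranks collected over repeated runs. A minimal sketch, with assumed conventions rather than the article's code: `rank_history[r][j]` holds the rank of cluster j in run r (rank 1 = most important), and the top k clusters are identified by their mean rank.

```python
import math
import statistics

def instability_index(rank_history, k):
    """I = sqrt((V_(1) + ... + V_(k)) / k), where V_(i) is the variance of
    the rank of the i-th most important cluster across all runs."""
    n_clusters = len(rank_history[0])
    # Variance of each cluster's rank across runs: V_j = Var(r_1j, ..., r_nj).
    variances = [statistics.pvariance([run[j] for run in rank_history])
                 for j in range(n_clusters)]
    # Order clusters by average rank; the lowest mean rank is the most important.
    order = sorted(range(n_clusters),
                   key=lambda j: statistics.mean([run[j] for run in rank_history]))
    return math.sqrt(sum(variances[j] for j in order[:k]) / k)
```

With perfectly stable rankings the index is zero for any k, and it can only grow as less stable, lower-ranked clusters are added.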
After selecting the features from the top two clusters, we train a new random forest on the combined training and validation set and use that to make predictions on the testing set. We can see that cMDA has the best out-of-sample performance on AUC, but that its F1 and Acc underperform the non-clustering methods.

[Figure 2: Instability Analysis for S&P Dataset]
Next, we conduct the analysis on predicting S&P 500 excess returns using economic factors, as discussed in Man and Chan
(2021). The data ranges from January 1945 to December 2019. Excess return is defined as the monthly SPX index return minus the
risk-free rate. The features are a set of fundamental and technical factors that include dividend price ratio (d/p), dividend yield
(d/y), earning price ratio (e/p), dividend payout ratio (d/e), stock variance (svar), book to market (b/m), net equity expansion (ntis),
T-Bill rate (tbl), long term yield (lty), long term return (ltr), term spread (tms), default yield spread (dfy), default return spread (dfr)
and inflation (infl). Fractional differentiation (López de Prado, 2018) is applied to all these features prior to the machine learning
process. The clustering algorithm groups these features into two clusters as shown in Table 5.
Table 5: Feature clusters and importance scores

Cluster       Score   Features
Fundamental   0.63    d/p, d/y, e/p, d/e, svar, ntis, ltr, tms
Technical     0.37    b/m, tbl, lty, dfy, dfr, infl
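Fractional differentiation of order d, applied to the features above, expands (1 − B)^d (B is the backshift operator) into the weights w_0 = 1, w_k = −w_{k−1}(d − k + 1)/k, truncated here to a fixed-width window in the spirit of López de Prado (2018). A small sketch of the idea; the window length and any d value used below are illustrative choices, not parameters from this article:

```python
def fracdiff_weights(d, size):
    # Binomial-expansion weights of (1 - B)^d:
    # w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return w

def fracdiff(series, d, window):
    # Fixed-width-window fractional differencing of a level series:
    # each output is a weighted sum of the last `window` observations.
    w = fracdiff_weights(d, window)
    return [sum(w[k] * series[t - k] for k in range(window))
            for t in range(window - 1, len(series))]
```

Sanity check: d = 1 with a window of 2 reduces to ordinary first differences, while d = 0 returns the series unchanged. A fractional 0 < d < 1 sits in between, removing much of the memory that makes the raw series non-stationary while preserving some of it.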
As these clusters are highly human-interpretable, we again apply descriptive topics to them. The 'Fundamental' cluster contains 8 features and has the higher importance score. The 'Technical' cluster contains 6 features. This cluster can also be called the 'unimportant' cluster, since we only have two clusters, and it is not selected to train the final random forest model.
Figure 2 shows these two clusters are very stable. The instability index remains zero when involving either one or two clusters.
‘Fundamental’ and ‘Technical’ clusters are constantly ranked in the first and second places for all 100 runs.
We split the data into training, validation and testing sets with the periods January 1945-December 2005, January 2006-December 2015 and January 2016-December 2019, respectively. The out-of-sample prediction performance on the testing set, measured by the F1 score, AUC score and Accuracy, is summarised in Table 6. We can see that cMDA outperforms MDA in out-of-sample prediction on all metrics for this dataset.

[Table 6: Prediction Performance Comparison (F1, AUC, Acc)]

Application to Trading Strategy Meta-Labelling

In this section, we apply clustering-based feature selection to a dataset with the labels equal to the sign of the actual historical returns of our proprietary Tail Reaper trading strategy.² We want to see if this algorithm can select stable features and improve the trading performance. This application of financial machine learning is termed 'meta-labelling' (López de Prado, 2018).

The data is from January 2013 to June 2020, with 160 features. We split the data into training/validation/testing sets over the periods 2013-2017, 2018-2019 and 2020. cMDA groups the 160 features into 44 clusters. Among them, eight clusters containing 81 features with above-average importance scores are selected to train a new random forest model. Since the features are proprietary, we do not display the clusters that identify them. Suffice to say that the top two clusters are highly human-interpretable, while the lower-ranked clusters are mixed bags of disparate features.

From Figure 3, the instability index increases with the number of clusters, and the most and second most important clusters are steadily ranked in 1st and 2nd places respectively for all 100 runs. The third most important cluster is not as stable as the first two. Given that the third cluster is a mixed bag of features of uninterpretable theme, this isn't a surprising result.

Table 7: Prediction Performance Comparison

              F1      AUC     Acc
cMDA          0.658   0.672   0.614
cMDA (top 2)  0.595   0.640   0.571
MDA           0.602   0.537   0.529
Full          0.481   0.416   0.414
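The meta-labelling setup itself is simple to express. In this hypothetical sketch (the function names are ours, not from the Tail Reaper code), the primary strategy's historical returns are turned into binary labels, a classifier trained on the selected feature clusters predicts those labels, and the strategy only trades when the predicted probability of profit clears a threshold:

```python
def meta_labels(strategy_returns):
    # Label each period 1 if the primary strategy made money, else 0.
    return [1 if r > 0 else 0 for r in strategy_returns]

def filter_trades(strategy_returns, prob_of_profit, threshold=0.5):
    # Take the trade only when the secondary (meta) model is confident
    # enough; otherwise stay flat and earn 0 for that period.
    return [r if p > threshold else 0.0
            for r, p in zip(strategy_returns, prob_of_profit)]
```

Here `prob_of_profit` would come from, for example, a random forest trained on the selected feature clusters. The meta-model does not generate signals itself; it only vetoes (or sizes) the primary strategy's trades.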
² See www.qtscm.com/accounts for more details.
Conclusions

Ranking a cluster is more stable than ranking a feature, and such stability enhances the model interpretability. It is also easier to interpret the clusters by examining the common characteristics of the features contained within each cluster. For example, for the S&P 500 excess returns dataset, we can identify the top cluster as fundamental indicators and the second-ranked cluster as mainly technical indicators.

The clustering algorithm also improves the predictive performance over non-clustered MDA feature selection on the S&P 500 excess returns dataset and the proprietary Tail Reaper strategy returns dataset, though not on the synthetic datasets. Their predictive performances on the Breast Cancer dataset are similar.

In this article, the clustering algorithm is driven by a correlation-based metric. As the distance matrix needs only to satisfy non-negativity, identity, symmetry and sub-additivity, we may be able to improve the model performance by choosing other information-theoretic metrics which also satisfy these conditions. We also chose hierarchical clustering instead of K-means; we discuss our reasons for doing so in Appendix 1. Further work can also investigate whether clustering can improve the SHAP and LIME feature-selection methods that we compared in Man and Chan (2021).

ACKNOWLEDGEMENTS

We thank Radu Ciobanu, Sayooj Balakrishnan, and Roger Hunter for many useful suggestions and technical assistance.

References

• Hastie, T., R. Tibshirani and J. Friedman (2009) The elements of statistical learning. 2nd Edition, Springer.
• López de Prado, M. (2018) Advances in financial machine learning. John Wiley & Sons.
• López de Prado, M. (2020) Machine learning for asset managers. Cambridge University Press.
• Man, X. and E.P. Chan (2021) The best way to select features: Comparing MDA, LIME and SHAP. Journal of Financial Data Science, Winter 2021. DOI: https://doi.org/10.3905/jfds.2020.1.047
• Rousseeuw, P. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20: 53-65.
• Xiong, H., J. Wu and J. Chen (2008) K-means clustering versus validation measures: a data distribution perspective. IEEE Transactions on Systems, Man, and Cybernetics, Part B 39(2): 318-331.

Appendix 1: Reasons to choose hierarchical clustering

K-means uses the Euclidean distance metric and results in nearly identical cluster sizes³, with a limited number of clusters. By contrast, hierarchical clustering can generate many more clusters, and other forms of distance metrics can be used. For example, Jaccard similarity, which measures the distance between two binary categorical variables, is not a Euclidean metric and cannot be applied to K-means clustering, but it can be used for hierarchical clustering. In the example datasets we studied, some clusters do contain more features than others. The number of features in each cluster should adapt to the nature of the features, and a tendency to produce clusters of equal size is not desirable.

Appendix 2: The algorithm for generating synthetic clusters

Suppose we want to generate a dataset of n samples with K synthetic clusters, m informative features and p noisy features. We can follow the procedure below:

1. Sample K centroids independently from Uniform(-10, 10);
2. The number of informative features per cluster, c₁, c₂, …, c_K, is ⌊m/K⌋ or ⌊m/K⌋+1;
3. For the ith cluster, independently generate cᵢ features by sampling n times per feature from a univariate normal distribution with mean equal to the value of the ith centroid and with standard deviation 0.5. In other words, we draw n×cᵢ random numbers from N(μᵢ, 0.5), where μᵢ is the ith centroid, to populate all the features within the ith cluster for all n samples;
4. For a classification model, randomly assign the label 0 or 1 with probability 0.5 to each sample. Create a random matrix M(m×m) by sampling from Uniform(-1, 1). Form the product of M and the informative feature matrix of each class label, X(m×(n/2)), to create MX(m×(n/2)). Stack the matrices of the two classes to get MX(m×n). In other words, we map two different linear combinations of the informative features to the two class labels, where the coefficients of the linear combinations are random but fixed over the samples with the same label;
5. For a regression model, create a random matrix M(m×m) by sampling from Uniform(-1, 1). Form the product of M and the entire informative feature matrix X(m×n) to create MX(m×n). Create a random vector β(m×1) by sampling from Uniform(0, 100), and then set the labels as y(n×1) = (MX(m×n))ᵀ β(m×1). In other words, we map a linear combination of the informative features to the continuous labels, where the coefficients of the linear combination are random but fixed over all samples;
6. Add the p×n noisy-feature matrix by sampling from a standard normal distribution.

³ Due to the 'uniform effect' proposed and discussed in Xiong, Wu and Chen (2008), K-means tends to generate clusters with relatively uniform sizes.


Not a major top for Nasdaq-100

Bruno Estier CFTe

Bruno Estier is a Global Market Advisor and Technical Analyst coach in Geneva, Switzerland, for professional Traders and Portfolio Managers. Past President of the Swiss Association of Market Technicians (SAMT) for 12 years, he also served as Chairman and as Secretary on the board of directors of IFTA. Bruno founded the French Society of Technical Analysts (AFATE) in 1990. He holds the Diploma from the STA and the professional certifications from IFTA. You can find his work at Bruno Estier Strategic Technicals, www.estier.net/bruno, and [email protected]. Here he shares his US equities outlook, written in November 2020 for Wealthgram.

Introduction

As the US equity market has been undergoing a correction in the last three weeks, and as last month we mentioned that October to April is the seasonally bullish period, a review of the leading sector of the US market, the Nasdaq-100, is necessary to evaluate the strength of its uptrend.

Still in Uptrend

The Relative Strength (RS) of the Nasdaq-100 versus the S&P 500 (dotted green line on the upper panel) has been stalling below its July top, taking the form of a rising triangle, and in Oct 2020 stands at levels well above the low made in early September, though it has flattened. But it still shows an uptrend in Oct, as the chart does not display a lower low. Therefore it is likely that the Technology sector, mainly represented within the Nasdaq-100, is still a leading sector for the US equity market.

The bullishness is not limited to a few large Technology stocks, as we note that the Relative Strength of Small Caps versus the S&P 500 (black line on the upper panel) has been bottoming and rose from mid-September to October 2020, which is a classic bullish sign of widening Breadth. Thus, the pullback of the Nasdaq-100 in Nov, toward the 40-week moving average (9,931), is seen as a pause in the Bull market rather than the beginning of a Bear market. This pullback relieves an overbought situation, which was highlighted by its rise since May 2020 between the first and second Bollinger bands, and by a spike in late August and a retest in October of the second upper Band. The fear can be noted on the VXN (orange dotted line on the upper panel) reaching the previous spike high of September near 41.30%.

However, volatility above 40% in the VXN is rare and often signals a nearby low on the underlying equity index! So overall, in Nov 2020 it may well be time to be contrarian and not to panic along with the classic price momentum indicators, like STOCHASTICS or MACD on the lower panel, which are crossing down. Of course, the Nasdaq-100 needs to display a move up to avoid breaking below the previous low of 10,677, ideally holding above 10,900, the rising former resistance trend line dating from October 2018. Such a rebound in price will validate the ranging pattern between 12,430 and 10,700, which in the medium term will open the door for higher prices toward 14,000. So, trend-followers beware!