and seven-day lookback periods are in the same cluster. That clearly identifies the cluster as 'historical volatility'. In addition, clustering may improve the stability of the important features, since their relative importance within a cluster won't cause some of them to be dropped, and it is less likely that the importance rank of a whole cluster of features will change drastically if we use a different random seed. As discussed in Man & Chan (2021), stability of features improves interpretability.

The rest of this article is organised as follows:

• Section 2 introduces the cMDA algorithm and its use of hierarchical clustering to compute importance scores at the cluster level;

• Section 3 compares the predictive performance of MDA vs cMDA using two synthetic datasets;

• Section 4 compares the predictive performance of MDA vs cMDA on two popular datasets, including a financial dataset that uses technical and fundamental indicators to predict the S&P 500 stock index excess returns;

• Finally, the algorithm is applied to our proprietary trading strategy returns dataset to see if it can identify interpretable clusters and improve the strategy's performance. We find high stability and interpretability of the selected clusters in these financial applications, which should make machine learning employing this technique appealing to investors.

2. cMDA using hierarchical clustering

Cluster-based feature selection consists of two steps: clustering features and ranking clusters. To begin clustering features, we define a distance matrix from the pair-wise correlations of the features, D(i,j) = √(½(1 − ρ(i,j))). As discussed in López de Prado (2020), the ideal distance matrix should be based on one of the information-theoretic metrics, but the correlation matrix is still the one most commonly used in finance. The selection of the distance matrix won't affect the subsequent procedures, though it may affect the predictive performance.

Next, a clustering method should be used to split the feature set into smaller sets according to the distance matrix. K-means and hierarchical algorithms are popular clustering methods. The K-means clustering algorithm fixes the number K of clusters, and observations are assigned to each cluster based on their distance to the centre point. By contrast, hierarchical clustering works in a 'bottom-up' manner. Starting from the bottom, every single feature is taken as a cluster. As we ascend to the next level, the two closest clusters are merged. At the end of the process, all the features are included in a single cluster. We then cut the hierarchical tree at the proper level to create an optimal set of clusters. The outputs of hierarchical clustering have more structure and are more informative than the unstructured set of flat clusters returned by the K-means algorithm.

In the following analysis, we use the hierarchical algorithm as the clustering method.

The number of clusters is determined by finding the number (from 2 to the number of samples minus 1) that maximises the 'clustering quality' q. The clustering quality is related to the silhouette coefficient (Rousseeuw, 1987), which represents how similar a sample is to samples in its own cluster compared with those in other clusters. For the data sample i, its silhouette coefficient is defined as s_i = (b_i − a_i) / max{a_i, b_i}, where a_i is the average distance between i and all other samples in the same cluster, and b_i is the average distance between i and all the samples in the nearest cluster of which i is not a member. Then, for a given partition, the measure of clustering quality q is defined as q = E[S] / Std[S], where E[S] and Std[S] are the mean and standard deviation of the silhouette coefficients for all samples in the training data.

After finding the optimal number of clusters based on maximising q and assigning the features to each cluster, the feature importance algorithm is performed on the clusters rather than on individual features. This means that during MDA feature selection, all the features in a cluster are permuted at the same time, as described in López de Prado (2020). Since this article focuses on how the clustering method can add value to model performance rather than comparison across different feature importance algorithms, we omit presenting the implementation of clustered LIME and SHAP and only discuss clustered MDA. If a feature is isolated in its own cluster, MDA and clustered MDA are the same. The feature importance is measured by the rank-based score proposed by Man & Chan (2021). As the importance score of a cluster is determined by the mean of the importance scores of the features contained in it, a large cluster won't necessarily be more important than a smaller cluster with fewer features.

features. As defined by López de Prado (2018), we have: 1) informative features that are used to determine the label; and 2) noisy features that bear no information on determining the labels and are drawn from standard normal distributions. According to the descriptions in scikit-learn.org, the informative features are drawn independently from a standard normal distribution. However, to introduce clusters into our data, we first randomly draw multiple 'centroids' and generate informative features around them from normal distributions centred around these centroids. We provide a detailed description of the algorithm for creating these clustered features, and how they map to the labels, in Appendix 2.

The dataset has 1,000 samples and 40 features, comprising 20 informative and 20 noisy features. These 20 informative features form three clusters with six or seven features in each cluster.

The selected features are analysed in Table 1. From Panel A, cMDA tends to keep all the informative features, but it also includes a small number of noisy features. In contrast, MDA chooses far fewer features but filters out all the noisy features. The downside is that it also drops a lot of informative features. We may figuratively say that cMDA has a higher recall but lower precision than MDA.

Denote 'I_m_n' as the m-th informative feature, which is assigned to the n-th synthetic cluster. For example, 'I_20_2' means the 20th informative feature, which belongs to the 2nd synthetic cluster. 'N_m' represents the m-th noisy feature. Panel B shows that all the features in the '0' synthetic classification cluster are put into the most important selected cluster. The informative features in the '1' and '2' synthetic clusters are not recovered by the algorithm, since each of them has two features in the same selected cluster and the rest of their features are grouped into another selected cluster. Panel C shows that on the synthetic regression data the algorithm selects all the informative features of their original clusters to form the top two most important selected clusters, but each cluster also includes one or two noisy features.

Table 1: Selected Features on Synthetic Datasets

Panel A: Number of informative features selected by cMDA and MDA

        Synthetic Classification                         Synthetic Regression
cMDA    All 20 informative features, 1 noisy feature     All 20 informative features, 5 noisy features
MDA     Only 11 informative features, 0 noisy features   Only 9 informative features, 0 noisy features

Panel B: Selected clusters in classification data
Panel C: Selected clusters in regression data

Table 2: Prediction Performance Comparison on Synthetic Datasets

        Synthetic Classification      Synthetic Regression
cMDA    0.975   0.998   0.973         545460.60   585.39   0.9626
MDA     0.960   0.996   0.957         436676.44   510.08   0.9700
Full    0.975   0.995   0.973         718870.06   662.79   0.9607

Given that the predictive performances of cMDA and MDA are close, cMDA should be favoured given the increase in interpretability and, as we shall see later, the stability of the selected features.
For most of the datasets in this paper, the data is split into training sets, validation sets and testing sets in the ratio 60:20:20 (some datasets are split differently, and this is noted in the text). The model is trained and the features are clustered in the training set. The clusters are ranked in the validation set, and then the features in the top clusters with above-average importance scores are selected. This would be just the top cluster in both the synthetic classification and regression examples. Using the selected features, the prediction performance is evaluated on the testing set. In Table 2, we compare the out-of-sample results based on the full feature set versus the selected feature subset. The cMDA approach outperforms the full set in both datasets. cMDA also outperforms MDA in the classification dataset but underperforms it in the regression dataset.

[Breast-cancer cluster table, partially recovered: the cluster 'Geometry error' (importance score 0.112) contains 'compactness error', 'concavity error', 'concave points error' and 'fractal dimension error'.]

¹ The data is taken from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).
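The cluster-level permutation at the heart of cMDA can be sketched in a few lines. This is a minimal pure-Python illustration of the idea, not the authors' implementation: `predict` stands for any fitted model's prediction function, `clusters` is a hypothetical mapping from a cluster name to the column indices it contains, and plain accuracy is used as the score instead of the rank-based score used in the article.

```python
import random
import statistics

def clustered_mda(predict, X, y, clusters, n_repeats=10, seed=0):
    """Clustered MDA: permute all features of a cluster together and
    report the mean drop in accuracy. Plain MDA is the special case
    where every cluster holds a single feature."""
    rng = random.Random(seed)
    n = len(X)
    # Baseline accuracy of the already-fitted model.
    base = sum(predict(row) == label for row, label in zip(X, y)) / n
    scores = {}
    for name, cols in clusters.items():
        drops = []
        for _ in range(n_repeats):
            perm = list(range(n))
            rng.shuffle(perm)
            X_perm = [list(row) for row in X]
            # Permute every feature in the cluster with the SAME row order,
            # so the dependence structure within the cluster is preserved.
            for j in cols:
                for i in range(n):
                    X_perm[i][j] = X[perm[i]][j]
            acc = sum(predict(row) == label for row, label in zip(X_perm, y)) / n
            drops.append(base - acc)
        # Importance of the cluster = mean accuracy drop over the repeats.
        scores[name] = statistics.mean(drops)
    return scores
```

As a toy check, with a model that only reads feature 0, permuting the cluster holding feature 0 destroys accuracy, while permuting a pure-noise cluster leaves the score untouched.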
www.technicalanalysts.com
As the scores of clusters with topics ‘Geometry summary’ and ‘Texture summary’ are greater than the average of the 17 features,
these two clusters are selected. While individual feature importance results give ‘worst concave points’, ‘worst perimeter’, ‘worst
radius’, ‘mean concavity’, ‘area error’ and ‘worst texture’ as the most important features, we can easily see here that geometry of
the tumour is the most important cluster, while texture is the second most important.
The rank-based 'instability' of cluster j is defined as the variance of its rank across runs: V_j = Var(r_1j, ..., r_nj).

If we apply this only to the top k clusters, the 'instability index' is defined as I = √((V_(1) + ... + V_(k)) / k),

where V_(k) is the variance of the kth-most important cluster. According to Figure 1, the instability index increases with k and
the most important cluster (Geometry summary) is ranked in 1st place for all 100 runs and the second most important cluster
(Texture summary) is ranked in 2nd place for 99 runs. Notably, the features selected from these two clusters are almost always
positioned in the top.
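The instability index is straightforward to compute from a matrix of cluster ranks collected over repeated runs. A minimal sketch, with assumed conventions rather than the article's code: `rank_history[r][j]` holds the rank of cluster j in run r (rank 1 = most important), and the top k clusters are identified by their mean rank.

```python
import math
import statistics

def instability_index(rank_history, k):
    """I = sqrt((V_(1) + ... + V_(k)) / k), where V_(i) is the variance of
    the rank of the i-th most important cluster across all runs."""
    n_clusters = len(rank_history[0])
    # Variance of each cluster's rank across runs: V_j = Var(r_1j, ..., r_nj).
    variances = [statistics.pvariance([run[j] for run in rank_history])
                 for j in range(n_clusters)]
    # Order clusters by average rank; the lowest mean rank is the most important.
    order = sorted(range(n_clusters),
                   key=lambda j: statistics.mean([run[j] for run in rank_history]))
    return math.sqrt(sum(variances[j] for j in order[:k]) / k)
```

With perfectly stable rankings the index is zero for any k, and it can only grow as less stable, lower-ranked clusters are added.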
After selecting the features from the top two clusters, we train a new random forest on the combined training and validation set and use that to make predictions on the testing set. We can see that cMDA has the best out-of-sample performance on AUC, but that its F1 and Acc underperform the non-clustering methods.

[Figure 2: Instability Analysis for S&P Dataset]
Next, we conduct the analysis on predicting S&P 500 excess returns using economic factors, as discussed in Man and Chan
(2021). The data ranges from January 1945 to December 2019. Excess return is defined as the monthly SPX index return minus the
risk-free rate. The features are a set of fundamental and technical factors that include dividend price ratio (d/p), dividend yield
(d/y), earning price ratio (e/p), dividend payout ratio (d/e), stock variance (svar), book to market (b/m), net equity expansion (ntis),
T-Bill rate (tbl), long term yield (lty), long term return (ltr), term spread (tms), default yield spread (dfy), default return spread (dfr)
and inflation (infl). Fractional differentiation (López de Prado, 2018) is applied to all these features prior to the machine learning
process. The clustering algorithm groups these features into two clusters as shown in Table 5.
Table 5: Feature clusters and importance scores

Cluster       Score   Features
Fundamental   0.63    d/p, d/y, e/p, d/e, svar, ntis, ltr, tms
Technical     0.37    b/m, tbl, lty, dfy, dfr, infl
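Fractional differentiation of order d, applied to the features above, expands (1 − B)^d (B is the backshift operator) into the weights w_0 = 1, w_k = −w_{k−1}(d − k + 1)/k, truncated here to a fixed-width window in the spirit of López de Prado (2018). A small sketch of the idea; the window length and any d value used below are illustrative choices, not parameters from this article:

```python
def fracdiff_weights(d, size):
    # Binomial-expansion weights of (1 - B)^d:
    # w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return w

def fracdiff(series, d, window):
    # Fixed-width-window fractional differencing of a level series:
    # each output is a weighted sum of the last `window` observations.
    w = fracdiff_weights(d, window)
    return [sum(w[k] * series[t - k] for k in range(window))
            for t in range(window - 1, len(series))]
```

Sanity check: d = 1 with a window of 2 reduces to ordinary first differences, while d = 0 returns the series unchanged. A fractional 0 < d < 1 sits in between, removing much of the memory that makes the raw series non-stationary while preserving some of it.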
As these clusters are highly human-interpretable, we again apply descriptive topics to them. The 'Fundamental' cluster contains 8 features and has the higher importance score. The 'Technical' cluster contains 6 features. This cluster can also be called the 'unimportant' cluster, since we only have two clusters, and it is not selected to train the final random forest model.
Figure 2 shows these two clusters are very stable. The instability index remains zero when involving either one or two clusters.
‘Fundamental’ and ‘Technical’ clusters are constantly ranked in the first and second places for all 100 runs.
We split the data into training, validation and testing sets with the periods January 1945-December 2005, January 2006-December 2015 and January 2016-December 2019, respectively. The out-of-sample prediction performance on the testing set, measured by the F1 score, AUC score and Accuracy, is summarised in Table 6. We can see that cMDA outperforms MDA in out-of-sample prediction on all metrics for this dataset.

[Table 6: Prediction Performance Comparison (F1, AUC, Acc)]

Application to Trading Strategy Meta-Labelling

In this section, we apply clustering-based feature selection to a dataset with the labels equal to the sign of the actual historical returns of our proprietary Tail Reaper trading strategy.² We want to see if this algorithm can select stable features and improve the trading performance. This application of financial machine learning is termed 'meta-labelling' (López de Prado, 2018).

The data is from January 2013 to June 2020, with 160 features. We split the data into training/validation/testing sets over the periods 2013-2017, 2018-2019 and 2020. cMDA groups the 160 features into 44 clusters. Among them, eight clusters containing 81 features with above-average importance scores are selected to train a new random forest model. Since the features are proprietary, we do not display the clusters that identify them. Suffice to say that the top two clusters are highly human-interpretable, while the lower-ranked clusters are mixed bags of disparate features.

From Figure 3, the instability index increases with the number of clusters, and the most and second most important clusters are steadily ranked in 1st and 2nd places respectively for all 100 runs. The third most important cluster is not as stable as the first two. Given that the third cluster is a mixed bag of features of uninterpretable theme, this isn't a surprising result.

Table 7: Prediction Performance Comparison

              F1      AUC     Acc
cMDA          0.658   0.672   0.614
cMDA (top 2)  0.595   0.640   0.571
MDA           0.602   0.537   0.529
Full          0.481   0.416   0.414
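The meta-labelling setup itself is simple to express. In this hypothetical sketch (the function names are ours, not from the Tail Reaper code), the primary strategy's historical returns are turned into binary labels, a classifier trained on the selected feature clusters predicts those labels, and the strategy only trades when the predicted probability of profit clears a threshold:

```python
def meta_labels(strategy_returns):
    # Label each period 1 if the primary strategy made money, else 0.
    return [1 if r > 0 else 0 for r in strategy_returns]

def filter_trades(strategy_returns, prob_of_profit, threshold=0.5):
    # Take the trade only when the secondary (meta) model is confident
    # enough; otherwise stay flat and earn 0 for that period.
    return [r if p > threshold else 0.0
            for r, p in zip(strategy_returns, prob_of_profit)]
```

Here `prob_of_profit` would come from, for example, a random forest trained on the selected feature clusters. The meta-model does not generate signals itself; it only vetoes (or sizes) the primary strategy's trades.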
² See www.qtscm.com/accounts for more details.
Conclusions

Ranking a cluster is more stable than ranking a feature, and such stability enhances the model interpretability. It is also easier to interpret the clusters by examining the common characteristics of the features contained within each cluster. For example, for the S&P 500 excess returns dataset, we can identify the top cluster as fundamental indicators and the second-ranked cluster as mainly technical indicators.

The clustering algorithm also improves the predictive performance over non-clustered MDA feature selection on the S&P 500 excess returns dataset and the proprietary Tail Reaper strategy returns dataset, though not on the synthetic datasets. Their predictive performances on the Breast Cancer dataset are similar.

In this article, the clustering algorithm is driven by a correlation-based metric. As the distance matrix needs only to satisfy non-negativity, identity, symmetry and sub-additivity, we may be able to improve the model performance by choosing other information-theoretic metrics which also satisfy these conditions. We also chose hierarchical clustering instead of K-means; we discuss our reasons for doing so in Appendix 1. Further work can also investigate whether clustering can improve the SHAP and LIME feature-selection methods that we compared in Man and Chan (2021).

ACKNOWLEDGEMENTS

We thank Radu Ciobanu, Sayooj Balakrishnan, and Roger Hunter for many useful suggestions and technical assistance.

References

• Hastie, T., R. Tibshirani and J. Friedman (2009) The elements of statistical learning. 2nd Edition, Springer.
• López de Prado, M. (2018) Advances in financial machine learning. John Wiley & Sons.
• López de Prado, M. (2020) Machine learning for asset managers. Cambridge University Press.
• Man, X. and E.P. Chan (2021) The best way to select features: Comparing MDA, LIME and SHAP. Journal of Financial Data Science, Winter 2021. DOI: https://doi.org/10.3905/jfds.2020.1.047
• Rousseeuw, P. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20: 53-65.
• Xiong, H., J. Wu and J. Chen (2008) K-means clustering versus validation measures: a data distribution perspective. IEEE Transactions on Systems, Man, and Cybernetics, Part B 39(2): 318-331.

Appendix 1: Reasons to choose hierarchical clustering

K-means uses the Euclidean distance metric and results in nearly identical cluster sizes³, with a limited number of clusters. By contrast, hierarchical clustering can generate many more clusters, and other forms of distance metrics can be used. For example, Jaccard similarity, which measures the distance between two binary categorical variables, is not a Euclidean metric and cannot be applied to K-means clustering, but it can be used for hierarchical clustering. In the example datasets we studied, some clusters do contain more features than others. The number of features in each cluster should adapt to the nature of the features, and a tendency to produce clusters of equal size is not desirable.

Appendix 2: The algorithm for generating synthetic clusters

Suppose we want to generate a dataset of n samples with K synthetic clusters, m informative features and p noisy features. We can follow the procedure below:

1. Sample K centroids independently from Uniform(-10, 10);
2. The number of informative features per cluster, c₁, c₂, …, c_K, is ⌊m/K⌋ or ⌊m/K⌋+1;
3. For the ith cluster, independently generate cᵢ features by sampling n times per feature from a univariate normal distribution with mean equal to the value of the ith centroid and with standard deviation 0.5. In other words, we draw n×cᵢ random numbers from N(μᵢ, 0.5), where μᵢ is the ith centroid, to populate all the features within the ith cluster for all n samples;
4. For a classification model, randomly assign the label 0 or 1 with probability 0.5 to each sample. Create a random matrix M(m×m) by sampling from Uniform(-1, 1). Form the product of M and the informative feature matrix of each class label, X(m×(n/2)), to create MX(m×(n/2)). Stack the matrices of the two classes to get MX(m×n). In other words, we map two different linear combinations of the informative features to the two class labels, where the coefficients of the linear combinations are random but fixed over the samples with the same label;
5. For a regression model, create a random matrix M(m×m) by sampling from Uniform(-1, 1). Form the product of M and the entire informative feature matrix X(m×n) to create MX(m×n). Create a random vector β(m×1) by sampling from Uniform(0, 100), and then set the labels as y(n×1) = (MX(m×n))ᵀ β(m×1). In other words, we map a linear combination of the informative features to the continuous labels, where the coefficients of the linear combination are random but fixed over all samples;
6. Add the p×n noisy-feature matrix by sampling from a standard normal distribution.

³ Due to the 'uniform effect' proposed and discussed in Xiong, Wu and Chen (2008), K-means tends to generate clusters with relatively uniform sizes.


Not a major top for Nasdaq-100

Bruno Estier CFTe

Bruno Estier is a Global Market Advisor and Technical Analyst coach in Geneva, Switzerland, for professional Traders and Portfolio Managers. Past President of the Swiss Association of Market Technicians (SAMT) for 12 years, he also served as Chairman and as Secretary on the board of directors of IFTA. Bruno founded the French Society of Technical Analysts (AFATE) in 1990. He holds the Diploma from the STA and the professional certifications from IFTA. You can find his work at Bruno Estier Strategic Technicals, www.estier.net/bruno, and [email protected]. Here he shares his US equities outlook, written in November 2020 for Wealthgram.

Introduction

As the US equity market has been undergoing a correction in the last three weeks, and as last month we mentioned that October to April is the seasonally bullish period, a review of the leading sector of the US market, the Nasdaq-100, is necessary to evaluate the strength of its uptrend.

Still in Uptrend

The Relative Strength (RS) of the Nasdaq-100 versus the S&P 500 (dotted green line on the upper panel) has been stalling below its July top, taking the form of a rising triangle, and in Oct 2020 stands at levels well above the low made in early September, though it has flattened. But it still shows an uptrend in Oct, as the chart does not display a lower low. Therefore it is likely that the Technology sector, mainly represented within the Nasdaq-100, is still a leading sector for the US equity market.

The bullishness is not limited to a few large Technology stocks, as we note that the Relative Strength of Small Caps versus the S&P 500 (black line on the upper panel) has been bottoming and rose from mid-September to October 2020, which is a classic bullish sign of widening Breadth. Thus, the pullback of the Nasdaq-100 in Nov, toward the 40-week moving average (9,931), is seen as a pause in the Bull market rather than the beginning of a Bear market. This pullback relieves an overbought situation, which was highlighted by its rise since May 2020 between the first and second Bollinger bands, and by a spike in late August and a retest in October of the second upper Band. The fear can be noted on the VXN (orange dotted line on the upper panel) reaching the previous spike high of September near 41.30%.

However, volatility above 40% in the VXN is rare and often signals a nearby low on the underlying equity index! So overall, in Nov 2020 it may well be time to be contrarian and not to panic along with the classic price momentum indicators, like STOCHASTICS or MACD on the lower panel, which are crossing down. Of course, the Nasdaq-100 needs to display a move up to avoid breaking below the previous low of 10,677, ideally holding above 10,900, the rising former resistance trend line dating from October 2018. Such a rebound in price will validate the ranging pattern between 12,430 and 10,700, which in the medium term will open the door for higher prices toward 14,000. So, trend-followers beware!