Validating Clusters using Hopkins Statistics
Abstract - A novel scheme for cluster validity using a test for the random position hypothesis is proposed. The random position hypothesis is tested against an alternative clustered hypothesis on every cluster produced by a partitioning algorithm. A test statistic such as the well-known Hopkins statistic could be used as a basis to accept or reject the random position hypothesis, which is also the null hypothesis in this case. The Hopkins statistic is known to be a fair estimator of randomness in a data set. The concept is borrowed from the clustering tendency domain and its applicability to validating clusters is shown here using two artificially constructed test data sets.

I. INTRODUCTION

This paper examines a vital issue closely related to the problem of clustering: that of validating the results of clustering procedures. Clustering attempts to partition a data set into self-similar groups, called clusters. The literature abounds with references to clustering procedures and algorithms on a wide variety of data sets. On the other hand, not much attention has been paid to related issues such as clustering tendency and cluster validity. Before subjecting a data set to partitioning, one needs to be confident that the data set exhibits a predisposition to cluster into natural groups, without identifying the groups themselves [1]; this is what constitutes the clustering tendency domain. A data set without inherent natural clusters could be thought of as a random collection of feature vectors, and such random data sets should not be subject to partitioning. Clustering tendency studies could also give information about the general nature of the data and hence indicate which clustering procedure to use, if any, on the data set. Another related issue is the validation of clustering results in a quantitative and objective manner. Clustering tendency studies do not give any information about the number of natural clusters in the data; they only provide (almost always subjective) information about the presence of natural groupings. Furthermore, any partitioning algorithm will find any pre-specified number of clusters (ranging from 2 to n-1, where n is the number of patterns to be classified), irrespective of the actual number of natural groups present. The best and most natural partition is the one that captures the inherent natural grouping in the data, and cluster validity aims at finding the most natural partition among the many partitions generated by the clustering scheme. The issue of the unknown number of clusters also falls under the purview of cluster validity studies.

Statistical measures used to validate clustering results are based on the premise that problems of cluster validity are inherently statistical [1]. A result is usually tested by building an alternative hypothesis and comparing it against a null hypothesis, H0. The null hypothesis is a statement of randomness and could be based on a random graph, a random label or a random position hypothesis. An alternative hypothesis, H1, is then a statement of orderliness and captures the intent of the phrase "the data are clustered". The test is then one of comparing H0 with H1 based on the value of some test statistic T, and deciding whether to accept or reject H0 with a certain degree of certainty. Hubert's Γ statistic [2] and the Goodman-Kruskal γ statistic [3] are well-known examples of test statistics used for cluster validity studies. The Γ statistic compares a clustering structure, which is the result of a clustering scheme, to an a-priori structure, such as one generated by using pre-defined category labels. The γ statistic measures the rank correlation between two ordinal sequences of numbers, one of which might be derived from the clustering structure and the other from an a-priori labeling scheme. For a detailed discussion of statistical cluster validity statistics, tests and indices, the reader is referred to [1].

Several non-statistical cluster validity indices and procedures were independently developed within the fuzzy clustering community to validate partitions obtained using fuzzy clustering algorithms. Prominent among these are the partition coefficient [4], classification entropy [5], the proportion exponent [6], the uniform data functional [7], the nonfuzziness index [8], the information ratio [9], the separation ratio [10] and, perhaps the best known of the validity indices, the Xie-Beni index [11]. These criteria measure the fuzzy overlap between clusters, with or without regard to the geometrical properties of the data set. Some indices work directly on the fuzzy clustering output, while others first convert the results to a hard c-partition before evaluating it. Specialized partitioning schemes such as shell partitioning require the use of specialized validity indices such as the partition density and the shell thickness measure [12]. More recently, advances have been made in the visual assessment of clustering using intensity displays [13,14]. Some indices are known to function poorly across a wide range of data sets, while others are specifically suited to a particular type of data set. In other words, their dependence on the data set and on the type of clustering scheme employed seriously hinders the practical usage of fuzzy validity indices.

In this paper, we borrow a statistical concept from the field of clustering tendency and show its applicability to validating the clustering results generated in a partitioned data set.
FUZZ-IEEE 2004
II. SPARSE SAMPLING TESTS AND THE HOPKINS STATISTIC

The problem of testing for clustering tendency can also be described as a problem of testing for spatial randomness. Unlike statistic-based cluster validity measures, a test for clustering tendency is stated in terms of an internal criterion, and no a-priori information is brought into the analysis [1]. The null hypothesis in most cases is a random position hypothesis, such as:

H0: The patterns are generated by a Poisson process with an intensity of λ patterns per unit volume.

Under H0, the number of patterns falling in a region of volume V has a Poisson distribution with mean λV, and since λ is constant and the numbers of patterns falling in disjoint regions of V are independent random variables, the Poisson process is a reasonable model for randomness (absence of structure). Sparse sampling tests have been shown to have high power against clustered alternative hypotheses. On the other hand, tests based on small interpattern distances (such as nearest neighbor distance tests) have low power against clustered alternatives, primarily because such tests depend heavily on the intensity λ of the Poisson process assumed under H0. Other tests for spatial randomness include scan tests [15], quadrat analysis [16] and second moment structure tests [17].

Sparse sampling tests are based on sampling origins randomly identified in a sampling window. Several tests involving sampling origins have been proposed in the literature, based on a multitude of test statistics such as the Hopkins [18], Holgate [19,20], T-square [21], Eberhardt [22] and Cox-Lewis [23] statistics. These statistics have been compared, but little is known about any one of them outperforming the others. The Hopkins statistic is easy to use and comprehend, and has been shown to be as good as the Holgate statistic [24].

Let X = {x_i | i = 1 to n} be a collection of n patterns in a d-dimensional space such that x_i = (x_i1, x_i2, ..., x_id). Also, let Y = {y_j | j = 1 to m} be m sampling origins placed at random in the d-dimensional sampling window, m << n. A sampling window can be thought of as a subspace of the entire d-dimensional sample space. Two types of distances are defined: u_j as the minimum distance from y_j to its nearest pattern in X, and w_j as the minimum distance from a randomly selected pattern in X to its nearest neighbor (m out of the available n patterns are marked at random for this purpose). The Hopkins statistic in d dimensions is defined as

    H = Σ_{j=1}^{m} u_j^d / (Σ_{j=1}^{m} u_j^d + Σ_{j=1}^{m} w_j^d)    (1)

This statistic compares the nearest-neighbor distribution of randomly selected locations to that of the randomly selected patterns. Under the null hypothesis H0, the distances from the sampling origins to their nearest patterns should, on the average, be the same as the interpattern nearest neighbor distances, implying randomness, and hence H should be about 0.5. However, when the patterns are aggregated or clustered into clouds, the sampling-origin-to-pattern nearest neighbor distances should, on the average, be larger than the randomly selected interpattern nearest neighbor distances, so H should be larger than 0.5, almost equal to 1.0 for very well defined clustered data. By the same reasoning, H is supposed to be much less than 0.5 for regularly spaced data, i.e., data that are neither clustered nor random. To ensure that no pattern is the neighbor of more than one sampling origin, m is chosen to be substantially less than n; it has been suggested that m < 0.1n [17]. With such a condition it can be ensured that all 2m nearest neighbor distances are statistically independent, and H has a beta distribution with parameters (m, m), independent of both the intensity λ and the dimensionality d of the data set. The distribution and the density function of each of the terms in (1) are also known; the individual sums each have a gamma distribution (assuming again that the nearest neighbor distances are all independent random variables). Our studies on random data sets, clustered data sets and regularly spaced data sets show that the Hopkins statistic H consistently has a value of around 0.5, 0.7-0.99 and 0.01-0.3 respectively, and is hence a powerful estimator of randomness.

25-29 July, 2004 Budapest, Hungary

III. RANDOM POSITION TESTS FOR CLUSTER VALIDITY

A cluster is characterized by two main properties: compactness and isolation [1]. A natural cluster is unusually compact and unusually isolated. A clustered data set is ordered because of the presence of natural clusters; in the absence of natural groups, it is a random collection of data points, approximating a Poisson process distribution. In this section we show the applicability of the random position hypothesis and use the Hopkins statistic as a measure for cluster validity. Suppose a data set with 3 compact and isolated clusters, as shown in fig. 1, is subject to partitioning. At c = 2, where c is the number of clusters to be found, most partitioning schemes would club clusters II and III together as one cluster, say cluster A, and identify cluster I as an independent cluster, B. A random position hypothesis test would lead to the rejection of H0 for cluster A because it is still a collection of clustered data points. However, it would be difficult to reject H0 in the case of cluster B because it is the natural cluster, cluster I. A natural cluster, hence, apart from being isolated and compact, is also random within itself. Intracluster data might also exhibit some kind of mutual repulsion, as in fig. 1, and in such a case the null hypothesis H0 should include a statement conforming to a random position as well as a non-random, non-clustered (regularly spaced) distribution. The null hypothesis then cannot be rejected if the data are either random or regularly spaced. However, in real world clustering problems this is rarely the case, and so a random position hypothesis alone should suffice. At c = 3, however, all three natural clusters are most likely to be identified during partitioning, and hence it would be difficult to reject H0 for all three clusters identified. The rejection or acceptance of H0 depends on the value of the Hopkins statistic. At any value of c > 3, any partitioning algorithm would either subdivide or recombine the clusters that were produced by partitioning at c = 3, and hence one might not have reason to reject H0 for any of the clusters (in case clusters are subdivided further), or in some cases just enough evidence to reject H0 (in case clusters are recombined) for some of the generated clusters. The test for cluster validity based on the random position hypothesis can hence be stated as follows: accept the lowest value of c at which it is impossible to reject the null hypothesis H0 for all the clusters, the test being applied one cluster at a time. Let H_i be the value of the Hopkins statistic for the ith cluster at a particular level of clustering; the average value of the statistic is then

    H_av = (1/c) Σ_{i=1}^{c} H_i, for fixed c    (2)

A non-rejection of H0, as given in section II, would mean that, on average, the value of H_av is very close to 0.5. Proceeding from c = 2 to c = n - 1, the lowest value of c that generates H_av ≈ 0.5 most likely generates a partition that identifies the natural clusters in the data.

IV. SIMULATIONS AND RESULTS

In this section we present some cluster validity studies done on two artificially produced two-dimensional data sets: (1) 4-cloud data comprising 160 patterns, and (2) 7-cloud data comprising 1400 patterns. The patterns are generated within a pre-specified window using the C++ rand function, which produces pseudo-random numbers using a seed initialized by the CPU clock time. The data sets are partitioned using FCM [25], ranging from c = 2 to c = 8 for the 4-cloud data set and from c = 2 to c = 11 for the 7-cloud set.

Fig. 1. A three cluster data set: (a) the three natural clusters; (b) clusters identified at c = 2.

The fuzzy partitions are first converted into hard partitions, and then all the generated clusters are subject to the random position hypothesis test. The sampling window in all cases encompasses the entire cluster set, and the number of sampling origins, m, is chosen to be n/10 (or the closest integer value), where n is the number of patterns in the cluster being investigated. In case there were fewer than 10 patterns assigned to a cluster, we choose m = 1. The value of the Hopkins statistic, H, for each of the clusters is then calculated and the process repeated over 5000 runs. The hard partition is retained if H does not fluctuate more than 0.05 over these 5000 runs; in case the fluctuation is more than 0.05, the FCM partitioning is repeated to get a better resultant hard partition. The average of these 5000 values constitutes a consolidated value of the Hopkins statistic for that particular cluster. The average partition Hopkins statistic is then calculated using (2) and the results plotted against the number of clusters, c.

The 4-cloud data is shown in fig. 2. The resultant average partition Hopkins statistic, H_av, is plotted against the number of clusters, c, and shown in fig. 3. As can be seen, the null hypothesis cannot be accepted for c = 2 (H_av = 0.76) and c = 3 (H_av = 0.64). However, it can be accepted with a fair degree of confidence for c = 4 (H_av = 0.47) and thenceforth. Hence, according to the cluster validity criterion enunciated in section III, one could accept c = 4 as the partition identifying the natural groupings in the data, which is indeed the case.

Fig. 2. The 160 pattern 4-cloud data set (3 clusters of 50 patterns each and 1 cluster of 10 patterns).

Fig. 3. The average partition Hopkins statistic, H_av, versus the number of clusters, c, for the 4-cloud data.

The 7-cloud data set is different from the 4-cloud data in that the former is not as well separated as the latter, as shown in fig. 4. The applicability of the Hopkins statistic as a cluster validity index, and of the random position hypothesis test as an appropriate test for cluster validity, is illustrated in a much broader sense in this case. The two clusters in the lower left hand corner of fig. 4 overlap each other and can be argued to be one big tilted 8-shaped cluster. This can be seen from the plot of the average partition Hopkins statistic, H_av, versus the number of clusters in fig. 5; it is difficult to choose between c = 6 (H_av = 0.53) and c = 7 (H_av = 0.48). Other values of c can be rejected outright. Hence the Hopkins statistic does reflect the nuances and subtleties in the data set. In the absence of the overlap, the average partition Hopkins statistic would have indicated a clear preference for c = 7, suggesting natural grouping at that level of partitioning, and would have rightly rejected H0 for at least one cluster in the range 2 ≤ c ≤ 6; c = 7 would then have been the smallest value of c where one cannot reject H0 for any of the 7 clusters.

Fig. 4. The 7-cloud data set (1400 patterns).

Fig. 5. The average partition Hopkins statistic, H_av, versus the number of clusters, c, for the 7-cloud data.
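The procedure of eqs. (1) and (2) can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code: the bounding-box sampling window, the function names and the acceptance tolerance `tol` around 0.5 are our own assumptions (the paper accepts values "very close to 0.5" without fixing a numerical threshold).

```python
import numpy as np

def hopkins(X, m=None, rng=None):
    """Hopkins statistic in d dimensions, eq. (1):
    H = sum(u_j^d) / (sum(u_j^d) + sum(w_j^d)).
    H is about 0.5 for random data, near 1.0 for clustered data,
    and well below 0.5 for regularly spaced data."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if m is None:
        m = max(1, n // 10)  # m < 0.1 n, as suggested in [17]
    # Sampling window assumed here to be the bounding box of the data.
    lo, hi = X.min(axis=0), X.max(axis=0)
    origins = rng.uniform(lo, hi, size=(m, d))
    # u_j: minimum distance from sampling origin y_j to its nearest pattern in X.
    u = np.linalg.norm(origins[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    # w_j: minimum distance from a randomly marked pattern to its nearest neighbor.
    marked = rng.choice(n, size=m, replace=False)
    dw = np.linalg.norm(X[marked][:, None, :] - X[None, :, :], axis=2)
    dw[np.arange(m), marked] = np.inf  # exclude each marked pattern's self-distance
    w = dw.min(axis=1)
    return float(np.sum(u**d) / (np.sum(u**d) + np.sum(w**d)))

def smallest_valid_c(h_av, tol=0.1):
    """Selection rule of section III: the lowest c whose average partition
    Hopkins statistic H_av (eq. (2)) lies within `tol` of 0.5, i.e. the
    lowest c at which H0 cannot be rejected for the partition."""
    for c in sorted(h_av):
        if abs(h_av[c] - 0.5) <= tol:
            return c
    return None
```

As a usage example, feeding `smallest_valid_c` the H_av values reported for the 4-cloud data in fig. 3 ({2: 0.76, 3: 0.64, 4: 0.47}) selects c = 4, matching the partition accepted in section IV.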
V. DISCUSSION AND CONCLUSIONS
… statistic on very small data sets, where each cluster might contain just 4-5 patterns. At such intensities of pattern distribution, the theory of randomness does not hold, and hence the random position hypothesis is meaningless. However, this is a blessing in disguise, because very small data sets are rarely encountered in real world situations; in fact, it could be claimed that the random position test for cluster validity produces better results as the data set gets larger in size (with increasing n). The restriction, however, is that the data should be isolated in groups and compact within the groups. The theory can easily be extended to data in more than two dimensions, because the Hopkins statistic is essentially defined in d dimensions. One could even use other statistics, such as the Cox-Lewis statistic [23], which extends better to d dimensions (d > 2) than the Hopkins statistic. The random position hypothesis test for cluster validity using an index such as the Hopkins statistic not only tends to provide an answer to the ever-elusive question "how many clusters to find?" but also provides a validation measure for individual clusters. If H0 cannot be rejected for any of the clusters at the lowest c = c* partitioning, then one could argue that all the clusters found are in fact the true natural clusters in the data. Not intended to replace the existing cluster validity techniques and indices, the test for the random position hypothesis is an interesting and promising addition to the repertoire of cluster validity methodologies.

Future research might focus on applying the test directly to the fuzzy partition, instead of having to defuzzify the c-partition first. The fuzzy membership function can be used to weight some (or all) of the distances used in the calculation of the Hopkins statistic. The complete partition could then be evaluated in one go, instead of one cluster at a time.

REFERENCES

[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[2] L. J. Hubert and J. Schultz, "Quadratic assignment as a general data-analysis strategy," British J. Math. Statist. Psych., vol. 19, pp. 191-241, 1976.
[3] L. A. Goodman and W. H. Kruskal, "Measures of association for cross-classifications," JASA, vol. 49, pp. 732-764, 1954.
[4] J. C. Bezdek, "Numerical taxonomy with fuzzy sets," J. Math. Biol., vol. 1, pp. 57-71, 1974.
[5] J. C. Bezdek, "Mathematical models for systematics and taxonomy," in Proc. 8th Int. Conf. Numerical Taxonomy, Ed. G. Estabrook, San Francisco, CA: Freeman, pp. 143-166, 1975.
[6] M. P. Windham, "Cluster validity for fuzzy clustering algorithms," Fuzzy Sets Syst., vol. 5(2), pp. 177-185, 1981.
[7] M. P. Windham, "Cluster validity for the fuzzy c-means clustering algorithm," IEEE Trans. Patt. Anal. Machine Intell., vol. PAMI-4(4), pp. 357-363, 1982.
[8] G. Libert and M. Roubens, "New experimental results in cluster validity of fuzzy clustering algorithms," in New Trends in Data Analysis and Applications, Eds. J. Janssen, J.-F. Marcotorchino and J.-M. Proth, Amsterdam, The Netherlands: North Holland, pp. 205-218, 1983.
[9] M. P. Windham, H. Bock and H. F. Walker, "Clustering information from convergence rate," in Proc. 2nd Conf. Int. Federation Classification Soc., Washington, D.C., pp. 143, 1989.
[10] R. Gunderson, "Application of fuzzy ISODATA algorithms to star tracker pointing systems," in Proc. 7th Triennial World IFAC Congress, Helsinki, Finland, pp. 1319-1323, 1978.
[11] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Trans. Patt. Anal. Machine Intell., vol. PAMI-13(8), pp. 841-847, 1991.
[12] R. N. Davé, "Validating fuzzy partitions obtained through c-shells clustering," Patt. Recog. Letters, vol. 17, pp. 613-623, 1996.
[13] J. C. Bezdek and R. J. Hathaway, "VAT: A tool for visual assessment of (cluster) tendency," in Proc. IJCNN, IEEE Press, Piscataway, NJ, pp. 2225-2230, 2002.
[14] J. C. Bezdek and R. J. Hathaway, "Visual cluster validity displays for prototype generator clustering methods," in Proc. IEEE Int. Conf. Fuzzy Syst. (FUZZ-IEEE 2003), pp. 875-880, 2003.
[15] J. I. Naus, "Approximations for distributions of scan statistics," JASA, vol. 77, pp. 177-183, 1982.
[16] R. Mead, "A test for spatial pattern at several scales using data from a grid of contiguous quadrats," Biometrics, vol. 30, pp. 295-308, 1974.
[17] B. D. Ripley, "Modelling spatial patterns," J. Royal Statist. Soc. B, vol. 39, pp. 172-212, 1977.
[18] B. Hopkins, "A new method of determining the type of distribution of plant individuals," Annals of Botany, vol. 18, pp. 213-226, 1954.
[19] P. Holgate, "Some new tests of randomness," J. Ecol., vol. 53, pp. 261-266, 1965.
[20] P. Holgate, "Tests of randomness based on distance measures," Biometrika, vol. 52, pp. 345-, 1965.
[21] J. E. Besag and J. T. Gleaves, "On the detection of spatial pattern in plant communities," Bull. Int. Statist. Inst., vol. 45, pp. 153-158, 1973.
[22] L. L. Eberhardt, "Some developments in distance sampling," Biometrics, vol. 23, pp. 207-216, 1967.
[23] T. F. Cox and T. Lewis, "A conditioned distance ratio method for analyzing spatial patterns," Biometrika, vol. 63, pp. 483-491, 1976.
[24] E. Panayirci and R. C. Dubes, "A test for multidimensional clustering tendency," Patt. Recog., vol. 16(4), pp. 433-444, 1983.
[25] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, NY, 1981.