CLUBS+ is a centroid-based clustering algorithm that offers major advantages over current algorithms for clustering around centroids:
  • Totally unsupervised
  • Significantly faster
  • High-quality results

An example of how CLUBS+ works on a two-dimensional data set with 8 clusters (a): data are partitioned in the divisive step (b), then outliers are detected in the intermediate refinement (c), clusters wrongly separated in the divisive step are recombined in the agglomerative step (d), and the shape of the clusters is made spherical/elliptical in the final refinement (e).
CLUBS+ works in four steps, combining features of hierarchical and partition-based algorithms:
  1. a binary space partition of the data is created, obtaining a set of hyper-rectangular blocks
  2. the partitioning is refined by identifying blocks possibly containing only noise and by reassigning data to blocks
  3. pairs of blocks are possibly merged, correcting the effects of possible bad splits made in the divisive step
  4. merged blocks are refined by reassigning data to the final clusters
The within-cluster sum of squares (WCSS) is used to guide the partitioning and merging of blocks, and its analytical properties make the process very efficient.
The complexity of CLUBS+ is O(n × k), where n is the number of points and k is the number of clusters found in step 1.
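
The following is a minimal sketch (not the CLUBS+ source code; the class BlockStats and its methods are hypothetical) of the analytical property mentioned above: the WCSS of a block can be computed from simple sufficient statistics, namely the number of points, the per-dimension linear sum, and the per-dimension sum of squares, so that candidate splits and merges can be scored without rescanning the points.

    // Hypothetical sketch: WCSS of a block from its sufficient statistics.
    // WCSS = sum over dimensions of ( sumSq[d] - linSum[d]^2 / n ), i.e. the
    // within-cluster sum of squared distances of the points to the centroid.
    public final class BlockStats {
        final long n;            // number of points in the block
        final double[] linSum;   // per-dimension sum of coordinates
        final double[] sumSq;    // per-dimension sum of squared coordinates

        BlockStats(long n, double[] linSum, double[] sumSq) {
            this.n = n;
            this.linSum = linSum;
            this.sumSq = sumSq;
        }

        /** Within-cluster sum of squares of this block. */
        double wcss() {
            double total = 0.0;
            for (int d = 0; d < linSum.length; d++) {
                total += sumSq[d] - (linSum[d] * linSum[d]) / n;
            }
            return total;
        }

        /** Statistics of the union of two blocks, useful when evaluating a merge. */
        static BlockStats merge(BlockStats a, BlockStats b) {
            double[] ls = new double[a.linSum.length];
            double[] ss = new double[a.sumSq.length];
            for (int d = 0; d < ls.length; d++) {
                ls[d] = a.linSum[d] + b.linSum[d];
                ss[d] = a.sumSq[d] + b.sumSq[d];
            }
            return new BlockStats(a.n + b.n, ls, ss);
        }
    }

In particular, the statistics of two sub-blocks add up to those of their parent block, so the WCSS variation caused by a split or a merge follows directly from these sums.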

CLUBS+ inherits from the previous version of CLUBS [Masciari et al., 2013] the idea of combining a divisive and an agglomerative approach. However, CLUBS+ provides several major improvements:
  • The criterion used to stop both phases, based on the CH-index (Calinski-Harabasz index), is more accurate than the criterion used by CLUBS, which was based on the location of the elbow in the curve of the SSQ vs. the number of clusters (a sketch of the CH-index is shown after this list)
  • CLUBS+ can isolate noise from clusters
  • CLUBS+ introduces some optimizations that make the algorithm more efficient
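
As a reference for the first point above, the sketch below shows how a CH-index score could be computed for a given partition of the data into clusters; the class CHIndex and its methods are illustrative and are not taken from the CLUBS+ implementation.

    import java.util.List;

    // Hypothetical illustration of the Calinski-Harabasz (CH) index:
    // CH(k) = (BCSS / (k - 1)) / (WCSS / (n - k)),
    // where BCSS is the between-cluster and WCSS the within-cluster sum of squares.
    // Requires k >= 2 clusters and n > k points.
    public final class CHIndex {

        /** points: the whole data set; clusters: its partition into k groups. */
        static double chIndex(List<double[]> points, List<List<double[]>> clusters) {
            int dim = points.get(0).length;
            int n = points.size();
            int k = clusters.size();
            double[] globalMean = mean(points, dim);

            double wcss = 0.0, bcss = 0.0;
            for (List<double[]> cluster : clusters) {
                double[] c = mean(cluster, dim);
                for (double[] p : cluster) wcss += sqDist(p, c);
                bcss += cluster.size() * sqDist(c, globalMean);
            }
            return (bcss / (k - 1)) / (wcss / (n - k));
        }

        static double[] mean(List<double[]> pts, int dim) {
            double[] m = new double[dim];
            for (double[] p : pts)
                for (int d = 0; d < dim; d++) m[d] += p[d];
            for (int d = 0; d < dim; d++) m[d] /= pts.size();
            return m;
        }

        static double sqDist(double[] a, double[] b) {
            double s = 0.0;
            for (int d = 0; d < a.length; d++) {
                double diff = a[d] - b[d];
                s += diff * diff;
            }
            return s;
        }
    }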

Experiments

We conducted extensive experiments to evaluate the performance of CLUBS+ in terms of (i) running time, (ii) quality of results, (iii) impact of dimensionality and (iv) effect of noise.
Links to the results of some of the experiments we conducted are reported in the following:

Source code

CLUBS+ was implemented in Java. The source code can be downloaded from here. Besides the implementation of CLUBS+, we provide the implementation of the previous version of CLUBS and of three versions of k-means using different seeding strategies: (i) random choice, (ii) k-means++, and (iii) k-meansCS, in which the centroids are initialized using the output of the agglomerative phase of CLUBS+.
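
As a quick illustration of the second seeding strategy, the sketch below shows one possible implementation of k-means++ seeding, in which each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen; the class KMeansPlusPlusSeeding and its methods are hypothetical and are not the code included in the download.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Hypothetical illustration of k-means++ seeding: each new centroid is drawn
    // with probability proportional to its squared distance from the closest
    // centroid selected so far (D^2-weighted sampling).
    public final class KMeansPlusPlusSeeding {

        static List<double[]> chooseSeeds(List<double[]> points, int k, Random rnd) {
            List<double[]> seeds = new ArrayList<>();
            seeds.add(points.get(rnd.nextInt(points.size())));   // first seed: uniform

            double[] minSqDist = new double[points.size()];
            while (seeds.size() < k) {
                double total = 0.0;
                for (int i = 0; i < points.size(); i++) {
                    minSqDist[i] = sqDistToNearest(points.get(i), seeds);
                    total += minSqDist[i];
                }
                double r = rnd.nextDouble() * total;              // weighted draw
                int chosen = 0;
                for (double acc = 0.0; chosen < points.size(); chosen++) {
                    acc += minSqDist[chosen];
                    if (acc >= r) break;
                }
                seeds.add(points.get(Math.min(chosen, points.size() - 1)));
            }
            return seeds;
        }

        static double sqDistToNearest(double[] p, List<double[]> seeds) {
            double best = Double.MAX_VALUE;
            for (double[] s : seeds) {
                double d = 0.0;
                for (int j = 0; j < p.length; j++) {
                    double diff = p[j] - s[j];
                    d += diff * diff;
                }
                best = Math.min(best, d);
            }
            return best;
        }
    }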

The implementation of BIRCH [Zhang et al., 1996] used in our experiments is available in the NU-MineBench suite v. 3.0.1 [NU-MineBench].

The generator used for the synthetic data sets is available here. By running GenerateGaussian.java, changing nothing but the output path, you will obtain the same data sets that were used for the experiments.
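
For readers who only need quick test data, the sketch below shows how a simple Gaussian cluster generator could work (random centers, normally distributed points around each center, CSV rows on standard output); it is not GenerateGaussian.java, and all parameter values in it are arbitrary.

    import java.util.Random;

    // Hypothetical sketch of a Gaussian cluster generator: k random centers are
    // drawn in a square domain, and each point is sampled from a normal
    // distribution around one of them. Output: one CSV row per point on stdout.
    public final class GaussianClustersSketch {
        public static void main(String[] args) {
            int k = 8, dim = 2, pointsPerCluster = 1000;      // arbitrary parameters
            double domain = 100.0, stdDev = 2.0;
            Random rnd = new Random(42);

            double[][] centers = new double[k][dim];
            for (int c = 0; c < k; c++)
                for (int d = 0; d < dim; d++)
                    centers[c][d] = rnd.nextDouble() * domain;

            for (int c = 0; c < k; c++) {
                for (int i = 0; i < pointsPerCluster; i++) {
                    StringBuilder line = new StringBuilder();
                    for (int d = 0; d < dim; d++) {
                        double coord = centers[c][d] + rnd.nextGaussian() * stdDev;
                        if (d > 0) line.append(',');
                        line.append(coord);
                    }
                    System.out.println(line);                 // write the CSV row
                }
            }
        }
    }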

References

[Fränti et al.] Iterative shrinking method for clustering problems. Pattern Recognition 2006, 39(5)

[KDDCup99] KDD Cup 1999 Data

[Masciari et al., 2013] A New, Fast and Accurate Algorithm for Hierarchical Clustering on Euclidean Distances. PAKDD (2) 2013

[NU-MineBench] NU-MineBench suite v. 3.0.1 download page

[Spellman et al., 1998] Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9(12)

[Taxi] NYC Taxi Trips

[Waterloo] University of Waterloo Image Repository

[Zhang et al., 1996] BIRCH: an efficient data clustering method for very large databases. SIGMOD 1996