OpenSubspace: Weka Subspace-Clustering Integration

An Open Source Framework for Evaluation and Exploration of Subspace Clustering Algorithms in WEKA

Emmanuel Müller, Ira Assent, Stephan Günnemann, Timm Jansen and Thomas Seidl



Subspace clustering and projected clustering are recent research areas for clustering in high dimensional spaces. As the field is rather young, there is a lack of comparative studies on the advantages and disadvantages of the different algorithms. Part of the underlying problem is the lack of available open source implementations that could be used by researchers to understand, compare, and extend subspace and projected clustering algorithms. We propose OpenSubspace, an open source framework that meets these requirements. OpenSubspace integrates state-of-the-art performance measures and visualization techniques to foster research in subspace and projected clustering. We currently use this framework both in our lectures for teaching and in our research projects for experimnet evaluation. Our recent evaluation study published at VLDB 2009 is based on this framework. For further details please refer to our paper and to the supplementary material to this evaluation study. There, you can also find further details about possible parametrization of the underlying algorithms for running experiments.


Upcoming Events:
Currently, we are organizing a workshop on subspace clustering topics. The 2nd MultiClust Workshop on Discovering, Summarizing and Using Multiple Clusterings will be held in conjunction with ECML PKDD 2011. The aim of this workshop is to establish a venue for the growing community interested in multiple clustering solutions. It should increase the visibility of the topic itself but also bridge it to closely related research areas such as subspace clustering, projected clustering, ensemble clustering, co-clustering, clustering with constraints, and frequent pattern mining. As a platform for exchange of ideas, the workshop should attract both newcomers and experts in this area.





Subspace Clustering

While subspace clustering is a rather young area that has been researched for only one decade, several distinct paradigms can be observed in the literature. Our system includes representatives of these paradigms to provide an overview over the techniques available. We provide implementations of the most recent approaches from different paradigms:
Clustering
  • Cell-based subspace clustering discretizes the data space for efficient detection of dense grid cells in a bottom-up fashion. It was introduced in the CLIQUE approach which exploits monotonicity on the density of grid cells for pruning. SCHISM extends CLIQUE using a variable threshold adapted to the dimensionality of the subspace as well as efficient heuristics for pruning. In contrast, DOC and MINECLUS use variable cells represented by hypercubes.
  • Density-based subspace clustering defines clusters as dense areas separated by sparsely populated areas. In SUBCLU, a density monotonicity property is used to prune subspaces in a bottom-up fashion. FIRES extends this paradigm by using variable neighborhoods and an approximative heuristicts for efficient computation. In INSCY we use dimensionality unbias density, normalized with respect to the dimensionality of the subspace and in addition in-process pruning of redundant subspace clusters achieves meaningful result sizes.
  • Clustering oriented methods optimize the overall clustering result. PROCLUS extends the k-medoid algorithm by iteratively refining a full-space k-medoid clustering in a top-down manner. P3C combines one-dimensional cluster cores to higher-dimensional clusters bottom-up. STATPC uses a statistical test to remove redundant clusters out of the result.

    Acknowledgments for algorithm contributions:

    We thank the authors of SUBCLU, FIRES and MINECLUS for providing us with their original implementations.

Parameter Bracketing

Parameter setting is in general a difficult task for most data mining algorithms. This is especially the case for unsupervised techniques like clustering, where typically no prior knowledge about the data is available. This inherent problem of clustering is even more present in subspace clustering as the user has to provide parametrization for detecting clusters in different subspaces. In general the problem can be solved by guessing a parameter setting, looking at the result and then trying to choose a better parameter setting. To speed up this tedious process for users and give them more information to base their parameter choice on, the system visualizes a set of different subspace clusters at once, called bracketing of different parameter settings for direct feedback. This means that users obtain a series of MDS plots from which they pick the most appropriate one(s) for subsequent runs of the subspace clustering algorithms. By directly comparing the results of different parameter settings, parametrization is no longer a guess, but becomes an informed decision based on the visual analysis of the effects of parameters. Moreover, this process is far more comfortable for users and allows reaching the desired subspace clustering result in fewer steps.

Evaluation techniques

Quality of clustering or classification is usually measured in terms of accuracy, i.e. the ratio of correctly classified or clustered objects. For clustering, the "ground truth", i.e. the true clustering structure of the data, is usually not known. In fact, it is the very goal of clustering to detect this structure. As a consequence, clustering algorithms are often evaluated manually, ideally with the help of domain experts. However, domain experts are not always available, and they might not agree on the quality of the result. Their assessment of the clustering result is necessarily only based on the result itself, it cannot be compared to the "optimum" which is not known. Moreover, manual evaluation does not scale to large datasets or clustering result outcomes.
For more realistic evaluation of clustering algorithms, large scale analysis is therefore typically based on pre-labelled data, e.g. from classification applications. The underlying assumption is that the clustering structure typically reflects the class label assignment. At least for relative comparisons of clustering algorithms, this provides measures of the quality of the clustering result.

Our system provides the measures proposed in recent subspace clustering publications. In the figure below we present the evaluation output with various measures for comparing subspace clustering results.
  • F1-value: The F1-value is commonly used in evaluation of classifiers and recently also for subspace or projected clustering as well. It is computed as the harmonic mean of recall ("are all clusters detected?") and precision ("are the clusters accurately detected?"). The F1-value of the whole clustering is simply the average of all F1-values.
  • Entropy and Coverage: Corresponding roughly to the measures of precision and recall, entropy accounts for purity of the clustering, while coverage measures the size of the clustering, i.e. the percentage of objects in any subspace cluster. The System provides both coverage and entropy (for readability, inverse entropy as a percentage).
  • Accuracy: Accuracy of classifiers (e.g. C4.5 decision tree) built on the detected patterns compared with the accuracy of the same classifier on the original data is another quality measure. It indicates to which extend the subspace clustering successfully generalizes the underlying data distribution.
  • CE and RNIA: Up to now no measure accounts for the relevant dimensions of a subspace cluster. However for synthetic data sets this information is available. CE and RNIA use this additional information to compare not only the object clusters but also the detected subspaces.

Overview Browsing

Interactive exploration starts from an overview of all mined subspace clusters in which the user can browse. The automatically detected patterns are thus presented to the user for a general impression of the result and a comparison of the resulting subspace clusters. As clusters are detected in arbitrary subspaces, they cannot be compared based on the full space dimensionality. The system thus incorporates a distance function that takes the main characteristics of subspace clusters, their subspace dissimilarity and object dissimilarity into account for visualization in an MDS plot. Based on such an overall distance function, subspace clusters can be intuitively represented as circles in a 2d or 3d space. This approximation of the original high dimensional information to a 2d or 3d representation, allows human users to easily understand the result just by the visual impression. This MDS information gets enriched by additional visual features. The size of a subspace cluster is represented as the diameter of the circle. Its dimensionality is encoded by the color of the circle. This information allows users to identify similar subspace clusters, those clusters of similar dimensionality, or of similar size, or to study the overall distribution of these characteristics in the result for further analysis.
For a more detailed analysis of two different parameter settings the user can select two clusterings out of the presented series of MDS plots by clicking on them in the bracketing representation. These two subspace clusterings are then presented as larger plots in the lower part of the cluster overview screen. Once again, detailed information for the subspace clusters can be obtain by picking individual subspace clusters.



Subspace cluster details

(not yet implemented)
Selecting a subspace cluster in this 2d or 3d representation by clicking on it, the user gets more detailed information which is presented in a second pop-up window. In this window, the data distribution inside the selected subspace cluster is visualized via box-plots. For each dimension the median, the 25/75% quantiles and minimum/maximum values are presented. As a characteristic information for subspace clusters, the relevant dimensions, i.e. the respective subspace projections, are highlighted in red. For the irrelevant dimensions, that are not part of the subspace projection, we may observe a scattered distribution in the values, while in the relevant dimensions the data is clustered to a tighter value range. Having this information at hand, the user can investigate the quality and internal structure of the grouping of the objects in such subspace clusters and thus discover knowledge about the underlying structure of the data.

3d Browsing

The overview browsing provides two static 2-dimensional MDS plots. These static views provide a fixed perspective for easy comparison. For in-depth browsing, where focusing to different parts of the subspace clustering is of interest, a flexible navigation through MDS plots is provided. 3-dimensional MDS plot browsing allows users to shift, rotate and zoom into the MDS plot using standard 2-dimensional input devices or 3-dimensional input devices that allow for even more intuitive navigation. The user may thus identify similar or dissimilar subspace clusters that are of specific interest.

Object Ranking

Upon selection of an interesting subspace clustering from the 2d and 3d overview browsing, the user can analyze this clustering in-depth on the object level. In the overview MDS plots, the objects are summarized by circles. For detail information on the objects and their characteristics, the system provides an object ranking visualization technique which both provides information on the values of the objects in the cluster, as well as on the most relevant objects for the subspace clustering. The color coding, illustrated in the figure on the left uses the HSV color space, where the axes for saturation and value correspond to the interestingness measure according to the subspace clustering, i.e. interesting subspace clusters show up in bright and intensive colors in their relevant dimensions. Their value is represented as the hue axes in HSV colors. Irrelevant dimensions are denoted in black for easier reading. As subspace clusters may be large, the system allows for zooming into areas of interest via a scroll bar to the right, which moves a zoom area, depicted via yellow lines at top and bottom. Individual objects can choose within this zoom area using the 2d or 3d input device. The user is then provided with information on the actual values of the object in question for most detailed analysis.

WEKA Integration and Usage of OpenSubspace

With OpenSubspace we provide an open source framework for the very active research area of subspace clustering and projected clustering. The aim of our framework is to establish a basis for comparable experiments and thorough evaluations in the area of clustering on high dimensional data. OpenSubspace is designed as the basis for comparative studies on the advantages and disadvantages of different subspace/projected clustering algorithms. In collaboration with the WEKA development team, we integrated OpenSubspace into WEKA 3.5.8 by using the new plugin-architecture of the Explorer. For demonstration purposes we provide the complete WEKA toolkit including the OpenSubspace framework on this website. A more detailed description of our framework can be found in our paper at Open Source in Data Mining workshop. We will extend the framework documentiation and its source code components as soon as possible. Especially the available data sets for evaluation will be extended and enriched with evaluation results.

We used and still use our framework for subspace clustering research but also for education in advanced data mining courses. In both cases we got positive feedback from our students who enjoyed easy and wide access and the predefined interfaces in our framework. Furthermore, we got encouraging feedback also by the community for our recent demonstration system which integrates extensible mining techniques into WEKA. Providing OpenSubspace as open source, our framework can be used by researchers and educators to understand, compare, and extend subspace and projected clustering algorithms. The integrated state-of-the-art performance measures and visualization techniques are first steps for a thorough evaluation of algorithms in this field of data mining.

Future Work

  • OpenSubspace can be seen as the natural basis for our next task. We plan to develop evaluation measures that meet the requirements for a global quality rating of subspace clustering results. Evaluations with the given measurements show that none of the measurements can provide an overall rating of quality. Some measurements give contradicting quality ratings on some data sets. Such effects show us that further research should be done in this area.
  • Visualization techniques give an overall impression on the groupings detected by the algorithms. However, further research of meaningful and intuitive visualization is clearly necessary for subspace clustering. The open source implementations of subspace/projected mining algorithms and the framework might encourage also researches in Visual Analytics or Human Computer Interaction to develop more meaningful visualization and exploration techniques.
  • For an overall evaluation framework OpenSubspace provides algorithm and evaluation implementations. However, further work has to be done to collect a bigger test set of high dimensional data sets. On such a benchmarking set one could collect best parameter settings for various algorithms, best quality results and screenshots of subspace clustering result visualizations as example clusters on these data sets. The aim of an overall evaluation framework with benchmarking data will then lead to a more mature subspace/projected clustering research field in which one can easily judge the quality of novel algorithms by comparing it with approved results of competing approaches.

Resources

Note: Java 1.6 is required in order to run OpenSubspace.
Executables and Sources:
(including WEKA 3.5.8)
OpenSubspace.zip
Data sets and cluster models: data.zip
Videotutorial: weka_subspaceclustering.avi

Citation Information

If you publish material based on databases, algorithms or evaluation measures obtained from this repository, then, in your acknowledgments, please note the assistance you received by using this repository. This will help others to obtain the same data sets, algorithms and evaluation measures and replicate your experiments. We suggest the following reference format for referring to this project:
Müller E., Günnemann S., Assent I., Seidl T.:
Evaluating Clustering in Subspace Projections of High Dimensional Data

http://dme.rwth-aachen.de/OpenSubspace/

In Proc. 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France. (2009)