OpenSubspace:

An Open Source Framework for Evaluation and Exploration of Subspace Clustering Algorithms in WEKA


Emmanuel Müller, Ira Assent, Stephan Günnemann, Timm Jansen and Thomas Seidl



Subspace clustering and projected clustering are recent research areas for clustering in high dimensional spaces. As the field is rather young, there is a lack of comparative studies on the advantages and disadvantages of the different algorithms. Part of the underlying problem is the lack of available open source implementations that could be used by researchers to understand, compare, and extend subspace and projected clustering algorithms. We propose OpenSubspace, an open source framework that meets these requirements. OpenSubspace integrates state-of-the-art performance measures and visualization techniques to foster research in subspace and projected clustering. We currently use this framework both in our lectures for teaching and in our research projects for experimnet evaluation. Our recent evaluation study published at VLDB 2009 is based on this framework. For further details please refer to our paper and to the supplementary material to this evaluation study. There, you can also find further details about possible parametrization of the underlying algorithms for running experiments.




Subspace Clustering

While subspace clustering is a rather young area that has been researched for only one decade, several distinct paradigms can be observed in the literature. Our system includes representatives of these paradigms to provide an overview over the techniques available. We provide implementations of the most recent approaches from different paradigms:

Clustering



Parameter Bracketing

Parameter setting is in general a difficult task for most data mining algorithms. This is especially the case for unsupervised techniques like clustering, where typically no prior knowledge about the data is available. This inherent problem of clustering is even more present in subspace clustering as the user has to provide parametrization for detecting clusters in different subspaces. In general the problem can be solved by guessing a parameter setting, looking at the result and then trying to choose a better parameter setting. To speed up this tedious process for users and give them more information to base their parameter choice on, the system visualizes a set of different subspace clusters at once, called bracketing of different parameter settings for direct feedback. This means that users obtain a series of MDS plots from which they pick the most appropriate one(s) for subsequent runs of the subspace clustering algorithms. By directly comparing the results of different parameter settings, parametrization is no longer a guess, but becomes an informed decision based on the visual analysis of the effects of parameters. Moreover, this process is far more comfortable for users and allows reaching the desired subspace clustering result in fewer steps.


Evaluation techniques

Quality of clustering or classification is usually measured in terms of accuracy, i.e. the ratio of correctly classified or clustered objects. For clustering, the "ground truth", i.e. the true clustering structure of the data, is usually not known. In fact, it is the very goal of clustering to detect this structure. As a consequence, clustering algorithms are often evaluated manually, ideally with the help of domain experts. However, domain experts are not always available, and they might not agree on the quality of the result. Their assessment of the clustering result is necessarily only based on the result itself, it cannot be compared to the "optimum" which is not known. Moreover, manual evaluation does not scale to large datasets or clustering result outcomes.
For more realistic evaluation of clustering algorithms, large scale analysis is therefore typically based on pre-labelled data, e.g. from classification applications. The underlying assumption is that the clustering structure typically reflects the class label assignment. At least for relative comparisons of clustering algorithms, this provides measures of the quality of the clustering result.

Our system provides the measures proposed in recent subspace clustering publications. In the figure below we present the evaluation output with various measures for comparing subspace clustering results.



Overview Browsing

Interactive exploration starts from an overview of all mined subspace clusters in which the user can browse. The automatically detected patterns are thus presented to the user for a general impression of the result and a comparison of the resulting subspace clusters. As clusters are detected in arbitrary subspaces, they cannot be compared based on the full space dimensionality. The system thus incorporates a distance function that takes the main characteristics of subspace clusters, their subspace dissimilarity and object dissimilarity into account for visualization in an MDS plot. Based on such an overall distance function, subspace clusters can be intuitively represented as circles in a 2d or 3d space. This approximation of the original high dimensional information to a 2d or 3d representation, allows human users to easily understand the result just by the visual impression. This MDS information gets enriched by additional visual features. The size of a subspace cluster is represented as the diameter of the circle. Its dimensionality is encoded by the color of the circle. This information allows users to identify similar subspace clusters, those clusters of similar dimensionality, or of similar size, or to study the overall distribution of these characteristics in the result for further analysis.
For a more detailed analysis of two different parameter settings the user can select two clusterings out of the presented series of MDS plots by clicking on them in the bracketing representation. These two subspace clusterings are then presented as larger plots in the lower part of the cluster overview screen. Once again, detailed information for the subspace clusters can be obtain by picking individual subspace clusters.



Subspace cluster details

(not yet implemented)
Selecting a subspace cluster in this 2d or 3d representation by clicking on it, the user gets more detailed information which is presented in a second pop-up window. In this window, the data distribution inside the selected subspace cluster is visualized via box-plots. For each dimension the median, the 25/75% quantiles and minimum/maximum values are presented. As a characteristic information for subspace clusters, the relevant dimensions, i.e. the respective subspace projections, are highlighted in red. For the irrelevant dimensions, that are not part of the subspace projection, we may observe a scattered distribution in the values, while in the relevant dimensions the data is clustered to a tighter value range. Having this information at hand, the user can investigate the quality and internal structure of the grouping of the objects in such subspace clusters and thus discover knowledge about the underlying structure of the data.


3d Browsing

The overview browsing provides two static 2-dimensional MDS plots. These static views provide a fixed perspective for easy comparison. For in-depth browsing, where focusing to different parts of the subspace clustering is of interest, a flexible navigation through MDS plots is provided. 3-dimensional MDS plot browsing allows users to shift, rotate and zoom into the MDS plot using standard 2-dimensional input devices or 3-dimensional input devices that allow for even more intuitive navigation. The user may thus identify similar or dissimilar subspace clusters that are of specific interest.


Object Ranking

Upon selection of an interesting subspace clustering from the 2d and 3d overview browsing, the user can analyze this clustering in-depth on the object level. In the overview MDS plots, the objects are summarized by circles. For detail information on the objects and their characteristics, the system provides an object ranking visualization technique which both provides information on the values of the objects in the cluster, as well as on the most relevant objects for the subspace clustering. The color coding, illustrated in the figure on the left uses the HSV color space, where the axes for saturation and value correspond to the interestingness measure according to the subspace clustering, i.e. interesting subspace clusters show up in bright and intensive colors in their relevant dimensions. Their value is represented as the hue axes in HSV colors. Irrelevant dimensions are denoted in black for easier reading. As subspace clusters may be large, the system allows for zooming into areas of interest via a scroll bar to the right, which moves a zoom area, depicted via yellow lines at top and bottom. Individual objects can choose within this zoom area using the 2d or 3d input device. The user is then provided with information on the actual values of the object in question for most detailed analysis.


WEKA Integration and Usage of OpenSubspace

With OpenSubspace we provide an open source framework for the very active research area of subspace clustering and projected clustering. The aim of our framework is to establish a basis for comparable experiments and thorough evaluations in the area of clustering on high dimensional data. OpenSubspace is designed as the basis for comparative studies on the advantages and disadvantages of different subspace/projected clustering algorithms. In collaboration with the WEKA development team, we integrated OpenSubspace into WEKA 3.5.8 by using the new plugin-architecture of the Explorer. For demonstration purposes we provide the complete WEKA toolkit including the OpenSubspace framework on this website. A more detailed description of our framework can be found in our paper at Open Source in Data Mining workshop. We will extend the framework documentiation and its source code components as soon as possible. Especially the available data sets for evaluation will be extended and enriched with evaluation results.

We used and still use our framework for subspace clustering research but also for education in advanced data mining courses. In both cases we got positive feedback from our students who enjoyed easy and wide access and the predefined interfaces in our framework. Furthermore, we got encouraging feedback also by the community for our recent demonstration system which integrates extensible mining techniques into WEKA. Providing OpenSubspace as open source, our framework can be used by researchers and educators to understand, compare, and extend subspace and projected clustering algorithms. The integrated state-of-the-art performance measures and visualization techniques are first steps for a thorough evaluation of algorithms in this field of data mining.

Future Work




Resources

Note: Java 1.6 is required in order to run OpenSubspace.
Executables and Sources:
(including WEKA 3.5.8)
OpenSubspace.zip
Data sets and cluster models: data.zip
Videotutorial: weka_subspaceclustering.avi


Citation Information

If you publish material based on databases, algorithms or evaluation measures obtained from this repository, then, in your acknowledgments, please note the assistance you received by using this repository. This will help others to obtain the same data sets, algorithms and evaluation measures and replicate your experiments. We suggest the following reference format for referring to this project:



Müller E., Günnemann S., Assent I., Seidl T.:
Evaluating Clustering in Subspace Projections of High Dimensional Data

http://dme.rwth-aachen.de/OpenSubspace/

In Proc. 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France. (2009)