OpenSubspace:
An Open Source Framework for Evaluation and Exploration of
Subspace Clustering Algorithms in WEKA
Subspace clustering and projected clustering are recent research areas
for clustering in high dimensional spaces. As the field is rather
young, there is a lack of comparative studies on the advantages and
disadvantages of the different algorithms. Part of the underlying
problem is the lack of available open source implementations that could
be used by researchers to understand, compare, and extend subspace and
projected clustering algorithms. We propose OpenSubspace, an open
source framework that meets these requirements. OpenSubspace integrates
state-of-the-art performance measures and visualization techniques to
foster research in subspace and projected clustering.
We currently use this framework both in our lectures
for
teaching and in our research projects for experimnet evaluation. Our
recent evaluation
study published at VLDB 2009 is based on
this framework. For further details please refer to our paper and to
the supplementary material
to this evaluation study. There, you can also find further details
about possible parametrization
of the underlying algorithms for running experiments.

Subspace Clustering
While subspace clustering is a rather young area that has been
researched for only one decade, several distinct paradigms can be
observed in the literature. Our system includes representatives of
these paradigms to provide an overview over the techniques available.
We provide implementations of the most recent approaches from different
paradigms:
- Cell-based
subspace clustering discretizes the data space for efficient
detection of dense grid cells in a bottom-up fashion. It was introduced
in the CLIQUE approach which exploits monotonicity
on the density of grid cells for pruning. SCHISM extends CLIQUE
using a variable threshold adapted to the dimensionality of the
subspace as well as efficient heuristics for pruning. In contrast, DOC and MINECLUS use
variable cells represented by hypercubes.
- Density-based subspace clustering
defines clusters as dense areas separated by sparsely populated areas.
In SUBCLU, a density monotonicity property is used
to prune subspaces in a bottom-up fashion. FIRES extends this
paradigm by using variable neighborhoods and an approximative
heuristicts for efficient computation. In INSCY we use
dimensionality unbias density, normalized with respect to the
dimensionality of the subspace and in addition in-process pruning of
redundant subspace clusters achieves meaningful result sizes.
- Clustering oriented methods optimize the
overall clustering result. PROCLUS extends the
k-medoid algorithm by iteratively refining a full-space k-medoid
clustering in a top-down manner. P3C combines
one-dimensional cluster cores to higher-dimensional clusters bottom-up.
STATPC uses a statistical test to remove redundant clusters out of the
result.
Acknowledgments for algorithm contributions:
We thank the authors of SUBCLU, FIRES and MINECLUS for providing us
with their original implementations.
Parameter Bracketing
Parameter setting is in general a difficult task for most
data mining algorithms. This is especially the case for unsupervised
techniques like clustering, where typically no prior knowledge about
the data is available. This inherent problem of clustering is even more
present in subspace clustering as the user has to provide
parametrization for detecting clusters in different subspaces. In
general the problem can be solved by guessing a parameter setting,
looking at the result and then trying to choose a better parameter
setting. To speed up this tedious process for users and give them more
information to base their parameter choice on, the system
visualizes a set of different subspace clusters at once, called
bracketing of different parameter settings for direct feedback. This
means that users obtain a series of MDS plots from which they pick the
most appropriate one(s) for subsequent runs of the subspace clustering
algorithms. By directly comparing the results of different parameter
settings, parametrization is no longer a guess, but becomes an informed
decision based on the visual analysis of the effects of parameters.
Moreover, this process is far more comfortable for users and allows
reaching the desired subspace clustering result in fewer steps.
Evaluation techniques
Quality of clustering or classification is usually measured in terms of
accuracy, i.e. the ratio of correctly classified or clustered objects.
For clustering, the "ground truth", i.e. the true clustering structure
of the data, is usually not known. In fact, it is the very goal of
clustering to detect this structure. As a consequence, clustering
algorithms are often evaluated manually, ideally with the help of
domain experts. However, domain experts are not always available, and
they might not agree on the quality of the result. Their assessment of
the clustering result is necessarily only based on the result itself,
it cannot be compared to the "optimum" which is not known. Moreover,
manual evaluation does not scale to large datasets or clustering result
outcomes.
For more realistic evaluation of clustering algorithms, large scale
analysis is therefore typically based on pre-labelled data, e.g. from
classification applications. The underlying assumption is that the
clustering structure typically reflects the class label assignment. At
least for relative comparisons of clustering algorithms, this provides
measures of the quality of the clustering result.
Our
system provides the measures proposed in recent subspace clustering
publications. In the figure below we present the evaluation output with
various measures for comparing subspace clustering results.
- F1-value: The F1-value is commonly used
in evaluation of classifiers and recently also for subspace or
projected clustering as well. It is computed as the harmonic mean of
recall ("are all clusters detected?") and precision ("are the clusters
accurately detected?"). The F1-value of the whole clustering is simply
the average of all F1-values.
- Entropy and Coverage: Corresponding
roughly to the measures of precision and recall, entropy accounts for
purity of the clustering, while coverage measures the size of the
clustering, i.e. the percentage of objects in any subspace cluster. The
System provides both coverage and entropy (for readability, inverse
entropy as a percentage).
- Accuracy: Accuracy of classifiers (e.g.
C4.5 decision tree) built on the detected patterns compared with the
accuracy of the same classifier on the original data is another quality
measure. It indicates to which extend the subspace clustering
successfully generalizes the underlying data distribution.
- CE and RNIA:
Up to now no measure accounts for the relevant dimensions of a subspace
cluster. However for synthetic data sets this information is available.
CE and RNIA use this additional information to compare not only the
object clusters but also the detected subspaces.
Overview Browsing
Interactive exploration starts from an overview of all mined
subspace clusters in which the user can browse. The automatically
detected patterns are thus presented to the user for a general
impression of the result and a comparison of the resulting subspace
clusters. As clusters are detected in arbitrary subspaces, they cannot
be compared based on the full space dimensionality. The system thus
incorporates a distance function that takes the main characteristics of
subspace clusters, their subspace dissimilarity and object
dissimilarity into account for visualization in an MDS plot. Based on
such an overall distance function, subspace clusters can be intuitively
represented as circles in a 2d or 3d space. This approximation of the
original high dimensional information to a 2d or 3d representation,
allows human users to easily understand the result just by the visual
impression. This MDS information gets enriched by additional visual
features. The size of a subspace cluster is represented as the diameter
of the circle. Its dimensionality is encoded by the color of the
circle.
This information allows users to identify similar subspace clusters,
those clusters of similar dimensionality, or of similar size, or to
study the overall distribution of these characteristics in the result
for further analysis.
For a more detailed analysis of two different parameter settings the
user can select two clusterings out of the presented series of MDS
plots by clicking on them in the bracketing representation. These two
subspace clusterings are then presented as larger plots in the lower
part of the cluster overview screen. Once again, detailed information
for the subspace clusters can be obtain by picking individual subspace
clusters.
Subspace cluster details
(not yet implemented)
Selecting a subspace cluster in this 2d or 3d representation by
clicking on it, the user gets more detailed information which is
presented in a second pop-up window. In this window, the data
distribution inside the selected subspace cluster is visualized via
box-plots. For each dimension the median, the 25/75% quantiles and
minimum/maximum values are presented. As a characteristic information
for subspace clusters, the relevant dimensions, i.e. the respective
subspace projections, are highlighted in red. For the irrelevant
dimensions, that are not part of the subspace projection, we may
observe a scattered distribution in the values, while in the relevant
dimensions the data is clustered to a tighter value range. Having this
information at hand, the user can investigate the quality and internal
structure of the grouping of the objects in such subspace clusters and
thus discover knowledge about the underlying structure of the data.
3d Browsing
The overview
browsing provides two static 2-dimensional MDS plots.
These static views provide a fixed perspective for easy comparison. For
in-depth browsing, where focusing to different parts of the subspace
clustering is of interest, a flexible navigation through MDS plots is
provided. 3-dimensional MDS plot browsing allows users to shift, rotate
and zoom into the MDS plot using standard 2-dimensional input devices
or 3-dimensional input devices that allow for even more intuitive
navigation. The user may thus identify similar or dissimilar subspace
clusters that are of specific interest.
Object Ranking
Upon
selection of an interesting subspace clustering from the 2d and 3d
overview browsing, the user can analyze this clustering in-depth on the
object level. In the overview MDS plots, the objects are summarized by
circles. For detail information on the objects and their
characteristics, the system provides an object ranking visualization
technique which both provides information on the values of the objects
in the cluster, as well as on the most relevant objects for the
subspace clustering. The color coding, illustrated in the figure on the
left uses the HSV color space, where the axes for saturation and value
correspond to the interestingness measure according to the subspace
clustering, i.e. interesting subspace clusters show up in bright and
intensive colors in their relevant dimensions. Their value is
represented as the hue axes in HSV colors. Irrelevant dimensions are
denoted in black for easier reading.
As subspace clusters may be large, the system allows for zooming into
areas of interest via a scroll bar to the right, which moves a zoom
area, depicted via yellow lines at top and bottom. Individual objects
can choose within this zoom area using the 2d or 3d input device. The
user is then provided with information on the actual values of the
object in question for most detailed analysis.
WEKA Integration and Usage of OpenSubspace
With OpenSubspace we provide an open source framework for the very
active research area of subspace clustering and projected clustering.
The aim of our framework is to establish a basis for comparable
experiments and thorough evaluations in the area of clustering on high
dimensional data. OpenSubspace is designed as the basis for comparative
studies on the advantages and disadvantages of different
subspace/projected clustering algorithms. In collaboration
with
the WEKA development team, we integrated OpenSubspace
into WEKA
3.5.8 by using the new plugin-architecture
of the Explorer. For
demonstration purposes we provide the complete WEKA toolkit including
the OpenSubspace framework on this website. A more detailed description of our framework can be found in our paper at Open
Source in Data Mining workshop. We will extend the
framework documentiation and its source code components as soon as
possible. Especially the available data sets for evaluation will be
extended and enriched with evaluation results.
We used and still use our framework for subspace clustering research
but also for education in advanced data mining courses. In both cases
we got positive feedback from our students who enjoyed easy and wide
access and the predefined interfaces in our framework. Furthermore, we
got encouraging feedback also by the community for our recent
demonstration system which integrates extensible mining techniques into
WEKA.
Providing OpenSubspace as open source, our framework can be used by
researchers and educators to understand, compare, and extend subspace
and projected clustering algorithms. The integrated state-of-the-art
performance measures and visualization techniques are first steps for a
thorough evaluation of algorithms in this field of data mining.
Future Work
- OpenSubspace can be seen as the natural basis for our
next
task. We plan to develop evaluation measures that meet the requirements
for a global quality rating of subspace clustering results. Evaluations
with the given measurements show that none of the measurements can
provide an overall rating of quality. Some measurements give
contradicting quality ratings on some data sets. Such effects show us
that further research should be done in this area.
- Visualization techniques give an overall impression on
the
groupings detected by the algorithms. However, further research of
meaningful and intuitive visualization is clearly necessary for
subspace clustering. The open source implementations of
subspace/projected mining algorithms and the framework might encourage
also researches in Visual Analytics or Human Computer Interaction to
develop more meaningful visualization and exploration techniques.
- For an overall evaluation framework OpenSubspace provides
algorithm and evaluation implementations. However, further work has to
be done to collect a bigger test set of high dimensional data sets. On
such a benchmarking set one could collect best parameter settings for
various algorithms, best quality results and screenshots of subspace
clustering result visualizations as example clusters on these data
sets. The aim of an overall evaluation framework with benchmarking data
will then lead to a more mature subspace/projected clustering research
field in which one can easily judge the quality of novel algorithms by
comparing it with approved results of competing approaches.
Resources
Note: Java 1.6 is required in order to run OpenSubspace.
Citation Information
If you publish material based on databases, algorithms or
evaluation measures
obtained from this repository, then, in your acknowledgments, please
note the assistance you received by using this repository. This will
help others to
obtain the same data sets, algorithms and evaluation measures and
replicate your experiments. We suggest
the following reference format for referring to this project:
Müller
E., Günnemann S., Assent I., Seidl T.:
Evaluating Clustering in Subspace Projections of High Dimensional Data
http://dme.rwth-aachen.de/OpenSubspace/
In Proc. 35th International Conference on Very Large Data Bases (VLDB
2009), Lyon, France. (2009)