External Evaluation Measures for Subspace Clustering

by Stephan Günnemann, Ines Färber, Emmanuel Müller, Ira Assent and Thomas Seidl

in Proc. 20th ACM Conference on Information and Knowledge Management (CIKM 2011), Glasgow, UK (2011)

 

 

Supplementary Material

Documentation of our software for subspace clustering evaluation measures:
experiment repeatability and usage of our evaluation framework

This website documents the experimental setup used in the evaluation described in our manuscript "External Evaluation Measures for Subspace Clustering". We provide all experimental data, setup, and software. Based on this evaluation framework, you can use all evaluation measures described in our manuscript for a thorough evaluation of your own subspace clustering approach (please note the citation information). We provide additional information for three main usage scenarios:

  • Repeatability of our experimental results given the executables and all datasets used in our manuscript.
  • Easy and thorough evaluation of subspace clustering techniques integrated into the WEKA framework.
  • Usage of our evaluation framework for evaluation of subspace clustering approaches in future research work.

In the following, we explain the usage of the evaluation measures for each of these scenarios, with screenshots and a thorough description of the required data formats.

Repeatability of experiments

To make our experiments repeatable, we include all datasets used as well as a batch file for each figure in our experiment section. As depicted in the screenshot below, each measured data point in our experiments can be run as a single run in the console. We aggregated all runs of one figure in one batch file so that their output is collected in a single output file. Overall, we provide all executables and datasets required to repeat our experiments. In the following sections, we describe how to extend this evaluation with further experiments or use it to evaluate subspace clustering approaches in the future.

 

Extending the WEKA framework for evaluation of subspace clustering

For easy evaluation, we integrated all evaluation measures discussed in our manuscript into the popular WEKA framework. To this end, we build on previous work in which we extended WEKA to subspace and projected clustering. A short description of this extension and how to use it can be found on our OpenSubspace project website. Initially, this project was designed for interactive exploration of clustering results [Morpheus: Interactive exploration of subspace clustering (demonstration system) presented at KDD 2008]. Further extensions added open interfaces for easy extensibility of our framework [OpenSubspace framework presented at OSDM workshop at PAKDD 2009]. For our submitted work, we extended this framework with novel evaluation measures. Using our framework, one can perform all evaluation measurements presented in our paper on arbitrary datasets given in the ARFF data format. For our thorough evaluation based on our evaluation scenarios, we additionally added hidden and found cluster files as presented in our submission. Furthermore, we included recently used evaluation measures for subspace clustering, especially our novel evaluation measure "E4SC".

 

 

Example for usage as evaluation software in future research work

In the following, we explain the main requirements for using our evaluation framework in your own subspace clustering evaluation. We describe the main file formats and the execution procedure.

1) File format of database to be clustered.

The data is described in the ARFF file format. Each row after the @data tag corresponds to one object.
Besides the dimensions of the data, an additional class label is required (appended at the end). This label can be used to encode further information about the object. Thus, the first x dimensions are used for clustering, while the last one (the class dimension) is skipped.
Please keep in mind that the class label is necessary and has to be included.

database.arff

% Exemplary database with 8 objects
%
% The first 4 dimensions are used for clustering (i.e. we have a 4-dimensional feature space).
%
% The last dimension indicates a label that can be used to store further information about the objects. This dimension is skipped for the actual clustering task! However, keep in mind that this additional class label is necessary for a correct parsing of the database file.
%
% The hidden clusters (ground truth) are stored in hiddenClusters.true

@relation database
@attribute dim0 real
@attribute dim1 real
@attribute dim2 real
@attribute dim3 real
@attribute class {-1, 1, 2, 13, 23}
@data
1.0, 2.0, 10.0, 7.0, 13
1.0, 2.0, 20.0, 7.0, 13
1.0, 2.0, 30.0, 10.0, 1
1.0, 2.0, 40.0, 20.0, 1
10.0, 5.0, 5.0, 30.0, 2
20.0, 5.0, 5.0, 7.0, 23
30.0, 5.0, 5.0, 7.0, 23
40.0, 10.0, 50.0, 40.0, -1
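
To illustrate how such a file is interpreted, the following is a minimal Python sketch that splits each object into its clustering dimensions and the trailing class label. It is not the parser used by the framework; the function name read_arff_objects is hypothetical, and the sketch assumes dense, comma-separated rows without quoted or missing values.

```python
from typing import List, Tuple


def read_arff_objects(text: str) -> Tuple[List[List[float]], List[str]]:
    """Split ARFF text into feature vectors and class labels (last column).

    Hypothetical helper for illustration: handles only dense rows with
    numeric dimensions followed by one class label.
    """
    features: List[List[float]] = []
    labels: List[str] = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comment lines
        if not in_data:
            # Header lines (@relation, @attribute) are ignored until @data.
            in_data = line.lower().startswith("@data")
            continue
        values = [v.strip() for v in line.split(",")]
        features.append([float(v) for v in values[:-1]])  # first x dimensions
        labels.append(values[-1])  # class label, skipped for clustering
    return features, labels


EXAMPLE_ARFF = """\
% Exemplary database with 8 objects
@relation database
@attribute dim0 real
@attribute dim1 real
@attribute dim2 real
@attribute dim3 real
@attribute class {-1, 1, 2, 13, 23}
@data
1.0, 2.0, 10.0, 7.0, 13
1.0, 2.0, 20.0, 7.0, 13
1.0, 2.0, 30.0, 10.0, 1
1.0, 2.0, 40.0, 20.0, 1
10.0, 5.0, 5.0, 30.0, 2
20.0, 5.0, 5.0, 7.0, 23
30.0, 5.0, 5.0, 7.0, 23
40.0, 10.0, 50.0, 40.0, -1
"""

features, labels = read_arff_objects(EXAMPLE_ARFF)
print(len(features))   # 8
print(features[0])     # [1.0, 2.0, 10.0, 7.0]
print(labels[-1])      # -1
```

The position of an object in the returned list corresponds to its 0-based row index after the @data tag.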

 

2) File format of hidden clusters (e.g. synthetically generated) and identified clusters (e.g. output of an algorithm).

The first line contains header information (required; arbitrary text). Each of the following lines describes one hidden/identified subspace cluster.

Given an x-dimensional database, the first x entries form a binary vector that represents the relevant dimensions of the subspace cluster (0=dimension is irrelevant, 1=dimension is relevant). The next entry is the number of objects in the cluster, followed by the IDs of the corresponding objects in ascending order.
Counting of IDs starts with 0.

Example of a subspace cluster with relevant dimensions {0,1,3} and 4 objects in a 6-dimensional database:
1 1 0 1 0 0 4 7 11 13 17
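
The line layout described above can be decoded as in the following Python sketch. The function name parse_cluster_line and its error handling are assumptions made for illustration and are not part of the framework.

```python
from typing import List, Tuple


def parse_cluster_line(line: str, num_dims: int) -> Tuple[List[int], List[int]]:
    """Decode one cluster line into (relevant dimensions, object IDs).

    Hypothetical helper: num_dims is the dimensionality x of the database.
    """
    tokens = line.split()
    if len(tokens) < num_dims + 1:
        raise ValueError("line is too short for the given dimensionality")
    # The first x entries are the binary relevance vector.
    flags = [int(t) for t in tokens[:num_dims]]
    if any(f not in (0, 1) for f in flags):
        raise ValueError("relevance vector must contain only 0 and 1")
    # The next entry is the cluster size, followed by the object IDs.
    size = int(tokens[num_dims])
    object_ids = [int(t) for t in tokens[num_dims + 1:]]
    if len(object_ids) != size:
        raise ValueError("declared cluster size does not match the ID count")
    relevant_dims = [d for d, f in enumerate(flags) if f == 1]
    return relevant_dims, object_ids


dims, ids = parse_cluster_line("1 1 0 1 0 0 4 7 11 13 17", num_dims=6)
print(dims)  # [0, 1, 3]
print(ids)   # [7, 11, 13, 17]
```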

hiddenClusters.true
Header (required);Ground truth indicates 3 clusters with 4, 3, and 4 objects resp.
1 1 0 0 4 0 1 2 3
0 1 1 0 3 4 5 6
0 0 0 1 4 0 1 5 6

 

outputOfAlgorithm.log
Header (required);Algorithm has identified 2 clusters with 5 and 2 objects resp.
1 1 0 1 5 0 1 2 3 7
0 1 1 0 2 4 6
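
If your own algorithm keeps its result in memory, a file in this format could be produced along the lines of the following Python sketch. The function name format_cluster_file and the representation of a cluster as a pair of sets are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

Cluster = Tuple[Set[int], Set[int]]  # (relevant dimensions, object IDs)


def format_cluster_file(header: str, clusters: Iterable[Cluster], num_dims: int) -> str:
    """Serialize subspace clusters: one header line, then one line per cluster."""
    lines = [header]  # the header line is required; its text is arbitrary
    for dims, object_ids in clusters:
        # Binary vector over all dimensions: 1 = relevant, 0 = irrelevant.
        flags = ["1" if d in dims else "0" for d in range(num_dims)]
        ids = sorted(object_ids)  # IDs are 0-based and listed in ascending order
        lines.append(" ".join(flags + [str(len(ids))] + [str(i) for i in ids]))
    return "\n".join(lines) + "\n"


found = [
    ({0, 1, 3}, {0, 1, 2, 3, 7}),
    ({1, 2}, {4, 6}),
]
content = format_cluster_file(
    "Header (required);Algorithm has identified 2 clusters with 5 and 2 objects resp.",
    found,
    num_dims=4,
)
print(content)
```

For the two clusters above, the printed text matches the content of outputOfAlgorithm.log shown in this section.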

 

3) Evaluation of subspace clustering results

To evaluate the clustering result (outputOfAlgorithm.log) of an algorithm with respect to a ground truth (hiddenClusters.true) of a given database (database.arff), execute the following command:

run.bat
java -cp i9-weka.jar;weka.jar;i9-subspace.jar;Jama.jar;jsc.jar;commons-math-1.1.jar;vecmath.jar;j3dcore.jar;j3dutils.jar weka.subspaceClusterer.MeasureEvaluation -t database.arff -c last -M F1_P;F1_R;F1_Merge;Accuracy;RNIA;CE;E4SC -TRUEFILE hiddenClusters.true -LOGFILE outputOfAlgorithm.log



The list F1_P;F1_R;F1_Merge;Accuracy;RNIA;CE;E4SC indicates the applied measures. Subsets of these measures are allowed, for example E4SC;CE;RNIA.

Please keep in mind that on some operating systems the ; has to be replaced by a :
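
One way to avoid editing the separator by hand is to assemble the command in a small script. The following Python sketch builds the argument list from run.bat using the platform's path separator (; on Windows, : on Linux and macOS, which is how Java expects classpath entries to be separated). The function name build_command is hypothetical. The sketch also assumes that only the classpath separator depends on the operating system, so the measure list keeps the ; shown in run.bat.

```python
import os
from typing import List, Sequence

# Jar files exactly as listed in run.bat.
JARS = [
    "i9-weka.jar", "weka.jar", "i9-subspace.jar", "Jama.jar", "jsc.jar",
    "commons-math-1.1.jar", "vecmath.jar", "j3dcore.jar", "j3dutils.jar",
]


def build_command(database: str, true_file: str, log_file: str,
                  measures: Sequence[str],
                  path_sep: str = os.pathsep) -> List[str]:
    """Build the evaluation command as an argument list (hypothetical helper)."""
    return [
        "java",
        "-cp", path_sep.join(JARS),  # classpath separator depends on the OS
        "weka.subspaceClusterer.MeasureEvaluation",
        "-t", database,
        "-c", "last",
        # Assumption: the measure list is always separated by ';' as in run.bat.
        "-M", ";".join(measures),
        "-TRUEFILE", true_file,
        "-LOGFILE", log_file,
    ]


cmd = build_command("database.arff", "hiddenClusters.true",
                    "outputOfAlgorithm.log", ["E4SC", "CE", "RNIA"])
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) starts the evaluation without a shell, so the ; inside the measure list is not interpreted as a command separator. When typing the command into a Unix shell instead, the measure list should be quoted for the same reason.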



Resources

Executables and datasets for our experiments: executables.zip
Example files for future usage: example.zip



Citation Information


If you publish material based on databases, algorithms, parameter settings or evaluation measures obtained from this repository, then, in your acknowledgments, please note the assistance you received by using this repository. This will help others to obtain the same data sets, algorithms, parameter settings and evaluation measures and replicate your experiments. We suggest the following reference format for referring to this project:

 

Günnemann S., Färber I., Müller E., Assent I., Seidl T.:
External Evaluation Measures for Subspace Clustering

http://dme.rwth-aachen.de/OpenSubspace/E4SC/

In Proc. 20th ACM Conference on Information and Knowledge Management (CIKM 2011), Glasgow, UK. (2011)