Evaluating Clustering in Subspace Projections of High Dimensional Data

In Proc. 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France. (2009)

Emmanuel Müller, Stephan Günnemann, Ira Assent and Thomas Seidl

 

Supplementary material concerning repeatability

On this website we provide supplementary material enabling repeatability of our experiments. Please refer to our paper for further details about the clustering paradigms and evaluation measures we used. Here we focus only on the underlying implementation in the OpenSubspace framework and the parameter settings used in our evaluation.

 

Using the WEKA framework for repeatability  

For easy repeatability, we integrated all algorithms for clustering in subspace projections of high dimensional data into the popular WEKA framework, extending it to subspace clustering. A short description of this extension and how to use it can be found on our OpenSubspace website, which also includes a video tutorial giving a short introduction. Using our framework, one can perform all evaluation measurements presented in our paper and interactively explore the clustering results. Our ongoing work focuses on this open framework, which we develop in contact with the WEKA development team.

 

Repeatability expectations

We have reasonable expectations regarding the similarity of the experiment results. Two observations will be made: one by us as the authors of this paper and one by other researchers repeating our experiments. These two observations should be similar; however, they might vary slightly from each other. For instance, there is no way to ensure that the hardware used in our experiments is available to other researchers repeating the evaluation. Therefore, we do not expect measured execution times to match those reported in the paper, but rather curves with roughly similar tendencies. When measuring quantities other than running time, such as result sizes in experiments with no randomized component, we do expect to obtain the results presented in the paper. Some of the approaches, however, include randomized components and thus will show similar results only in the average case; single runs might vary. An example of a randomized component is the random initialization of cluster centers in the PROCLUS algorithm.
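The average-case comparison for randomized approaches can be sketched as follows. This is only an illustration: `repeat_runs` and `toy_randomized_run` are hypothetical names, and the toy run is a stand-in for a real randomized algorithm such as PROCLUS, not part of OpenSubspace.

```python
import random
import statistics
from typing import Callable, List, Tuple


def repeat_runs(run: Callable[[int], float], seeds: List[int]) -> Tuple[float, float]:
    """Run a randomized experiment once per seed; return (mean, standard deviation)."""
    results = [run(seed) for seed in seeds]
    return statistics.mean(results), statistics.pstdev(results)


def toy_randomized_run(seed: int) -> float:
    """Hypothetical stand-in for one run of a randomized algorithm.

    It picks 3 random "centers" among 100 points on a line and returns the
    mean distance of each point to its nearest center.
    """
    rng = random.Random(seed)  # a fixed seed makes this single run repeatable
    points = list(range(100))
    centers = rng.sample(points, 3)
    return statistics.mean(min(abs(p - c) for c in centers) for p in points)


mean, spread = repeat_runs(toy_randomized_run, seeds=list(range(10)))
print(f"mean quality {mean:.2f}, spread {spread:.2f} over 10 runs")
```

Reporting the spread alongside the mean gives other researchers a rough idea of how far a single run may deviate from the averaged value.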

 

Parameter settings

For repeatability, parameter names are as in the original publications. Please refer to the original publications for a more detailed description. Some parameters have not been described or named in the publications; we therefore tried to give them names that are as meaningful as possible. Most of these undocumented parameters are included because we use the original implementations provided by the authors of SUBCLU, FIRES, INSCY and MINECLUS. For example, FIRES has its main parameters (K, MINCLU and MU), while its pre- and post-processing parameters are only of minor interest. We have optimized all parameters for each algorithm on each data set and listed them for repeatability on this website.
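One common way to pick a setting from a parameter grid is an exhaustive search that keeps the best-scoring combination. The sketch below illustrates this for the three main FIRES parameters, using the synthetic-data values listed on this page. It is an assumption-laden illustration: this page does not state which procedure or quality measure was used for the optimization, and `best_setting` and `placeholder_score` are hypothetical names. The placeholder score is arbitrary and unrelated to any real evaluation measure.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def best_setting(grid: Dict[str, List[float]],
                 score: Callable[[Params], float]) -> Tuple[Params, float]:
    """Try every combination in `grid` and return the highest-scoring one."""
    names = list(grid)
    best_params: Params = {}
    best_score = float("-inf")
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        value = score(params)
        if value > best_score:
            best_params, best_score = params, value
    return best_params, best_score


# Main FIRES parameters with the synthetic-data values listed on this page.
fires_grid = {
    "GRAPH_K": [1, 4, 7, 10],
    "GRAPH_MINCLU": [1, 2, 3, 4],
    "GRAPH_MU": [1, 4, 7, 10],
}


def placeholder_score(params: Params) -> float:
    """Hypothetical score; a real study would run the algorithm here and
    evaluate the resulting clustering with a quality measure."""
    return -(abs(params["GRAPH_K"] - 7)
             + abs(params["GRAPH_MINCLU"] - 2)
             + abs(params["GRAPH_MU"] - 4))


params, value = best_setting(fires_grid, placeholder_score)
print(params, value)
```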

Resources

Note: Java 1.6 is required in order to run OpenSubspace.

Executables and Sources (including WEKA 3.5.8): OpenSubspace.zip
Data sets and cluster models: data.zip
Video tutorial: weka_subspaceclustering.avi

 

Citation Information

If you publish material based on databases, algorithms, parameter settings or evaluation measures obtained from this repository, then, in your acknowledgments, please note the assistance you received by using this repository. This will help others to obtain the same data sets, algorithms, parameter settings and evaluation measures and replicate your experiments. We suggest the following reference format for referring to this project:

Müller E., Günnemann S., Assent I., Seidl T.:
Evaluating Clustering in Subspace Projections of High Dimensional Data

http://dme.rwth-aachen.de/OpenSubspace/evaluation/

In Proc. 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France. (2009) 





Parameter Settings for repeatability of our experiments:

1. Synthetic Data:

Cell-Based Paradigm
CLIQUE
Parameter  From   Offset  Op  Steps  To
TAU        0.001  10      *   3      0.1
XI         5      5       +   6      30

Total number of experiments: 18
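The table notation can be read as a small generator of parameter values. The sketch below assumes that Steps counts the number of values tried (including From), that Op and Offset describe how to get from one value to the next, and that To is the last value reached. This reading is consistent with the totals listed on this page but is not stated explicitly; `expand_range` is a hypothetical helper, not part of OpenSubspace.

```python
import math
from typing import List


def expand_range(start: float, offset: float, op: str, steps: int) -> List[float]:
    """Return the `steps` values of one table row, beginning at `start`."""
    if op not in ("+", "*"):
        raise ValueError(f"unsupported operator: {op!r}")
    values = [start]
    for _ in range(steps - 1):
        last = values[-1]
        values.append(last + offset if op == "+" else last * offset)
    return values


# CLIQUE rows from the table above: (From, Offset, Op, Steps).
clique = {
    "TAU": (0.001, 10, "*", 3),
    "XI": (5, 5, "+", 6),
}

grid = {name: expand_range(*row) for name, row in clique.items()}

# The number of experiments is the product of the per-parameter value counts.
total = math.prod(len(values) for values in grid.values())

print(grid["XI"])  # [5, 10, 15, 20, 25, 30]
print(total)       # 18
```

Parameters with Steps equal to 1 contribute a single fixed value and therefore do not change the total.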


DOC
Parameter  From   Offset  Op  Steps  To
ALPHA      0.001  10      *   3      0.1
BETA       0.1    0.1     +   4      0.4
MAXITER    1024   0       +   1      1024
k          2      2       *   6      64
w          50     2       *   3      200

Total number of experiments: 216


MINECLUS
Parameter  From   Offset  Op  Steps  To
ALPHA      0.001  10      *   3      0.1
BETA       0.1    0.1     +   4      0.4
MAXOUT     -1     0       +   1      -1.0
k          2      2       *   6      64
numBins    1      0       +   1      1
w          50     2       *   3      200

Total number of experiments: 216


SCHISM
Parameter  From     Offset  Op  Steps  To
TAU        1.0E-12  1000    *   5      1.0
XI         5        5       +   6      30
u          0.05     0       +   1      0.05

Total number of experiments: 30



Density-Based Paradigm
SUBCLU
Parameter  From  Offset  Op  Steps  To
epsilon    10    2       *   6      320
minPoints  2     2       *   5      32

Total number of experiments: 30


FIRES
Parameter            From  Offset  Op  Steps  To
BASE_DBSCAN_EPSILON  1.0   0       +   1      1.0
BASE_DBSCAN_MINPTS   6     0       +   1      6
GRAPH_K              1     3       +   4      10
GRAPH_MINCLU         1     1       +   4      4
GRAPH_MU             1     3       +   4      10
GRAPH_SPLIT          0.66  0       +   1      0.66
POST_DBSCAN_EPSILON  3.0   0       +   1      3.0
POST_DBSCAN_MINPTS   6     0       +   1      6
PRE_MINIMUMPERCENT   25    0       +   1      25

Total number of experiments: 64


INSCY
Parameter           From  Offset  Op  Steps  To
density             10    0       +   1      10
epsilon             10    2       *   5      160
gridSize            50    0       +   1      50
maximalClusterRate  0.0   0       +   1      0.0
minPoints           2     2       *   5      32
minSize             20    2       *   5      320
usingKernel         1     0       +   1      1

Total number of experiments: 125



Clustering Oriented Paradigm
PROCLUS
Parameter           From  Offset  Op  Steps  To
avgerageDimensions  1     2       +   38     75
numberOfClusters    2     2       *   6      64

Total number of experiments: 228


P3C
Parameter  From   Offset  Op  Steps  To
alpha      0.001  0       +   1      0.001
possion    10     10      +   10     100

Total number of experiments: 10


STATPC
Parameter  From     Offset  Op  Steps  To
alpha 0    1.0E-20  10000   *   6      1
alpha h    1.0E-20  10000   *   6      1
alpha k    1.0E-20  10000   *   6      1

Total number of experiments: 216



 

2. Real World Data:

Cell-Based Paradigm
CLIQUE
Parameter  From   Offset  Op  Steps  To
TAU        0.001  10      *   3      0.1
XI         5      5       +   6      30

Total number of experiments: 18


DOC
Parameter  From   Offset  Op  Steps  To
ALPHA      0.001  10      *   3      0.1
BETA       0.1    0.1     +   4      0.4
MAXITER    1024   0       +   1      1024
k          2      2       *   6      64
w          5      2       *   3      200

Total number of experiments: 216


MINECLUS
Parameter  From   Offset  Op  Steps  To
ALPHA      0.001  10      *   3      0.1
BETA       0.1    0.1     +   4      0.4
MAXOUT     -1     0       +   1      -1.0
k          2      2       *   6      64
numBins    1      0       +   1      1
w          5      2       *   3      200

Total number of experiments: 216


SCHISM
Parameter  From     Offset  Op  Steps  To
TAU        1.0E-12  1000    *   5      1.0
XI         5        5       +   6      30
u          0.05     0       +   1      0.05

Total number of experiments: 30



Density-Based Paradigm
SUBCLU
Parameter  From  Offset  Op  Steps  To
epsilon    1     1.6     *   9      42.9
minPoints  2     2       *   6      64

Total number of experiments: 54


FIRES
Parameter            From  Offset  Op  Steps  To
BASE_DBSCAN_EPSILON  0.4   0       +   1      0.4
BASE_DBSCAN_MINPTS   6     0       +   1      6
GRAPH_K              3     1       +   8      10
GRAPH_MINCLU         1     1       +   4      4
GRAPH_MU             1     1       +   10     10
GRAPH_SPLIT          0.66  0       +   1      0.66
POST_DBSCAN_EPSILON  2     0       +   1      2
POST_DBSCAN_MINPTS   6     0       +   1      6
PRE_MINIMUMPERCENT   25    0       +   1      25

Total number of experiments: 320


INSCY
Parameter           From  Offset  Op  Steps  To
density             10    0       +   1      10
epsilon             1     1.6     *   9      42.9
gridSize            10    0       +   1      10
maximalClusterRate  0.0   0       +   1      0.0
minPoints           2     2       *   6      64
minSize             2     2       *   8      256
usingKernel         1     0       +   1      1

Total number of experiments: 432



Clustering Oriented Paradigm
PROCLUS
Parameter           From  Offset  Op  Steps  To
avgerageDimensions  2     2       +   16     32
numberOfClusters    2     4       +   14     54

Total number of experiments: 224


P3C
Parameter  From   Offset  Op  Steps  To
alpha      0.001  0       +   1      0.001
possion    10     10      +   10     100

Total number of experiments: 10


STATPC
Parameter  From     Offset  Op  Steps  To
alpha 0    1.0E-20  10000   *   6      1
alpha h    1.0E-20  10000   *   6      1
alpha k    1.0E-20  10000   *   6      1

Total number of experiments: 216