Evaluating Clustering in Subspace Projections of High Dimensional Data
In Proc. 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France. (2009)
Supplementary material concerning repeatability
On this website we provide supplementary material that enables repeatability of our experiments. Please refer to our paper for further details about the clustering paradigms and evaluation measures we used. This website covers only the underlying implementation in the OpenSubspace framework and the parameter settings used in our evaluation.
Using the WEKA framework for repeatability
To make our experiments easy to repeat, we integrated all algorithms for clustering in subspace projections of high dimensional data into the popular WEKA framework, which we extended to support subspace clustering. A short description of this extension and how to use it can be found on our OpenSubspace website, which also includes a video tutorial giving a short introduction. Using our framework, one can perform all evaluation measurements presented in our paper and interactively explore the clustering results. Our ongoing work focuses on this open framework, which we develop in contact with the WEKA development team.
Repeatability expectations
We have reasonable expectations regarding the similarity of the experiment results. Two observations will be made: one by us as the authors of this paper and one by other researchers repeating our experiments. These two observations should be similar; however, they might vary slightly from each other. For instance, there is no way to ensure that the hardware used in our experiments is available to other researchers repeating the evaluation. Therefore, we do not expect measured execution times to match those reported in the paper, but rather to show roughly similar curve tendencies. When measuring quantities other than running time, such as result sizes, in experiments with no randomized component, we do expect to obtain the results presented in the paper. Some of the approaches, however, include randomized components and thus will show similar results only in the average case; single runs might vary. An example of a randomized component is the random initialization of cluster centers in the PROCLUS algorithm.
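Since single runs of a randomized algorithm may vary, comparisons are best made on averages over several runs. The following is only a minimal sketch of that idea: `run_once` is a hypothetical stand-in for one run of a randomized clustering algorithm that returns a quality score, and it is not part of OpenSubspace or WEKA.

```python
import random
import statistics


def run_once(seed):
    """Hypothetical stand-in for one run of a randomized algorithm.

    Returns a made-up quality score that depends on the seed, to mimic
    the effect of e.g. random initialization of cluster centers.
    """
    rng = random.Random(seed)
    return 0.8 + rng.uniform(-0.05, 0.05)


def average_over_runs(run, seeds):
    """Run `run` once per seed and return (mean, sample standard deviation)."""
    scores = [run(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


# Ten runs with fixed seeds, so the average itself is repeatable.
mean_score, std_score = average_over_runs(run_once, range(10))
```

Fixing the list of seeds makes the averaged result repeatable even though the individual runs differ from each other.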
Parameter settings
For repeatability, parameter names are the same as in the original publications. Please refer to the original publications for a more detailed description. Some parameters have not been described or named in the publications; we therefore tried to give them names that are as meaningful as possible. Most of these undocumented parameters are included because we use the original implementations provided by the authors of SUBCLU, FIRES, INSCY and MINECLUS. For example, FIRES has its main parameters (K, MINCLU and MU), while its pre- and post-processing parameters are only of minor interest. We have optimized all parameters for each algorithm on each data set and list them on this website for repeatability.
Resources
Note: Java 1.6 is required in order to run OpenSubspace.
Citation Information
If you publish material based on databases, algorithms,
parameter settings or evaluation measures obtained from this
repository, then, in your acknowledgments, please note the assistance
you received by using this repository. This will help others to obtain
the same data sets, algorithms, parameter settings and evaluation
measures and replicate your experiments. We suggest the
following reference format for referring to this project:
Müller E., Günnemann S., Assent I., Seidl T.:
Evaluating Clustering in Subspace Projections of High Dimensional Data
http://dme.rwth-aachen.de/OpenSubspace/evaluation/
In Proc. 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France. (2009)
Parameter settings for repeatability of our experiments:
1. Synthetic Data:
Cell-Based Paradigm
CLIQUE
| Parameter | From | Offset | Op | Steps | To |
| TAU | 0.001 | 10 | * | 3 | 0.1 |
| XI | 5 | 5 | + | 6 | 30 |
Total number of experiments: 18
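The columns of the table above appear to describe a parameter grid: starting at From, the Offset is applied repeatedly with the operator Op until Steps values have been produced, the last of which is To. The total number of experiments is then the product of the Steps of all parameters. This reading is an assumption inferred from the listed values and totals, not a documented OpenSubspace interface; `OPS` and `expand` are hypothetical helper names. A minimal sketch for the CLIQUE settings:

```python
import math
import operator

# Map the "Op" column to a Python operator (assumed meaning of the column).
OPS = {"+": operator.add, "*": operator.mul}


def expand(start, offset, op, steps):
    """Return `steps` values: `start`, then `op` applied repeatedly with `offset`."""
    values = [start]
    for _ in range(steps - 1):
        values.append(OPS[op](values[-1], offset))
    return values


# CLIQUE rows copied from the table: (From, Offset, Op, Steps, To)
clique = {
    "TAU": (0.001, 10, "*", 3, 0.1),
    "XI": (5, 5, "+", 6, 30),
}

# Expand every parameter into its list of tested values.
grid = {
    name: expand(start, offset, op, steps)
    for name, (start, offset, op, steps, _to) in clique.items()
}

# One experiment per combination of parameter values.
total = math.prod(len(values) for values in grid.values())
```

Under this reading, TAU takes three values from 0.001 up to 0.1 and XI takes the six values 5, 10, 15, 20, 25, 30, giving 3 × 6 = 18 combinations, which matches the total listed for CLIQUE.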
DOC
| Parameter | From | Offset | Op | Steps | To |
| ALPHA | 0.001 | 10 | * | 3 | 0.1 |
| BETA | 0.1 | 0.1 | + | 4 | 0.4 |
| MAXITER | 1024 | 0 | + | 1 | 1024 |
| k | 2 | 2 | * | 6 | 64 |
| w | 50 | 2 | * | 3 | 200 |
Total number of experiments: 216
MINECLUS
| Parameter | From | Offset | Op | Steps | To |
| ALPHA | 0.001 | 10 | * | 3 | 0.1 |
| BETA | 0.1 | 0.1 | + | 4 | 0.4 |
| MAXOUT | -1 | 0 | + | 1 | -1.0 |
| k | 2 | 2 | * | 6 | 64 |
| numBins | 1 | 0 | + | 1 | 1 |
| w | 50 | 2 | * | 3 | 200 |
Total number of experiments: 216
SCHISM
| Parameter | From | Offset | Op | Steps | To |
| TAU | 1.0E-12 | 1000 | * | 5 | 1.0 |
| XI | 5 | 5 | + | 6 | 30 |
| u | 0.05 | 0 | + | 1 | 0.05 |
Total number of experiments: 30
Density-Based Paradigm
SUBCLU
| Parameter | From | Offset | Op | Steps | To |
| epsilon | 10 | 2 | * | 6 | 320 |
| minPoints | 2 | 2 | * | 5 | 32 |
Total number of experiments: 30
FIRES
| Parameter | From | Offset | Op | Steps | To |
| BASE_DBSCAN_EPSILON | 1.0 | 0 | + | 1 | 1.0 |
| BASE_DBSCAN_MINPTS | 6 | 0 | + | 1 | 6 |
| GRAPH_K | 1 | 3 | + | 4 | 10 |
| GRAPH_MINCLU | 1 | 1 | + | 4 | 4 |
| GRAPH_MU | 1 | 3 | + | 4 | 10 |
| GRAPH_SPLIT | 0.66 | 0 | + | 1 | 0.66 |
| POST_DBSCAN_EPSILON | 3.0 | 0 | + | 1 | 3.0 |
| POST_DBSCAN_MINPTS | 6 | 0 | + | 1 | 6 |
| PRE_MINIMUMPERCENT | 25 | 0 | + | 1 | 25 |
Total number of experiments: 64
INSCY
| Parameter | From | Offset | Op | Steps | To |
| density | 10 | 0 | + | 1 | 10 |
| epsilon | 10 | 2 | * | 5 | 160 |
| gridSize | 50 | 0 | + | 1 | 50 |
| maximalClusterRate | 0.0 | 0 | + | 1 | 0.0 |
| minPoints | 2 | 2 | * | 5 | 32 |
| minSize | 20 | 2 | * | 5 | 320 |
| usingKernel | 1 | 0 | + | 1 | 1 |
Total number of experiments: 125
Clustering Oriented Paradigm
PROCLUS
| Parameter | From | Offset | Op | Steps | To |
| avgerageDimensions | 1 | 2 | + | 38 | 75 |
| numberOfClusters | 2 | 2 | * | 6 | 64 |
Total number of experiments: 228
P3C
| Parameter | From | Offset | Op | Steps | To |
| alpha | 0.001 | 0 | + | 1 | 0.001 |
| possion | 10 | 10 | + | 10 | 100 |
Total number of experiments: 10
STATPC
| Parameter | From | Offset | Op | Steps | To |
| alpha_0 | 1.0E-20 | 10000 | * | 6 | 1 |
| alpha_h | 1.0E-20 | 10000 | * | 6 | 1 |
| alpha_k | 1.0E-20 | 10000 | * | 6 | 1 |
Total number of experiments: 216
2. Real World Data:
Cell-Based Paradigm
CLIQUE
| Parameter | From | Offset | Op | Steps | To |
| TAU | 0.001 | 10 | * | 3 | 0.1 |
| XI | 5 | 5 | + | 6 | 30 |
Total number of experiments: 18
DOC
| Parameter | From | Offset | Op | Steps | To |
| ALPHA | 0.001 | 10 | * | 3 | 0.1 |
| BETA | 0.1 | 0.1 | + | 4 | 0.4 |
| MAXITER | 1024 | 0 | + | 1 | 1024 |
| k | 2 | 2 | * | 6 | 64 |
| w | 5 | 2 | * | 3 | 200 |
Total number of experiments: 216
MINECLUS
| Parameter | From | Offset | Op | Steps | To |
| ALPHA | 0.001 | 10 | * | 3 | 0.1 |
| BETA | 0.1 | 0.1 | + | 4 | 0.4 |
| MAXOUT | -1 | 0 | + | 1 | -1.0 |
| k | 2 | 2 | * | 6 | 64 |
| numBins | 1 | 0 | + | 1 | 1 |
| w | 5 | 2 | * | 3 | 200 |
Total number of experiments: 216
SCHISM
| Parameter | From | Offset | Op | Steps | To |
| TAU | 1.0E-12 | 1000 | * | 5 | 1.0 |
| XI | 5 | 5 | + | 6 | 30 |
| u | 0.05 | 0 | + | 1 | 0.05 |
Total number of experiments: 30
Density-Based Paradigm
SUBCLU
| Parameter | From | Offset | Op | Steps | To |
| epsilon | 1 | 1.6 | * | 9 | 42.9 |
| minPoints | 2 | 2 | * | 6 | 64 |
Total number of experiments: 54
FIRES
| Parameter | From | Offset | Op | Steps | To |
| BASE_DBSCAN_EPSILON | 0.4 | 0 | + | 1 | 0.4 |
| BASE_DBSCAN_MINPTS | 6 | 0 | + | 1 | 6 |
| GRAPH_K | 3 | 1 | + | 8 | 10 |
| GRAPH_MINCLU | 1 | 1 | + | 4 | 4 |
| GRAPH_MU | 1 | 1 | + | 10 | 10 |
| GRAPH_SPLIT | 0.66 | 0 | + | 1 | 0.66 |
| POST_DBSCAN_EPSILON | 2 | 0 | + | 1 | 2 |
| POST_DBSCAN_MINPTS | 6 | 0 | + | 1 | 6 |
| PRE_MINIMUMPERCENT | 25 | 0 | + | 1 | 25 |
Total number of experiments: 320
INSCY
| Parameter | From | Offset | Op | Steps | To |
| density | 10 | 0 | + | 1 | 10 |
| epsilon | 1 | 1.6 | * | 9 | 42.9 |
| gridSize | 10 | 0 | + | 1 | 10 |
| maximalClusterRate | 0.0 | 0 | + | 1 | 0.0 |
| minPoints | 2 | 2 | * | 6 | 64 |
| minSize | 2 | 2 | * | 8 | 256 |
| usingKernel | 1 | 0 | + | 1 | 1 |
Total number of experiments: 432
Clustering Oriented Paradigm
PROCLUS
| Parameter | From | Offset | Op | Steps | To |
| avgerageDimensions | 2 | 2 | + | 16 | 32 |
| numberOfClusters | 2 | 4 | + | 14 | 54 |
Total number of experiments: 224
P3C
| Parameter | From | Offset | Op | Steps | To |
| alpha | 0.001 | 0 | + | 1 | 0.001 |
| possion | 10 | 10 | + | 10 | 100 |
Total number of experiments: 10
STATPC
| Parameter | From | Offset | Op | Steps | To |
| alpha_0 | 1.0E-20 | 10000 | * | 6 | 1 |
| alpha_h | 1.0E-20 | 10000 | * | 6 | 1 |
| alpha_k | 1.0E-20 | 10000 | * | 6 | 1 |
Total number of experiments: 216