Relevant Subspace Clustering:
Mining the Most Interesting Non-Redundant Concepts in High Dimensional
Data
In. Proc. IEEE International Conference on Data
Mining (ICDM 2009), Miami, USA (2009)
Supplementary material concerning repeatability
On this website we provide supplementary matirial enabling
repeatability of our experiments. Please refer to our paper
for further
details about our novel clustering model and the underlying algorithm.
We focus on this website only on the used parameter
settings and data
sets for
our evaluation which is based on the implementation in the OpenSubspace
framework.
Using the WEKA framework for repeatability
For an easy repeatability we
integrated all algorithms for
clustering in subspace projections of high dimensional data into the
popular
WEKA framework. We therefore extended the framework to subspace
clustering. A short description of this extension and how to
use it can be found on our OpenSubspace
website.
It includes a video tutorial giving a short introduction. Using our
framework one can perform all evaluation measurements presented in our
paper. Furthermore, one can interactively explore the clustering
results.
Repeatability results
We expect repeatability results to be reasonably similar. There will be made two observations,
one by us as the authors of this paper and one by other
researchers
repeating our experiments.
These two observations should be similar; however they also might
slightly varying from each other. For
instance, there
is no way to ensure that our hardware used in the experiments is
available
to other researchers repeating the evaluation. Therefore, we
do not expect measured execution times to match those reported in the
paper, but curve tendencies should be the same. When measuring
other things than running time, such as e.g. result sizes in
experiments with no randomized component, we do expect to obtain the
results presented in the paper. Some of the approaches,
however,
include randomized components and thus will only show in the average
case similar results, single runs might vary. Randomized components are
e.g. random initialization of cluster centers in the PROCLUS algorithm.
Parameter settings
For repeatability, names of parameters are as in the original
publications. Please refer to the original publications for a more
detailed description. Some parameters have not been described or named
in the publications; therefore we tried to give them as meaningful
names as possible. Most of these not further described parameters are
included as we use the original implementations provided by the authors
of INSCY and FIRES. For example, FIRES has its main parameters (K,
MINCLU and MU) while pre- and post-processing parameters are only of
minor interest. For our relevant subspace clustering approach, we have given additionally to the
main model parameters DELTA and BETA further parameters due to the
cluster instantiation for density-based clustering and algorithmic
parameters for grid size, merge operations on grids and pruning in our
ranking. We have optimized all parameters for each algorithm on each
data set and listed them for repeatability on this website.
Citation Information
If you publish results based on our material, then please include a reference to our paper. This will help others to obtain
the same data sets, algorithms, parameter settings and evaluation
measures and replicate your experiments. We suggest the
following reference format for referring to this paper:
Müller
E., Assent
I., Günnemann S., Krieger R., Seidl T.:
Relevant Subspace Clustering: Mining the Most Interesting Non-Redundant
Concepts in High Dimensional Data
In Proc. IEEE International Conference on Data Mining (ICDM
2009), Miami, USA.
(2009)
Parameter Settings
for repeatability of our experiments:

Figure 5 (a): Result size w.r.t. dimensionality of the database

Figure 5 (b): Runtime w.r.t. dimensionality of the database
SUBCLU
| |
dim 5 |
dim 10 |
dim 15 |
dim 20 |
dim 25 |
| epsilon |
12 |
21 |
28 |
X |
X |
| minPoints |
7 |
7 |
7 |
X |
X |
CLIQUE
| |
dim 5 |
dim 10 |
dim 15 |
dim 20 |
dim 25 |
| TAU |
0.09 |
0.083 |
0.083 |
X |
X |
| XI |
10 |
5 |
6 |
X |
X |
INSCY
| |
dim 5 |
dim 10 |
dim 15 |
dim 20 |
dim 25 |
| density |
500 |
500 |
500 |
500 |
500 |
| epsilon |
20 |
32 |
39 |
42 |
45 |
| gridSize |
100 |
100 |
100 |
100 |
100 |
| maximalClusterRate |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
| minPoints |
6 |
6 |
6 |
6 |
6 |
| minSize |
130 |
130 |
130 |
130 |
130 |
| usingKernel |
1 |
1 |
1 |
1 |
1 |
STATPC
| |
dim 5 |
dim 10 |
dim 15 |
dim 20 |
dim 25 |
| alpha 0 |
1.0E-10 |
1.0E-10 |
1.0E-10 |
1.0E-10 |
1.0E-10 |
| alpha h |
1.0E-15 |
1.0E-12 |
1.0E-10 |
0.001 |
0.001 |
| alpha k |
1.0E-30 |
5.0E-20 |
5.0E-18 |
1.0E-17 |
1.0E-16 |
SCHISM
| |
dim 5 |
dim 10 |
dim 15 |
dim 20 |
dim 25 |
| TAU |
0.0145 |
0.0145 |
0.0145 |
0.0145 |
0.0145 |
| XI |
10 |
10 |
10 |
10 |
10 |
| u |
0.045 |
0.045 |
0.05 |
0.065 |
0.065 |
RESCU
| |
dim 5 |
dim 10 |
dim 15 |
dim 20 |
dim 25 |
| BETA |
1 |
1 |
1 |
1 |
1 |
| DELTA |
900 |
10000 |
10000 |
139000 |
123684 |
| density_epsilon |
6.4 |
6.4 |
6.4 |
6.4 |
6.4 |
| density_minPoints |
6 |
6 |
6 |
6 |
6 |
| gridSize |
100 |
100 |
100 |
100 |
100 |
| merge |
50 |
50 |
50 |
50 |
50 |
| prune_1stFilter |
50 |
50 |
50 |
50 |
50 |
| prune_2ndFilter |
100 |
100 |
100 |
100 |
100 |
FIRES
| |
dim 5 |
dim 10 |
dim 15 |
dim 20 |
dim 25 |
| BASE_DBSCAN_EPSILON |
1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
| BASE_DBSCAN_MINPTS |
6 |
6 |
6 |
6 |
6 |
| GRAPH_K |
5 |
6 |
6 |
10 |
10 |
| GRAPH_MINCLU |
1 |
1 |
1 |
1 |
1 |
| GRAPH_MU |
4 |
5 |
5 |
9 |
9 |
| GRAPH_SPLIT |
0.66 |
0.66 |
0.66 |
0.66 |
0.66 |
| POST_DBSCAN_EPSILON |
3.0 |
3.0 |
3.0 |
3.0 |
3.0 |
| POST_DBSCAN_MINPTS |
6 |
6 |
6 |
6 |
6 |
| PRE_MINIMUMPERCENT |
25 |
25 |
25 |
25 |
25 |
P3C
| |
dim 5 |
dim 10 |
dim 15 |
dim 20 |
dim 25 |
| alpha |
0.001 |
0.001 |
0.001 |
0.001 |
0.001 |
| possion |
20 |
40 |
70 |
80 |
90 |
PROCLUS
| |
dim 5 |
dim 10 |
dim 15 |
dim 20 |
dim 25 |
| avgerageDimensions |
3 |
7 |
10 |
12 |
14 |
| numberOfClusters |
10 |
10 |
10 |
10 |
10 |

Figure 5 (c): Runtime w.r.t. size of the database
INSCY
| |
size 1595 |
size 3722 |
size 5848 |
size 7975 |
size 10102 |
| density |
500 |
500 |
500 |
500 |
500 |
| epsilon |
42 |
42 |
42 |
42 |
42 |
| gridSize |
100 |
100 |
100 |
100 |
100 |
| maximalClusterRate |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
| minPoints |
6 |
6 |
6 |
6 |
6 |
| minSize |
130 |
130 |
130 |
130 |
130 |
| usingKernel |
1 |
1 |
1 |
1 |
1 |
STATPC
| |
size 1595 |
size 3722 |
size 5848 |
size 7975 |
size 10102 |
| alpha 0 |
1.0E-10 |
1.0E-10 |
1.0E-10 |
1.0E-10 |
1.0E-10 |
| alpha h |
0.001 |
0.001 |
0.001 |
0.001 |
0.001 |
| alpha k |
1.0E-17 |
5.0E-17 |
5.0E-17 |
1.0E-17 |
1.0E-17 |
SCHISM
| |
size 1595 |
size 3722 |
size 5848 |
size 7975 |
size 10102 |
| TAU |
0.0045 |
0.0045 |
0.0045 |
0.0045 |
0.0045 |
| XI |
10 |
10 |
10 |
10 |
10 |
| u |
0.065 |
0.065 |
0.065 |
0.065 |
0.065 |
P3C
| |
size 1595 |
size 3722 |
size 5848 |
size 7975 |
size 10102 |
| alpha |
0.001 |
0.001 |
0.001 |
0.001 |
X |
| possion |
80 |
120 |
160 |
200 |
X |
FIRES
| |
size 1595 |
size 3722 |
size 5848 |
size 7975 |
size 10102 |
| BASE_DBSCAN_EPSILON |
1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
| BASE_DBSCAN_MINPTS |
6 |
6 |
6 |
6 |
6 |
| GRAPH_K |
10 |
10 |
10 |
10 |
10 |
| GRAPH_MINCLU |
1 |
1 |
1 |
1 |
1 |
| GRAPH_MU |
9 |
9 |
9 |
9 |
9 |
| GRAPH_SPLIT |
0.66 |
0.66 |
0.66 |
0.66 |
0.66 |
| POST_DBSCAN_EPSILON |
3.0 |
3.0 |
3.0 |
3.0 |
3.0 |
| POST_DBSCAN_MINPTS |
6 |
6 |
6 |
6 |
6 |
| PRE_MINIMUMPERCENT |
25 |
25 |
25 |
25 |
25 |
RESCU
| |
size 1595 |
size 3722 |
size 5848 |
size 7975 |
size 10102 |
| BETA |
1 |
1 |
1 |
1 |
1 |
| DELTA |
139000 |
139000 |
139000 |
139000 |
139000 |
| density_epsilon |
6.4 |
6.4 |
6.4 |
6.4 |
6.4 |
| density_minPoints |
6 |
6 |
6 |
6 |
6 |
| gridSize |
100 |
100 |
100 |
100 |
100 |
| merge |
50 |
50 |
50 |
50 |
50 |
| prune_1stFilter |
50 |
50 |
50 |
50 |
50 |
| prune_2ndFilter |
100 |
100 |
100 |
100 |
100 |
PROCLUS
| |
size 1595 |
size 3722 |
size 5848 |
size 7975 |
size 10102 |
| avgerageDimensions |
12 |
12 |
12 |
12 |
12 |
| numberOfClusters |
10 |
10 |
10 |
10 |
10 |

Figure 6. Quality on pendigits data (7494 objects; 16-48 dimensions)
RESCU
| |
dim 16 |
dim 32 |
dim 48 |
| BETA |
5 |
5 |
10 |
| DELTA |
284 |
2000 |
600000 |
| density_epsilon |
3.4 |
3.1 |
3.4 |
| density_minPoints |
7 |
7 |
7 |
| gridSize |
10 |
10 |
10 |
| merge |
25 |
25 |
25 |
| prune_1stFilter |
50 |
50 |
50 |
| prune_2ndFilter |
100 |
100 |
100 |
INSCY
| |
dim 16 |
dim 32 |
dim 48 |
| density |
500 |
X |
X |
| epsilon |
4.0 |
X |
X |
| gridSize |
10 |
X |
X |
| maximalClusterRate |
0.0 |
X |
X |
| minPoints |
8 |
X |
X |
| minSize |
200 |
X |
X |
| usingKernel |
1 |
X |
X |
FIRES
| |
dim 16 |
dim 32 |
dim 48 |
| BASE_DBSCAN_EPSILON |
0.12 |
0.12 |
0.12 |
| BASE_DBSCAN_MINPTS |
8 |
8 |
8 |
| GRAPH_K |
15 |
25 |
25 |
| GRAPH_MINCLU |
1 |
1 |
1 |
| GRAPH_MU |
14 |
24 |
24 |
| GRAPH_SPLIT |
0.66 |
0.66 |
0.66 |
| POST_DBSCAN_EPSILON |
0.01 |
0.004 |
0.002 |
| POST_DBSCAN_MINPTS |
8 |
8 |
8 |
| PRE_MINIMUMPERCENT |
25 |
25 |
25 |
SCHISM
| |
dim 16 |
dim 32 |
dim 48 |
| TAU |
0.02 |
0.0075 |
0.15 |
| XI |
10 |
23 |
26 |
| u |
0.15 |
0.1 |
0.1 |
PROCLUS
| |
dim 16 |
dim 32 |
dim 48 |
| averageDimensions |
8 |
15 |
28 |
| numberOfClusters |
19 |
15 |
23 |
P3C
| |
dim 16 |
dim 32 |
dim 48 |
| alpha |
0.001 |
X |
X |
| possion |
50 |
X |
X |
STATPC
| |
dim 16 |
dim 32 |
dim 48 |
| alpha 0 |
1.0E-10 |
1.0E-10 |
1.0E-10 |
| alpha h |
0.001 |
0.001 |
0.001 |
| alpha k |
0.001 |
0.001 |
0.001 |

Figure 7: F1 and accuracy on real data
[Captions: data set (size; dimensionality)]
RESCU
| |
glass |
vowel |
diabetes |
shape |
liver |
breast |
| BETA |
5 |
5 |
7 |
10 |
10 |
10 |
| DELTA |
1000 |
100 |
800 |
100 |
500 |
100000 |
| density_epsilon |
3.2 |
5.0 |
10.0 |
300.0 |
6.0 |
5.0 |
| density_minPoints |
3 |
3 |
3 |
3 |
6 |
3 |
| gridSize |
10 |
10 |
10 |
600 |
10 |
10 |
| merge |
3
|
3 |
3 |
5 |
5 |
3 |
| prune_1stFilter |
3 |
10 |
4 |
5 |
5 |
5 |
| prune_2ndFilter |
4 |
10 |
4 |
5 |
5 |
15 |
INSCY
| |
glass |
vowel |
diabetes |
shape |
liver |
breast |
| density |
10 |
10 |
5 |
1 |
10 |
10 |
| epsilon |
3.0 |
8.0 |
3.0 |
100.0 |
2.5 |
4.0 |
| gridSize |
10 |
20 |
10 |
200 |
10 |
10 |
| maximalClusterRate |
0.1 |
0.05 |
0.0 |
0.05 |
0.0 |
0.0 |
| minPoints |
3 |
3 |
4 |
3 |
5 |
2 |
| minSize |
2 |
15 |
18 |
5 |
5 |
20 |
| usingKernel |
1 |
1 |
1 |
1 |
1 |
1 |
FIRES
| |
glass |
vowel |
diabetes |
shape |
liver |
breast |
| BASE_DBSCAN_EPSILON |
0.4 |
0.12 |
0.02 |
25 |
0.12 |
0.2 |
| BASE_DBSCAN_MINPTS |
5 |
8 |
8 |
4 |
8 |
6 |
| GRAPH_K |
10 |
3 |
5 |
5 |
4 |
5 |
| GRAPH_MINCLU |
1 |
1 |
1 |
1 |
1 |
1 |
| GRAPH_MU |
9 |
2 |
4 |
4 |
3 |
4 |
| GRAPH_SPLIT |
0.66 |
0.66 |
0.66 |
0.66 |
0.66 |
0.66 |
| POST_DBSCAN_EPSILON |
0.2 |
2.0 |
0.5 |
20.0 |
1.0 |
2.0 |
| POST_DBSCAN_MINPTS |
5 |
8 |
8 |
4 |
8 |
6 |
| PRE_MINIMUMPERCENT |
25 |
25 |
25 |
25 |
25 |
25 |
SCHISM
| |
glass |
vowel |
diabetes |
shape |
liver |
breast |
| TAU |
0.05 |
0.01 |
0.145 |
0.1 |
0.05 |
0.1 |
| XI |
10 |
10 |
7 |
15 |
10 |
6 |
| u |
0.08 |
0.01 |
0.06 |
0.0025 |
0.1 |
0.2 |
PROCLUS
| |
glass |
vowel |
diabetes |
shape |
liver |
breast |
| averageDimensions |
5 |
9 |
6 |
12 |
3 |
11 |
| numberOfClusters |
16 |
29 |
6 |
17 |
10 |
8 |
P3C
| |
glass |
vowel |
diabetes |
shape |
liver |
breast |
| alpha |
0.001 |
0.001 |
0.001 |
0.0080 |
0.001 |
0.001 |
| possion |
20 |
20 |
0 |
6 |
20 |
10 |
STATPC
| |
glass |
vowel |
diabetes |
shape |
liver |
breast |
| alpha 0 |
0.01 |
0.1 |
1.0E-10 |
0.1 |
0.1 |
0.9 |
| alpha h |
0.01 |
0.01 |
0.001 |
0.01 |
0.1 |
0.9 |
| alpha k |
1.0E-5 |
0.1 |
0.1 |
0.1 |
1.0E-4 |
0.9 |

Figure 8: Approximation quality (
delta variation)
To calculate the optimal solution and compare it with the approximation
we provide an additional tool (
optimal.jar
and
runOptimal.bat).
The DELTA values used in our experiment are: 0, 10, 20, 30, 40, 50,
100, 150, and 200.

Figure 9: Effects on clustering by
beta variation