Relevant Subspace Clustering:
Mining the Most Interesting Non-Redundant Concepts in High Dimensional Data

 

In. Proc. IEEE International Conference on Data Mining (ICDM 2009), Miami, USA (2009)

Emmanuel Müller, Ira Assent, Stephan Günnemann, Ralph Krieger and Thomas Seidl

Supplementary material concerning repeatability

On this website we provide supplementary matirial enabling repeatability of our experiments. Please refer to our paper for further details about our novel clustering model and the underlying algorithm. We focus on this website only on the used parameter settings and data sets for our evaluation which is based on the implementation in the OpenSubspace framework.

Using the WEKA framework for repeatability  

For an easy repeatability we integrated all algorithms for clustering in subspace projections of high dimensional data into the popular WEKA framework. We therefore extended the framework to subspace clustering. A short description of this extension and how to use it can be found on our OpenSubspace website. It includes a video tutorial giving a short introduction. Using our framework one can perform all evaluation measurements presented in our paper. Furthermore, one can interactively explore the clustering results.

Repeatability results

We expect repeatability results to be reasonably similar. There will be made two observations, one by us as the authors of this paper and one by other researchers repeating our experiments. These two observations should be similar; however they also might slightly varying from each other. For instance, there is no way to ensure that our hardware used in the experiments is available to other researchers repeating the evaluation. Therefore, we do not expect measured execution times to match those reported in the paper, but curve tendencies should be the same. When measuring other things than running time, such as e.g. result sizes in experiments with no randomized component, we do expect to obtain the results presented in the paper. Some of the approaches, however, include randomized components and thus will only show in the average case similar results, single runs might vary. Randomized components are e.g. random initialization of cluster centers in the PROCLUS algorithm.

Parameter settings

For repeatability, names of parameters are as in the original publications. Please refer to the original publications for a more detailed description. Some parameters have not been described or named in the publications; therefore we tried to give them as meaningful names as possible. Most of these not further described parameters are included as we use the original implementations provided by the authors of INSCY and FIRES. For example, FIRES has its main parameters (K, MINCLU and MU) while pre- and post-processing parameters are only of minor interest. For our relevant subspace clustering approach, we have given additionally to the main model parameters DELTA and BETA further parameters due to the cluster instantiation for density-based clustering and algorithmic parameters for grid size, merge operations on grids and pruning in our ranking. We have optimized all parameters for each algorithm on each data set and listed them for repeatability on this website.

Citation Information

If you publish results based on our material, then please include a reference to our paper. This will help others to obtain the same data sets, algorithms, parameter settings and evaluation measures and replicate your experiments. We suggest the following reference format for referring to this paper:

Müller E., Assent I., Günnemann S., Krieger R., Seidl T.:
Relevant Subspace Clustering: Mining the Most Interesting Non-Redundant Concepts in High Dimensional Data

In Proc. IEEE International Conference on Data Mining (ICDM 2009), Miami, USA. (2009) 





Parameter Settings for repeatability of our experiments:

 



Figure 5 (a): Result size w.r.t. dimensionality of the database




Figure 5 (b): Runtime w.r.t. dimensionality of the database


SUBCLU

  dim 5 dim 10 dim 15 dim 20 dim 25
epsilon 12 21 28 X X
minPoints 7 7 7 X X


CLIQUE

  dim 5 dim 10 dim 15 dim 20 dim 25
TAU 0.09 0.083 0.083 X X
XI 10 5 6 X X


INSCY

  dim 5 dim 10 dim 15 dim 20 dim 25
density 500 500 500 500 500
epsilon 20 32 39 42 45
gridSize 100 100 100 100 100
maximalClusterRate 0.0 0.0 0.0 0.0 0.0
minPoints 6 6 6 6 6
minSize 130 130 130 130 130
usingKernel 1 1 1 1 1


STATPC

  dim 5 dim 10 dim 15 dim 20 dim 25
alpha 0 1.0E-10 1.0E-10 1.0E-10 1.0E-10 1.0E-10
alpha h 1.0E-15 1.0E-12 1.0E-10 0.001 0.001
alpha k  1.0E-30  5.0E-20  5.0E-18  1.0E-17  1.0E-16


SCHISM

  dim 5 dim 10 dim 15 dim 20 dim 25
TAU 0.0145 0.0145 0.0145 0.0145 0.0145
XI 10 10 10 10 10
u 0.045 0.045 0.05 0.065 0.065


RESCU

  dim 5 dim 10 dim 15 dim 20 dim 25
BETA 1 1 1 1 1
DELTA 900 10000 10000 139000 123684
density_epsilon 6.4 6.4 6.4 6.4 6.4
density_minPoints 6 6 6 6 6
gridSize 100 100 100 100 100
merge 50 50 50 50 50
prune_1stFilter 50 50 50 50 50
prune_2ndFilter 100 100 100 100 100


FIRES

  dim 5 dim 10 dim 15 dim 20 dim 25
BASE_DBSCAN_EPSILON 1.0 1.0 1.0 1.0 1.0
BASE_DBSCAN_MINPTS 6 6 6 6 6
GRAPH_K 5 6 6 10 10
GRAPH_MINCLU 1 1 1 1 1
GRAPH_MU 4 5 5 9 9
GRAPH_SPLIT 0.66 0.66 0.66 0.66 0.66
POST_DBSCAN_EPSILON 3.0 3.0 3.0 3.0 3.0
POST_DBSCAN_MINPTS 6 6 6 6 6
PRE_MINIMUMPERCENT 25 25 25 25 25


P3C

  dim 5 dim 10 dim 15 dim 20 dim 25
alpha 0.001 0.001 0.001 0.001 0.001
possion 20 40 70 80 90


PROCLUS

  dim 5 dim 10 dim 15 dim 20 dim 25
avgerageDimensions 3 7 10 12 14
numberOfClusters 10 10 10 10 10






Figure 5 (c): Runtime w.r.t. size of the database

INSCY

  size 1595 size 3722 size 5848 size 7975 size 10102
density 500 500 500 500 500
epsilon 42 42 42 42 42
gridSize 100 100 100 100 100
maximalClusterRate 0.0 0.0 0.0 0.0 0.0
minPoints 6 6 6 6 6
minSize 130 130 130 130 130
usingKernel 1 1 1 1 1


STATPC

  size 1595 size 3722 size 5848 size 7975 size 10102
alpha 0 1.0E-10 1.0E-10 1.0E-10 1.0E-10 1.0E-10
alpha h 0.001 0.001 0.001 0.001 0.001
alpha k  1.0E-17  5.0E-17  5.0E-17  1.0E-17  1.0E-17


SCHISM

  size 1595 size 3722 size 5848 size 7975 size 10102
TAU 0.0045 0.0045 0.0045 0.0045 0.0045
XI 10 10 10 10 10
u 0.065 0.065 0.065 0.065 0.065


P3C

  size 1595 size 3722 size 5848 size 7975 size 10102
alpha 0.001 0.001 0.001 0.001 X
possion 80 120 160 200 X


FIRES

  size 1595 size 3722 size 5848 size 7975 size 10102
BASE_DBSCAN_EPSILON 1.0 1.0 1.0 1.0 1.0
BASE_DBSCAN_MINPTS 6 6 6 6 6
GRAPH_K 10 10 10 10 10
GRAPH_MINCLU 1 1 1 1 1
GRAPH_MU 9 9 9 9 9
GRAPH_SPLIT 0.66 0.66 0.66 0.66 0.66
POST_DBSCAN_EPSILON 3.0 3.0 3.0 3.0 3.0
POST_DBSCAN_MINPTS 6 6 6 6 6
PRE_MINIMUMPERCENT 25 25 25 25 25


RESCU

  size 1595 size 3722 size 5848 size 7975 size 10102
BETA 1 1 1 1 1
DELTA 139000 139000 139000 139000 139000
density_epsilon 6.4 6.4 6.4 6.4 6.4
density_minPoints 6 6 6 6 6
gridSize 100 100 100 100 100
merge 50 50 50 50 50
prune_1stFilter 50 50 50 50 50
prune_2ndFilter 100 100 100 100 100


PROCLUS

  size 1595 size 3722 size 5848 size 7975 size 10102
avgerageDimensions 12 12 12 12 12
numberOfClusters 10 10 10 10 10







Figure 6. Quality on pendigits data (7494 objects; 16-48 dimensions)

RESCU

  dim 16 dim 32 dim 48
BETA 5 5 10
DELTA 284 2000 600000
density_epsilon 3.4 3.1 3.4
density_minPoints 7 7 7
gridSize 10 10 10
merge 25 25 25
prune_1stFilter 50 50 50
prune_2ndFilter 100 100 100


INSCY

  dim 16 dim 32 dim 48
density 500 X X
epsilon 4.0 X X
gridSize 10 X X
maximalClusterRate 0.0 X X
minPoints 8 X X
minSize 200 X X
usingKernel 1 X X


FIRES

  dim 16 dim 32 dim 48
BASE_DBSCAN_EPSILON 0.12 0.12 0.12
BASE_DBSCAN_MINPTS 8 8 8
GRAPH_K 15 25 25
GRAPH_MINCLU 1 1 1
GRAPH_MU 14 24 24
GRAPH_SPLIT 0.66 0.66 0.66
POST_DBSCAN_EPSILON 0.01 0.004 0.002
POST_DBSCAN_MINPTS 8 8 8
PRE_MINIMUMPERCENT 25 25 25


SCHISM

  dim 16 dim 32 dim 48
TAU 0.02 0.0075 0.15
XI 10 23 26
u 0.15 0.1 0.1


PROCLUS

  dim 16 dim 32 dim 48
averageDimensions 8 15 28
numberOfClusters 19 15 23


P3C

  dim 16 dim 32 dim 48
alpha 0.001 X X
possion 50 X X


STATPC

  dim 16 dim 32 dim 48
alpha 0 1.0E-10 1.0E-10 1.0E-10
alpha h 0.001 0.001 0.001
alpha k 0.001 0.001 0.001


 



Figure 7: F1 and accuracy on real data
[Captions: data set (size; dimensionality)]

RESCU

  glass vowel diabetes shape liver breast
BETA 5 5 7 10 10 10
DELTA 1000 100 800 100 500 100000
density_epsilon 3.2 5.0 10.0 300.0 6.0 5.0
density_minPoints 3 3 3 3 6 3
gridSize 10 10 10 600 10 10
merge 3
3 3 5 5 3
prune_1stFilter 3 10 4 5 5 5
prune_2ndFilter 4 10 4 5 5 15

INSCY

  glass vowel diabetes shape liver breast
density 10 10 5 1 10 10
epsilon 3.0 8.0 3.0 100.0 2.5 4.0
gridSize 10 20 10 200 10 10
maximalClusterRate 0.1 0.05 0.0 0.05 0.0 0.0
minPoints 3 3 4 3 5 2
minSize 2 15 18 5 5 20
usingKernel 1 1 1 1 1 1


FIRES

  glass vowel diabetes shape liver breast
BASE_DBSCAN_EPSILON 0.4 0.12 0.02 25 0.12 0.2
BASE_DBSCAN_MINPTS 5 8 8 4 8 6
GRAPH_K 10 3 5 5 4 5
GRAPH_MINCLU 1 1 1 1 1 1
GRAPH_MU 9 2 4 4 3 4
GRAPH_SPLIT 0.66 0.66 0.66 0.66 0.66 0.66
POST_DBSCAN_EPSILON 0.2 2.0 0.5 20.0 1.0 2.0
POST_DBSCAN_MINPTS 5 8 8 4 8 6
PRE_MINIMUMPERCENT 25 25 25 25 25 25


SCHISM

  glass vowel diabetes shape liver breast
TAU 0.05 0.01 0.145 0.1 0.05 0.1
XI 10 10 7 15 10 6
u 0.08 0.01 0.06 0.0025 0.1 0.2


PROCLUS

  glass vowel diabetes shape liver breast
averageDimensions 5 9 6 12 3 11
numberOfClusters 16 29 6 17 10 8


P3C

  glass vowel diabetes shape liver breast
alpha 0.001 0.001 0.001 0.0080 0.001 0.001
possion 20 20 0 6 20 10


STATPC

  glass vowel diabetes shape liver breast
alpha 0 0.01 0.1 1.0E-10 0.1 0.1 0.9
alpha h 0.01 0.01 0.001 0.01 0.1 0.9
alpha k 1.0E-5 0.1 0.1 0.1 1.0E-4 0.9





Figure 8: Approximation quality (delta variation)

To calculate the optimal solution and compare it with the approximation we provide an additional tool (optimal.jar and runOptimal.bat).

The DELTA values used in our experiment are: 0, 10, 20, 30, 40, 50, 100, 150, and 200.





Figure 9: Effects on clustering by beta variation

RESCU

BETA 0.02 0.05 0.07 0.08 0.1 0.15 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1 1.5 2 3 4 5 8 10
DELTA 31 32 32 33 33 35 37 41 45 50 56 62 68 84 141 235 659 1844 5163 113341 888590
density_epsilon 3.1
density_minPoints 7
gridSize 10
merge 25
prune_1stFilter 50
prune_2ndFilter 100