## Evaluating Clustering in Subspace Projections of High Dimensional Data

### In Proc. 35th International Conference on Very Large Data Bases (**VLDB 2009**), Lyon, France. (2009)

### Emmanuel Müller, Stephan Günnemann, Ira Assent and Thomas Seidl

**Supplementary material concerning repeatability**

This website provides supplementary material that enables repeatability of our experiments. Please refer to our paper for details on the clustering paradigms and evaluation measures we used. Here we focus only on the underlying implementation in the OpenSubspace framework and on the parameter settings used in our evaluation.

**Using the WEKA framework for repeatability**

To make the experiments easy to repeat, we integrated all algorithms for clustering in subspace projections of high dimensional data into the popular WEKA framework, which we extended to support subspace clustering. A short description of this extension and how to use it can be found on our OpenSubspace website, together with a video tutorial that gives a short introduction. With our framework, one can perform all evaluation measurements presented in our paper and interactively explore the clustering results. Our ongoing work focuses on this open framework, which we develop in contact with the WEKA development team.

**Repeatability expectations**

We have reasonable expectations regarding the similarity of the experiment results. Two observations will be made: one by us, as the authors of this paper, and one by other researchers repeating our experiments. These two observations should be similar, but they might vary slightly from each other. For instance, there is no way to ensure that the hardware used in our experiments is available to other researchers repeating the evaluation. Therefore, we do not expect measured execution times to match those reported in the paper, but rather to show roughly similar curve tendencies. For measurements other than running time, such as result sizes in experiments with no randomized component, we do expect the results presented in the paper to be obtained. Some of the approaches, however, include randomized components and will therefore show similar results only in the average case; single runs might vary. One example of a randomized component is the random initialization of cluster centers in the PROCLUS algorithm.
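The average-case comparison for randomized algorithms can be sketched as follows. This is a minimal illustration and not part of OpenSubspace: `run_once` is a hypothetical stand-in for one run of a randomized algorithm such as PROCLUS, and the quality value it returns is synthetic.

```python
import random
import statistics


def run_once(seed):
    """Hypothetical stand-in for one run of a randomized clustering algorithm.

    A real repetition would start the algorithm with this seed and return an
    evaluation measure; here a synthetic value in [0.7, 0.9] is drawn instead.
    """
    rng = random.Random(seed)
    return rng.uniform(0.7, 0.9)


def summarize(seeds):
    """Return mean and standard deviation of the measure over all seeds."""
    values = [run_once(seed) for seed in seeds]
    return statistics.mean(values), statistics.stdev(values)


mean_quality, spread = summarize(range(10))
print(f"mean = {mean_quality:.3f}, stdev = {spread:.3f}")
```

Fixing the seeds makes each single run repeatable, while the mean and spread over several seeds are the quantities one would compare against the averaged results in the paper.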

**Parameter settings**

For repeatability, parameter names are the same as in the original publications; please refer to those publications for a more detailed description. Some parameters are not described or named in the publications, so we tried to give them names that are as meaningful as possible. Most of these otherwise undescribed parameters are included because we use the original implementations provided by the authors of SUBCLU, FIRES, INSCY and MINECLUS. For example, FIRES has its main parameters (K, MINCLU and MU), while its pre- and post-processing parameters are only of minor interest. We optimized all parameters for each algorithm on each data set and list them on this website for repeatability.

**Resources**

Note: Java 1.6 is required in order to run OpenSubspace.

Executables and sources (including WEKA 3.5.8): OpenSubspace.zip

Data sets and cluster models: data.zip

Video tutorial: weka_subspaceclustering.avi

**Citation Information**

If you publish material based on databases, algorithms, parameter settings or evaluation measures obtained from this repository, then, in your acknowledgments, please note the assistance you received by using this repository. This will help others to obtain the same data sets, algorithms, parameter settings and evaluation measures and replicate your experiments. We suggest the following reference format for referring to this project:

Müller E., Günnemann S., Assent I., Seidl T.:

Evaluating Clustering in Subspace Projections of High Dimensional Data

http://dme.rwth-aachen.de/OpenSubspace/evaluation/

In Proc. 35th International Conference on Very Large Data Bases (**VLDB 2009**), Lyon, France. (2009)

## Parameter Settings for repeatability of our experiments:

**1. Synthetic Data:**

**Cell-Based Paradigm**

**CLIQUE**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
TAU | 0.001 | 10 | * | 3 | 0.1 |
XI | 5 | 5 | + | 6 | 30 |

Total number of experiments: 18
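Each row of the tables on this page appears to describe a one-dimensional sweep: start at From and apply Op with Offset repeatedly until Steps values have been produced, the last of which is To. This reading is inferred from the listed values and is not stated explicitly here. A minimal Python sketch (the function name `expand` is ours, not part of OpenSubspace) reproduces the CLIQUE grid above:

```python
from math import prod


def expand(start, offset, op, steps):
    """Return the `steps` values of one row: start, start op offset, ..."""
    values = [start]
    for _ in range(steps - 1):
        last = values[-1]
        values.append(last * offset if op == "*" else last + offset)
    return values


tau = expand(0.001, 10, "*", 3)  # roughly 0.001, 0.01, 0.1 (floating point)
xi = expand(5, 5, "+", 6)        # 5, 10, 15, 20, 25, 30

# One experiment per combination of values, so the counts multiply.
total = prod(len(values) for values in (tau, xi))
print(xi, total)
```

Under this reading, the total number of experiments for an algorithm is the product of the Steps column, which is 3 × 6 = 18 for CLIQUE.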

**DOC**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
ALPHA | 0.001 | 10 | * | 3 | 0.1 |
BETA | 0.1 | 0.1 | + | 4 | 0.4 |
MAXITER | 1024 | 0 | + | 1 | 1024 |
k | 2 | 2 | * | 6 | 64 |
w | 50 | 2 | * | 3 | 200 |

Total number of experiments: 216

**MINECLUS**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
ALPHA | 0.001 | 10 | * | 3 | 0.1 |
BETA | 0.1 | 0.1 | + | 4 | 0.4 |
MAXOUT | -1 | 0 | + | 1 | -1.0 |
k | 2 | 2 | * | 6 | 64 |
numBins | 1 | 0 | + | 1 | 1 |
w | 50 | 2 | * | 3 | 200 |

Total number of experiments: 216

**SCHISM**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
TAU | 1.0E-12 | 1000 | * | 5 | 1.0 |
XI | 5 | 5 | + | 6 | 30 |
u | 0.05 | 0 | + | 1 | 0.05 |

Total number of experiments: 30

**Density-Based Paradigm**

**SUBCLU**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
epsilon | 10 | 2 | * | 6 | 320 |
minPoints | 2 | 2 | * | 5 | 32 |

Total number of experiments: 30

**FIRES**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
BASE_DBSCAN_EPSILON | 1.0 | 0 | + | 1 | 1.0 |
BASE_DBSCAN_MINPTS | 6 | 0 | + | 1 | 6 |
GRAPH_K | 1 | 3 | + | 4 | 10 |
GRAPH_MINCLU | 1 | 1 | + | 4 | 4 |
GRAPH_MU | 1 | 3 | + | 4 | 10 |
GRAPH_SPLIT | 0.66 | 0 | + | 1 | 0.66 |
POST_DBSCAN_EPSILON | 3.0 | 0 | + | 1 | 3.0 |
POST_DBSCAN_MINPTS | 6 | 0 | + | 1 | 6 |
PRE_MINIMUMPERCENT | 25 | 0 | + | 1 | 25 |

Total number of experiments: 64
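Rows with Steps equal to 1 keep a parameter fixed, so only GRAPH_K, GRAPH_MINCLU and GRAPH_MU vary in the FIRES table above. Assuming an additive sweep of Steps values starting at From, the full set of configurations could be enumerated as in this sketch. The dictionary layout and helper names are illustrative and are not the OpenSubspace API.

```python
from itertools import product

# (From, Offset, Steps) for each additive row of the synthetic FIRES table.
ROWS = {
    "BASE_DBSCAN_EPSILON": (1.0, 0, 1),
    "BASE_DBSCAN_MINPTS": (6, 0, 1),
    "GRAPH_K": (1, 3, 4),
    "GRAPH_MINCLU": (1, 1, 4),
    "GRAPH_MU": (1, 3, 4),
    "GRAPH_SPLIT": (0.66, 0, 1),
    "POST_DBSCAN_EPSILON": (3.0, 0, 1),
    "POST_DBSCAN_MINPTS": (6, 0, 1),
    "PRE_MINIMUMPERCENT": (25, 0, 1),
}


def additive_values(start, offset, steps):
    """Values start, start + offset, ... with `steps` entries in total."""
    return [start + i * offset for i in range(steps)]


def configurations(rows):
    """Yield one dict per combination of parameter values."""
    names = list(rows)
    grids = [additive_values(*rows[name]) for name in names]
    for combo in product(*grids):
        yield dict(zip(names, combo))


configs = list(configurations(ROWS))
print(len(configs))
print(configs[0])
```

The fixed pre- and post-processing parameters each contribute a factor of 1, so the count is determined by the three graph parameters alone: 4 × 4 × 4 = 64.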

**INSCY**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
density | 10 | 0 | + | 1 | 10 |
epsilon | 10 | 2 | * | 5 | 160 |
gridSize | 50 | 0 | + | 1 | 50 |
maximalClusterRate | 0.0 | 0 | + | 1 | 0.0 |
minPoints | 2 | 2 | * | 5 | 32 |
minSize | 20 | 2 | * | 5 | 320 |
usingKernel | 1 | 0 | + | 1 | 1 |

Total number of experiments: 125

**Clustering Oriented Paradigm**

**PROCLUS**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
avgerageDimensions | 1 | 2 | + | 38 | 75 |
numberOfClusters | 2 | 2 | * | 6 | 64 |

Total number of experiments: 228

**P3C**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
alpha | 0.001 | 0 | + | 1 | 0.001 |
possion | 10 | 10 | + | 10 | 100 |

Total number of experiments: 10

**STATPC**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
alpha 0 | 1.0E-20 | 10000 | * | 6 | 1 |
alpha h | 1.0E-20 | 10000 | * | 6 | 1 |
alpha k | 1.0E-20 | 10000 | * | 6 | 1 |

Total number of experiments: 216

**2. Real World Data:**

**Cell-Based Paradigm**

**CLIQUE**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
TAU | 0.001 | 10 | * | 3 | 0.1 |
XI | 5 | 5 | + | 6 | 30 |

Total number of experiments: 18

**DOC**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
ALPHA | 0.001 | 10 | * | 3 | 0.1 |
BETA | 0.1 | 0.1 | + | 4 | 0.4 |
MAXITER | 1024 | 0 | + | 1 | 1024 |
k | 2 | 2 | * | 6 | 64 |
w | 5 | 2 | * | 3 | 200 |

Total number of experiments: 216

**MINECLUS**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
ALPHA | 0.001 | 10 | * | 3 | 0.1 |
BETA | 0.1 | 0.1 | + | 4 | 0.4 |
MAXOUT | -1 | 0 | + | 1 | -1.0 |
k | 2 | 2 | * | 6 | 64 |
numBins | 1 | 0 | + | 1 | 1 |
w | 5 | 2 | * | 3 | 200 |

Total number of experiments: 216
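Assuming each row is read as a sweep whose last of Steps values should equal To, the To column can be cross-checked against From, Offset, Op and Steps. In the sketch below, `last_value` is a hypothetical helper. For the `w` rows of the real world DOC and MINECLUS tables above, the recomputed last value is 20 rather than the listed 200, whereas a From of 50, as in the synthetic tables, would give 200. Which value was actually used cannot be determined from this page.

```python
import math


def last_value(start, offset, op, steps):
    """Apply `op` with `offset` (steps - 1) times and return the final value."""
    value = start
    for _ in range(steps - 1):
        value = value * offset if op == "*" else value + offset
    return value


# (name, From, Offset, Op, Steps, To) as listed in the real world tables.
rows = [
    ("ALPHA", 0.001, 10, "*", 3, 0.1),
    ("BETA", 0.1, 0.1, "+", 4, 0.4),
    ("k", 2, 2, "*", 6, 64),
    ("w", 5, 2, "*", 3, 200),
]

for name, start, offset, op, steps, listed in rows:
    computed = last_value(start, offset, op, steps)
    status = "ok" if math.isclose(computed, listed, rel_tol=1e-6) else "differs"
    print(f"{name}: computed {computed:g}, listed {listed:g} -> {status}")
```

The number of experiments is unaffected by this discrepancy, since it depends only on the Steps column.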

**SCHISM**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
TAU | 1.0E-12 | 1000 | * | 5 | 1.0 |
XI | 5 | 5 | + | 6 | 30 |
u | 0.05 | 0 | + | 1 | 0.05 |

Total number of experiments: 30

**Density-Based Paradigm**

**SUBCLU**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
epsilon | 1 | 1.6 | * | 9 | 42.9 |
minPoints | 2 | 2 | * | 6 | 64 |

Total number of experiments: 54

**FIRES**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
BASE_DBSCAN_EPSILON | 0.4 | 0 | + | 1 | 0.4 |
BASE_DBSCAN_MINPTS | 6 | 0 | + | 1 | 6 |
GRAPH_K | 3 | 1 | + | 8 | 10 |
GRAPH_MINCLU | 1 | 1 | + | 4 | 4 |
GRAPH_MU | 1 | 1 | + | 10 | 10 |
GRAPH_SPLIT | 0.66 | 0 | + | 1 | 0.66 |
POST_DBSCAN_EPSILON | 2 | 0 | + | 1 | 2 |
POST_DBSCAN_MINPTS | 6 | 0 | + | 1 | 6 |
PRE_MINIMUMPERCENT | 25 | 0 | + | 1 | 25 |

Total number of experiments: 320

**INSCY**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
density | 10 | 0 | + | 1 | 10 |
epsilon | 1 | 1.6 | * | 9 | 42.9 |
gridSize | 10 | 0 | + | 1 | 10 |
maximalClusterRate | 0.0 | 0 | + | 1 | 0.0 |
minPoints | 2 | 2 | * | 6 | 64 |
minSize | 2 | 2 | * | 8 | 256 |
usingKernel | 1 | 0 | + | 1 | 1 |

Total number of experiments: 432

**Clustering Oriented Paradigm**

**PROCLUS**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
avgerageDimensions | 2 | 2 | + | 16 | 32 |
numberOfClusters | 2 | 4 | + | 14 | 54 |

Total number of experiments: 224

**P3C**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
alpha | 0.001 | 0 | + | 1 | 0.001 |
possion | 10 | 10 | + | 10 | 100 |

Total number of experiments: 10

**STATPC**

Parameter | From | Offset | Op | Steps | To |
---|---|---|---|---|---|
alpha 0 | 1.0E-20 | 10000 | * | 6 | 1 |
alpha h | 1.0E-20 | 10000 | * | 6 | 1 |
alpha k | 1.0E-20 | 10000 | * | 6 | 1 |

Total number of experiments: 216