Subspace Clustering Meets Dense Subgraph Mining: A Synthesis of Two Paradigms
by Stephan Günnemann, Ines Färber, Brigitte Boden, Thomas Seidl
in Proc. IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia (2010)
On this page we offer the datasets and algorithms that were used for the experiments in our paper "Subspace Clustering Meets Dense Subgraph Mining: A Synthesis of Two Paradigms". Thus, repeatability and comparison are available for the data mining community. Please note the citation information.
The jar file expects two parameters: The name of the algorithm that shall be executed (possiblitities are gamer, copam, cocain, seqsubgraph and seqsubspace) and a config file which has the format of a Java Properties file. This config file contains the file name of the processed graph (the graph has to be represented in the graphml format, c.f. http://graphml.graphdrawing.org) and the values of the required parameters.
For example, for repeating our experiment on the gene data with GAMer, the command line would be:
java -jar ClusterDetection.jar gamer bio.properties
After processing, the algorithms will create a file containing the result. Here we present a tiny example that explains the contents of such a file:
In such a result file, every line corresponds to one identified cluster. The last line contains the processing time of the algorithm (in ms).
For every cluster, the first numbers correspond to the relevant dimensions of the cluster. In our example, the original dataset has three dimensions, so the first three numbers show the dimensions of the clusters, e.g. the first cluster lies only in the second dimension, indicated by the entry '1'. The first and the third dimension are not relevant for this cluster, indicated by the entry '0'.
The next number shows the number of nodes in the cluster, e.g. '5' for the first cluster.
The last numbers are the IDs of the nodes that the cluster contains. For the first cluster that would be the nodes with the IDs 3, 6, 7, 8 and 9.
Recommended parameter settings
GAMer is a flexible clustering approach, which can easily be adapted based on the users' needs. To allow an easy parameterization, we provide in the following recommended parameter settings that can act as a starting point for an in depth analysis of the data under consideration.
First, if no specific preferences about the clusters' qualities are given, we recommend the choice of a=b=c=1 for the quality function. In this case, all characteristics are equally important, which leads a sound synthesis of subspace clustering and dense subgraph mining.
Second, the redundancy between different clusters can be controlled by robj and rdim. For many applications we recommend to choose the parameters close to zero since it leads to an easy interpretation of the clustering result: two clusters are either disjoint w.r.t. their objects or w.r.t. to their dimensions. Thus, an object is not clustered twice within a single dimension.
Last, the user has to specify the characteristics each individual cluster has to fulfill. Actually, the parameters smin, γmin, and nmin do not really influence the characteristics of the clusters, but they simply control the number of valid clusters. Indeed, the overall clustering is quite robust if the threshold values are selected sufficiently small. Thus, if no further knowledge is given one could simply ignore the thresholds (at the cost of increased runtime since more clusters need to be analyzed). Often, however, some basic properties of the clusters are desired, e.g. clusters with two or three objects or clusters located in 1-dimensional subspaces are usually not meaningful. Furthermore, often only quasi-cliques with a density γ>=0.5 are considered interesting, as they are connected "tightly and relatively evenly" (cf. "Z. Zeng, J. Wang, L. Zhou, and G. Karypis: Coherent closed quasi-clique discovery from large dense graph databases. SIGKDD, 2006"). Thus, one should restrict the set of valid clusters a-priori by the three thresholds (e.g. γmin=0.5, nmin=4, smin=2) and far lower runtimes of the overall algorithm are achievable. The parameter w is easy to set since it simply controls the maximal extent of a cluster in the attribute space. Furthermore, heuristics to select the parameter w based on the given data are, e.g., discussed in "M. L. Yiu and N. Mamoulis: Iterative Projected Clustering by Subspace Mining. TKDE, 2005" and "C. M. Procopiuc, M. Jones, P. K. Agarwal and T. M. Murali: A Monte Carlo algorithm for fast projective clustering. SIGMOD, 2002".
Real world datasets
For every dataset a zip package with two files is available: The graphml file containing the vertex-labeled graph and the used config file with the parameters.
The used gene interaction network is available here (Gene.zip). The gene expressions that are used as node attributes are taken from http://thebiogrid.org. Gene interactions from http://genomebiology.com/2005/6/3/R22 were used as the edges of the graph. The resulting network contains 3548 nodes with 115 attributes and 8334 edges.
In our paper we used an extract of the Arxiv Database which was taken from http://www.cs.cornell.edu/projects/kddcup/datasets.html. The resulting graph file is available here (Arxiv.zip). In this graph, texts are represented as nodes and citations are represented as edges. The 300 attributes of the nodes correspond to 300 keywords such that each attribute value of a node shows how often the respective keyword appears in the text that corresponds to the node. We restricted the graph to the texts from the years 1999-2003. The used graph consists of 13003 nodes and 120213 edgegs.
The used patent information graph can be downloaded here (Patent.zip). The original data was taken from http://www.nber.org/patents. For our graph, we used the patent data from the years 1991-1995. Each patent is represented as a node in the graph. Citations between patents are represented as the edges of the graph. The graph contains contains 492007 nodes with 5 dimensions and 528333 edges.
We also used an extract of the DBLP database which represents the top eleven conferences. We constructed a co-author-graph were each node corresponds to an author and each edge corresponds to a co-authorship between two authors. The 11 attributes of a node represent the conferences which the author attended. The resulting graph data can be found here (DBLP.zip). The graph contains 2482 nodes and 7438 edges.
For evaluating the scalability and the clustering quality of the algorithms depending on certain properties of the data we generated synthetic datasets. For every experiment, the used graph files and the used config files are available in a zip file.
Varying the number of nodes
In this experiment we vary the number of nodes in the graph by varying the number of generated clusters. In every graph the attribute data are 20 dimensional. Every generated cluster consists of 15 nodes, has 5 relevant dimensions and a quasi-clique density of 0.6.
Varying the number of nodes per cluster
In this experiment we vary the number of nodes in the graph by varying the number of nodes per generated cluster. In every graph the attribute data are 20 dimensional. Every generated graph contains 80 clusters with 5 relevant dimensions and a quasi-clique density of 0.6.
Varying the dimensionality of the datasets
In this experiment we vary the dimensionality of the datasets. In every graph there are 80 generated clusters. Every cluster consists of 15 nodes, has 5 relevant dimensions and a quasi-clique density of 0.6.
Varying the number of noise edges
In this experiment we vary the number of noise edges, i.e. the number of edges that do not belong to any cluster. Every graph contains 100 generated clusters consisting of 10 nodes, 15 relevant dimensions and has a quasi-clique density of 0.6. The attribute data are 20 dimensional.
Varying the degree of overlap between the clusters
In this experiment we vary the degree of overlap between the clusters. This degree describes the maximal number of clusters a single node can belong to. In every graph the attribute are 20 dimensional. Each graph contains 80 clusters with 15 nodes, 5 relevant dimensions and has a quasi-clique density of 0.6.
Varying the minimum density
In this experiment we use a single graph (20 dimensional attribute data, 80 clusters with 10 nodes, 5 relevant dimensions and different densities from 0.5 to 0.8) and vary the parameter for the minimum density.
Evaluation of the clustering results
We evaluated our clustering results on synthetic data using the F1 value. This is a measure that compares the found clusters with the generated "true" clusters. As for real data no true clusters are given, we cannot use this measure with real data.
For each generated graph the true clusters are given in a file (xy.true) which as the same format as the result file above. The files for our synthetic data can be downloaded here: Truefiles.zip
For calculating the F1 value for a found clustering and a true clustering we provide another jar file (CalculateQuality.jar). This executable expects as parameters the names of the ".found" file and the respective ".true" file.
For example a command line would be:
java -jar CalculateQuality.jar ClusterDim5_gamer.found ClusterDim5.true
If you publish material based on databases, algorithms, parameter settings or evaluation measures obtained from this repository, then, in your acknowledgments, please note the assistance you received by using this repository. This will help others to obtain the same data sets, algorithms, parameter settings and evaluation measures and replicate your experiments. We suggest the following reference format for referring to this project:
S., Färber I., Boden B., Seidl T.:
Subspace Clustering Meets Dense Subgraph Mining: A Synthesis of Two Paradigms
In Proc. IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia. (2010)