Efficient Mining of Combined Subspace and Subgraph Clusters in Graphs with Feature Vectors

by Stephan Günnemann, Brigitte Boden, Ines Färber, Thomas Seidl

in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2013), Gold Coast, Australia

 

Complexity Proofs and Supplementary Material

 

On this page we provide the datasets and algorithms that were used for the experiments in our paper "Efficient Mining of Combined Subspace and Subgraph Clusters in Graphs with Feature Vectors", so that the data mining community can repeat our experiments and compare against our methods. Furthermore, we provide the detailed complexity proofs that were omitted from the paper due to space limitations. Please note the citation information at the bottom of this page.

 

 

Complexity proofs

 

The detailed complexity proofs that were omitted from our paper due to space limitations can be downloaded here: ComplexityProofs.pdf

 

 

Algorithms

 

Our executables are available as a jar file (download ClusterDetectionEdcar.jar). The usage and the expected file formats are explained in the README file: README.txt.

 

The jar file expects two parameters: the name of the algorithm to be executed (possible values are edcar, gamer, copam and cocain) and a config file in the format of a Java Properties file. This config file contains the file name of the input graph (the graph has to be given in the GraphML format, cf. http://graphml.graphdrawing.org) and the values of the required parameters.
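As an illustration, such a config file could look like the following sketch. The property keys shown here (graphFile, minClusterSize, minDensity) are only placeholders; the actual parameter names and their meaning are documented in README.txt.

# hypothetical config file (Java Properties format); key names are placeholders, see README.txt
# input graph in GraphML format
graphFile=gene.graphml
# minimum number of nodes per cluster (placeholder name)
minClusterSize=8
# minimum quasi-clique density of a cluster (placeholder name)
minDensity=0.5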

 

For example, for repeating our experiment on the gene data with EDCAR, the command line would be:

 

java -jar ClusterDetectionEdcar.jar edcar bio.properties

 

 

After processing, the algorithms create a file containing the result. Below we give a tiny example that illustrates the contents of such a file:
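As a sketch (assuming whitespace-separated integer values; the exact layout may differ), the line describing the first cluster discussed below would read:

1 0 0 5 0 1 2 3 4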

 

 

 

In such a result file, every line corresponds to one identified cluster. The last line contains the processing time of the algorithm (in ms). The algorithms EDCAR and GAMer also output the quality sum of the found clustering.

 

For every cluster, the first numbers encode the relevant dimensions of the cluster. In our example, the original dataset has three dimensions, so the first three numbers indicate for each dimension whether it is relevant for the cluster: the first cluster lies only in the first dimension, indicated by the entry '1'; the second and the third dimension are not relevant for this cluster, indicated by the entry '0'.

 

The next number shows the number of nodes in the cluster, e.g. '5' for the first cluster.

 

The last numbers are the IDs of the nodes that the cluster contains. For the first cluster that would be the nodes with the IDs 0, 1, 2, 3 and 4.
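To make the format concrete, the following minimal Java sketch parses one such cluster line. It is not part of the provided executables; it assumes whitespace-separated values and that the dataset dimensionality d is known (d = 3 in the example above).

import java.util.Arrays;

// Minimal sketch: parse one cluster line of a result file.
// Assumes whitespace-separated values and a known dataset dimensionality d.
public class ParseClusterLine {
    public static void main(String[] args) {
        int d = 3; // dimensionality of the dataset (3 in the example above)
        String line = "1 0 0 5 0 1 2 3 4";
        String[] tokens = line.trim().split("\\s+");

        // first d entries: 1 if the dimension is relevant for the cluster, 0 otherwise
        boolean[] relevantDims = new boolean[d];
        for (int i = 0; i < d; i++) {
            relevantDims[i] = tokens[i].equals("1");
        }

        // next entry: number of nodes in the cluster
        int numNodes = Integer.parseInt(tokens[d]);

        // remaining entries: IDs of the nodes contained in the cluster
        int[] nodeIds = new int[numNodes];
        for (int i = 0; i < numNodes; i++) {
            nodeIds[i] = Integer.parseInt(tokens[d + 1 + i]);
        }

        System.out.println("relevant dimensions: " + Arrays.toString(relevantDims));
        System.out.println("cluster size: " + numNodes);
        System.out.println("node IDs: " + Arrays.toString(nodeIds));
    }
}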

 

Real world datasets

 

For every dataset a zip package with two files is available: The graphml file containing the vertex-labeled graph and the used config file with the parameters.
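To give an impression of the graph format, here is a minimal GraphML sketch of a vertex-labeled graph with a single numeric attribute. The attribute key name "attr0" is only illustrative; the attribute keys actually used in our files are described in README.txt.

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <!-- one numeric vertex attribute; the key name "attr0" is only illustrative -->
  <key id="attr0" for="node" attr.name="attr0" attr.type="double"/>
  <graph id="G" edgedefault="undirected">
    <node id="n0">
      <data key="attr0">0.7</data>
    </node>
    <node id="n1">
      <data key="attr0">0.5</data>
    </node>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>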

 

Gene Data

The used gene interaction network is available here (gene.zip). The gene interactions taken from http://thebiogrid.org were used as the edges of the graph, and the gene expression values from http://genomebiology.com/2005/6/3/R22 are used as node attributes. The resulting network contains 3548 nodes with 115 attributes and 8334 edges.

 

Arxiv Database

In our paper we used an extract of the arXiv database which was taken from http://www.cs.cornell.edu/projects/kddcup/datasets.html. The resulting graph file is available here (arxiv.zip). In this graph, documents are represented as nodes and citations are represented as edges. The 300 attributes of the nodes correspond to 300 keywords; each attribute value of a node indicates how often the respective keyword appears in the document that corresponds to the node. The used graph consists of 27769 nodes and 352284 edges.

 

Patent Information

The used patent information graph can be downloaded here (patent.zip). The original data was taken from http://www.nber.org/patents. For our graph, we used the patent data from the years 1991-1995. Each patent is represented as a node in the graph, and citations between patents are represented as the edges of the graph. The graph contains 492007 nodes with 5 attributes and 528333 edges.

 

DBLP

We also used an extract of the DBLP database. We constructed a co-author graph where each node corresponds to an author and each edge corresponds to a co-authorship between two authors. The 2695 attributes of a node represent the conferences at which the author has published. The resulting graph data can be found here (dblp.zip). The graph contains 133097 nodes and 631384 edges.

 

 

Synthetic datasets

 

 

To evaluate the scalability and the clustering quality of the algorithms depending on certain properties of the data, we generated synthetic datasets. For every experiment, the used graph files, the used config files and the hidden clusters (cf. Evaluation of the clustering results) are available in a zip file.

 

Varying the number of nodes

In one experiment we vary the number of nodes in the graph by varying the number of generated clusters. In every graph the attribute data are 20-dimensional. Every generated cluster consists of 15 nodes, has 5 relevant dimensions and a quasi-clique density of 0.6.

numberofclusters.zip

 

In a second experiment we vary the number of nodes in the graph by varying the number of nodes per generated cluster. Every dataset is 20-dimensional and consists of 80 clusters. Every cluster has 5 relevant dimensions and a quasi-clique density of 0.6.

clustersize.zip

 

Varying the number of edges

In one experiment we vary the number of edges in the graph by varying the quasi-clique density of the generated clusters, i.e. we insert more and more edges into the clusters. The datasets are 20-dimensional. In every graph there are 60 generated clusters with 5 relevant dimensions and 15 nodes each.

density.zip

 

Varying the dimensionality of the clusters

In this experiment we vary the dimensionality of the generated clusters. In every graph the attribute data are 50-dimensional and there are 80 generated clusters. Every cluster consists of 15 nodes and has a quasi-clique density of 0.6.

clusterdim.zip

 

Varying the dimensionality of the datasets

In this experiment we vary the dimensionality of the datasets. In every graph there are 80 generated clusters. Every cluster consists of 15 nodes, has 5 relevant dimensions and a quasi-clique density of 0.6.

dimensionality10to400.zip  and dimensionality500to1000.zip

 

 

Varying the parameter "MaxClusterPerIter"

In this experiment we vary EDCAR's parameter "MaxClusterPerIter", which determines how many paths in the set enumeration tree are generated in each iteration. This was tested on three different datasets taken from the first experiment (varying the number of clusters): one dataset with 10 clusters (ca. 250 nodes), one with 30 clusters (ca. 500 nodes) and one with 60 clusters (ca. 1000 nodes).

maxclusterperiter.zip

 

Evaluation of the clustering results

 

We evaluated our clustering results on synthetic data using the F1 value, a measure that compares the found clusters with the generated "true" clusters. Since no true clusters are given for the real-world data, this measure cannot be applied there.
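As a rough sketch (the definition used in the paper may differ in detail, e.g. in how clusters are represented and how the values are aggregated), a common way to compute such an F1 value treats each cluster as a set of nodes and matches every hidden cluster H to its best-fitting found cluster F:

\mathrm{prec}(F,H) = \frac{|F \cap H|}{|F|}, \qquad
\mathrm{rec}(F,H) = \frac{|F \cap H|}{|H|}, \qquad
F_1(F,H) = \frac{2 \cdot \mathrm{prec}(F,H) \cdot \mathrm{rec}(F,H)}{\mathrm{prec}(F,H) + \mathrm{rec}(F,H)}

\mathrm{F1}(\mathcal{F},\mathcal{H}) = \frac{1}{|\mathcal{H}|} \sum_{H \in \mathcal{H}} \max_{F \in \mathcal{F}} F_1(F,H)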

 

For each generated graph the true clusters are given in a file (xy.true) which has the same format as the result file described above. The files for our synthetic data can be found in the respective .zip files.

 

For calculating the F1 value for a found clustering and a true clustering we provide another jar file (CalculateQuality.jar). This executable expects as parameters the names of the ".found" file and the respective ".true" file.

 

For example, a command line would be:

 

java -jar CalculateQuality.jar ClusterDim5_edcar.found ClusterDim5.true

 

 

Citation Information


If you publish material based on datasets, algorithms, parameter settings or evaluation measures obtained from this repository, please acknowledge in your publication the assistance you received by using this repository. This will help others to obtain the same datasets, algorithms, parameter settings and evaluation measures and to replicate your experiments. We suggest the following reference format for referring to this project:

 

Günnemann S., Boden B., Färber I., Seidl T.:
Efficient Mining of Combined Subspace and Subgraph Clusters in Graphs with Feature Vectors.
In Proc. of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2013), Gold Coast, Australia, 2013.
http://dme.rwth-aachen.de/edcar/