Spectral Subspace Clustering for Graphs with Feature Vectors
by Stephan Günnemann, Ines Färber, Sebastian Raubach, Thomas Seidl
in Proc. IEEE International Conference on Data Mining (ICDM), Dallas, TX, USA, 2013
On this page we provide the datasets and detailed evaluation results used for the experiments in our paper "Spectral Subspace Clustering for Graphs with Feature Vectors", in order to support repeatability and comparison within the data mining community.
Real world datasets
All datasets are available in data.zip, which contains one GraphML file for each vertex-labeled graph.
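As a starting point for working with the files, the following is a minimal sketch of reading a GraphML file with the Python standard library only. The internal layout of the files in data.zip is not documented on this page, so the attribute name `cluster` and the node ids in the sample are purely hypothetical; the sketch relies only on the standard GraphML namespace and the `key`/`node`/`edge`/`data` elements. Feature values are returned as strings, since their declared types are not known here.

```python
import io
import xml.etree.ElementTree as ET

# Standard GraphML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def read_graphml(source):
    """Read a GraphML file (path or file-like object).

    Returns (nodes, edges): nodes maps node id -> {attribute name: raw text},
    edges is a list of (source id, target id) pairs.
    """
    root = ET.parse(source).getroot()
    # Map key ids such as "d0" to their declared attribute names.
    key_names = {
        k.get("id"): k.get("attr.name", k.get("id"))
        for k in root.findall("g:key", NS)
    }
    nodes = {}
    for node in root.iterfind(".//g:node", NS):
        features = {}
        for data in node.findall("g:data", NS):
            key = data.get("key")
            features[key_names.get(key, key)] = data.text
        nodes[node.get("id")] = features
    edges = [
        (e.get("source"), e.get("target"))
        for e in root.iterfind(".//g:edge", NS)
    ]
    return nodes, edges


# Hypothetical two-node sample; the real files may use other keys and ids.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="cluster" attr.type="double"/>
  <graph id="G" edgedefault="undirected">
    <node id="a"><data key="d0">2</data></node>
    <node id="b"><data key="d0">0</data></node>
    <edge source="a" target="b"/>
  </graph>
</graphml>"""

nodes, edges = read_graphml(io.StringIO(SAMPLE))
```

Passing a file path instead of the in-memory sample works the same way; converting the raw text values to numbers is left to the caller.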
We used an extract of the DBLP database to construct a co-author graph where each node corresponds to an author and each edge corresponds to a co-authorship between two authors. The features consist of 20 keywords extracted from the titles of papers. They are "classification", "cluster", "graphic", "human", "knowledge", "learning", "logic", "machine", "motion", "pattern", "privacy", "query", "relational", "retrieval", "semantic", "subspace", "support", "surface", "3d", and "time". The resulting graph contains 774 nodes and 1757 edges.
We extracted the top 100 goal scorers of the German premier soccer league from the website weltfussball.de. Each node represents a player. Two players are connected if they played for the same soccer club (not necessarily at the same time). As features we chose "number of games", "number of goals", "number of penalty kicks", "average number of goals per game", and "number of soccer clubs".
In our paper we used an extract of the arXiv database, which was taken from http://www.cs.cornell.edu/projects/kddcup/datasets.html. In the resulting graph, texts are represented as nodes and citations are represented as edges. In the larger version of the dataset, the attributes represent 300 keywords, where the numerical value indicates how often the respective keyword appears in the text of the paper corresponding to the node. This graph consists of 11989 nodes and 119258 edges. In the smaller version of the dataset, we used the top 30 keywords; it contains 856 nodes and 2660 edges.
The gene expressions that are used as node attributes are taken from http://thebiogrid.org. Gene interactions from http://genomebiology.com/2005/6/3/R22 were used as the edges of the graph. The resulting network contains 2900 nodes with 115 attributes and 8264 edges.
We extracted data from IMDb. As nodes, we used movies produced in the USA, Canada, the UK, or Germany with at least 200 ratings and an average rating of at least 6.5. Two movies are connected if they share actors or if one references the other (e.g., spoofs or follow-ups). As features, we chose all 21 movie genres. We used the largest connected component, which contains 862 nodes and 4388 edges.
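The two preprocessing steps described for the movie graph (threshold filtering, then keeping the largest connected component) can be sketched as follows. This is an illustrative sketch and not the code used for the paper: the record layout `(num_ratings, avg_rating)`, the function names, and the toy movie ids are all assumptions made for this example.

```python
from collections import defaultdict, deque


def filter_movies(movies, min_ratings=200, min_average=6.5):
    """Keep movie ids whose (num_ratings, avg_rating) meet both thresholds."""
    return {
        movie_id
        for movie_id, (num_ratings, avg_rating) in movies.items()
        if num_ratings >= min_ratings and avg_rating >= min_average
    }


def largest_connected_component(nodes, edges):
    """Return the node set of the largest connected component (BFS)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy data: movie id -> (number of ratings, average rating).
movies = {
    "m1": (250, 7.1), "m2": (900, 6.5), "m3": (150, 8.0), "m4": (300, 6.9),
    "m5": (400, 7.4), "m6": (500, 7.0), "m7": (1000, 6.0),
}
links = [("m1", "m2"), ("m2", "m3"), ("m3", "m4"), ("m4", "m5"),
         ("m4", "m6"), ("m7", "m1"), ("m7", "m2")]

kept = filter_movies(movies)
# Only edges whose two endpoints both survive the filter remain in the graph.
kept_links = [(u, v) for u, v in links if u in kept and v in kept]
component = largest_connected_component(kept, kept_links)
```

In this toy input, m3 and m7 fail the thresholds, which splits the remaining movies into two components; only the larger one is retained.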
The original patent data was taken from http://www.nber.org/patents. For our graph, we used a subset of the patent data from the years 1991-1995. Each patent is represented as a node in the graph, and citations between patents are represented as edges. The graph contains 100000 nodes with 5-dimensional feature vectors and 188631 edges.
The extended evaluation results showing the pairwise normalized mutual information between all clusterings can be found here.
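For readers who want to compute a comparable table on their own clusterings, here is a minimal sketch of pairwise normalized mutual information. It makes two assumptions that this page does not confirm: each clustering is a hard partition given as one label per node, and the mutual information is normalized by the arithmetic mean of the two entropies. Other normalizations (for example the geometric mean or the maximum) exist, and overlapping clusterings would need a different measure. All function and clustering names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (natural log) between two label sequences."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        c / n * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization: 2 I(A;B) / (H(A) + H(B))."""
    if len(a) != len(b):
        raise ValueError("both clusterings must label the same nodes")
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 1.0  # both clusterings put every node into a single cluster
    return 2 * mutual_information(a, b) / (h_a + h_b)


def pairwise_nmi(clusterings):
    """Map each unordered pair of clustering names to its NMI value."""
    return {
        (p, q): nmi(clusterings[p], clusterings[q])
        for p, q in combinations(list(clusterings), 2)
    }


# Hypothetical clusterings of four nodes.
clusterings = {
    "run_a": [0, 0, 1, 1],
    "run_b": [1, 1, 0, 0],  # same partition as run_a, labels swapped
    "run_c": [0, 1, 0, 1],  # statistically independent of run_a
}
scores = pairwise_nmi(clusterings)
```

Because the measure depends only on the induced partitions, relabeling clusters (as in `run_b`) does not change the score.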