Tutorial: Discovering Multiple Clustering Solutions

Grouping Objects in Different Views of the Data

 

Traditional clustering algorithms identify just a single clustering of the data. Today's complex data, however, allows multiple interpretations leading to several valid groupings hidden in different views of the database. Each of these multiple clustering solutions is valuable and interesting as different perspectives on the same data and several meaningful groupings for each object are given. Especially for high dimensional data where each object is described by multiple attributes, alternative clusters in different attribute subsets are of major interest.

 

In this tutorial, we describe several real world application scenarios for multiple clustering solutions. We abstract from these scenarios and provide the general challenges in this emerging research area. We describe state-of-the-art paradigms, we highlight specific techniques, and we give an overview of this topic by providing a taxonomy of the existing methods. By focusing on open challenges, we try to attract young researchers for participating in this emerging research field.

 

Our target is a broad audience ranging from researches focusing on formal methods up to application oriented practitioners from industry. Researches from the area of clustering but also from related directions might contribute to this field. For all of these people, multiple clustering solutions will be of great interest.

 

Presentations of this tutorial at IEEE International Conference on Data Mining (ICDM 2010), SIAM Conference on Data Mining (SDM 2011), IEEE International Conference on Data Engineering (ICDE 2012), Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2012), and recent MultiClust workshops at KDD 2010, ECML PKDD 2011, and SDM 2012 have initiated a broad interest in this emerging topic. Hence, we are currently organizing a special issue on the topic of "Multiple Clustering Solutions". The MultiClust Special Issue on Discovering, Summarizing and Using Multiple Clusterings will be published in the Machine Learning Journal. The special issue is addressed to the emerging communities in Database Systems, Machine Learning, Data Mining, and Statistics that are contributing to this topic. Additionally, we will present an updated version of our tutorial at the International Conference on Machine Learning (ICML 2013).

 

Download the latest version of tutorial slides:

Tutorial slides "Discovering Multiple Clustering Solutions"

 

Dr. Emmanuel Müller (Karlsruhe Institute of Technology, Germany)

Emmanuel Müller is a senior researcher and lecturer at the institute for program structures and data organization at the Karlsruhe Institute of Technology (KIT), Germany. In the past years, he was a research assistant in computer science at the data management and data exploration group at RWTH Aachen University, Germany. His research interests cover efficient data mining in high dimensional data, detection of clusters in subspace projections and outlier detection. Leading the open-source initiative OpenSubspace he provides a general contribution to the research community especially by a repeatable and comparable evaluation study on recent data mining approaches. Dr. Müller received his Diplom (MSc) in 2007 and his PhD in 2010 from RWTH Aachen University. He is active member of program committees such as SDM, ECML PKDD, and recent MultiClust-Workshops.

Dr. Stephan Günnemann (Carnegie Mellon University, USA)

Stephan Günnemann is a Postdoctoral researcher at the Department of Computer Science, Carnegie Mellon University, USA. In the past years, he has been a research associate in computer science at the data management and data exploration group at RWTH Aachen University, Germany where he has been supported by the UMIC research cluster of the DFG. His research interests include graph mining and the mining of non-redundant, multiple clustering solutions for high dimensional and structured data. He contributes to the open source initiative OpenSubspace for the evaluation and exploration of subspace clustering algorithms. Dr. Günnemann received his Diplom (MSc) in 2008 and his PhD in 2012 from RWTH Aachen University.

Ines Färber (RWTH Aachen University, Germany)

Ines Färber is a PhD student and research assistant in computer science at the data management and data exploration group at RWTH Aachen University, Germany. Her research interests include mining of alternative and multi-view clustering solutions for high dimensional data. She contributes to the OpenSubspace initiative for evaluation and exploration of multiple clustering solutions. Ines Färber received her Diplom (MSc) in 2009 from RWTH Aachen University.

Prof. Dr. Thomas Seidl (RWTH Aachen University, Germany)

Thomas Seidl is a professor for computer science and head of the data management and data exploration group at RWTH Aachen University, Germany. His research interests include data mining and database technology for multimedia and spatio-temporal databases in engineering, communication and life science applications. Prof. Seidl received his Diplom (MSc) in 1992 from TU Muenchen and his PhD (1997) and venia legendi (2001) from LMU Muenchen. He is active member of several program committees including ACM SIGKDD, IEEE ICDE, SDM, recent MultiClust-Workshops and others. He is member of the editorial board of The VLDB Journal as associate editor.

Acknowledgements:

This research was funded in part by the cluster of excellence on Ultra-high speed Mobile Information and Communication (UMIC) of the DFG (German Research Foundation grant EXC 89).

References:

  • Aggarwal, C., & Yu, P.: Finding generalized projected clusters in high dimensional spaces. In: SIGMOD 2000
  • Aggarwal, C., Wolf, J., Yu, P., Procopiuc, C., & Park, J.: Fast algorithms for projected clustering. In: SIGMOD 1999
  • Agrawal, R., & Srikant, R.: Fast Algorithms for mining Association Rules. In: VLDB 1994
  • Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD 1998
  • Assent, I., Krieger, R., Müller, E., & Seidl, T.: DUSC: Dimensionality Unbiased Subspace Clustering. In: ICDM 2007
  • Assent, I., Krieger, R., Müller, E., & Seidl, T.: INSCY: Indexing Subspace Clusters with In-Process-Removal of Redundancy. In: ICDM 2008
  • Bae, Eric, & Bailey, James.: COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity. In: ICDM 2006
  • Bae, Eric, Bailey, James, & Dong, Guozhu.: A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings. Data Min. Knowl. Discov., 21(3).
  • Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U.: When is nearest neighbors meaningful. In: IDBT 1999
  • Bickel, Steffen, & Scheffer, Tobias.: Multi-View Clustering. In: ICDM 2004
  • Blum, A., & Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT 1998
  • Böhm, C., Kailing, K., Kriegel, H.-P., & Kröger, P.: Density Connected Clustering with Local Subspace Preferences. In: ICDM 2004
  • Böhm, Christian, Kailing, Karin, Kröger, Peer, & Zimek, Arthur.: Computing Clusters of Correlation Connected objects. In: SIGMOD 2004
  • Caruana, Rich, Elhawary, Mohamed Farid, Nguyen, Nam, & Smith, Casey.: Meta Clustering. In: ICDM 2006
  • Chechik, Gal, & Tishby, Naftali.: Extracting Relevant Structures with Side Information. In: NIPS 2002
  • Cheng, C.-H., Fu, A. W., & Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: SIGKDD 1999
  • Cordeiro, R., Traina, A., Faloutsos, C., & Traina, C.: Finding Clusters in Subspaces of Very Large Multi-dimensional Datasets. In: ICDE 2010
  • Cui, Ying, Fern, Xiaoli Z., & Dy, Jennifer G.: Non-redundant Multi-view Clustering via Orthogonalization. In: ICDM 2007
  • Cui, Ying, Fern, Xiaoli Z., & Dy, Jennifer G. Learning multiple nonredundant clusterings. TKDD, 4(3).
  • Dang, Xuan Hong, & Bailey, James.: Generation of Alternative Clusterings Using the CAMI Approach. In: SDM 2010
  • Dang, Xuan Hong, & Bailey, James.: A hierarchical information theoretic technique for the discovery of non linear alternative clusterings. In: SIGKDD 2010
  • Davidson, Ian, & Qi, Zijie.: Finding Alternative Clusterings Using Constraints. In: ICDM 2008
  • de Sa, Virginia R.: Spectral clustering with two views. In: ICML 2005 Workshop on Learning with Multiple Views
  • Domeniconi, Carlotta, & Al-Razgan, Muna.: Weighted cluster ensembles: Methods and analysis. TKDD, 2(4).
  • Ester, M., Kriegel, H.-P., Sander, J., & Xu, X.: A density-based algorithm for discovering clusters in large spatial databases. In: SIGKDD 1996
  • Fern, Xiaoli Zhang, & Brodley, Carla E.: Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach. In: ICML 2003
  • Gondek, D., & Hofmann, T.: Conditional information bottleneck clustering. In: ICDM 2003, Workshop on Clustering Large Data Sets.
  • Gondek, David, & Hofmann, Thomas.: Non-Redundant Data Clustering. In: ICDM 2004
  • Gondek, David, & Hofmann, Thomas.: Non-redundant clustering with conditional ensembles. In: SIGKDD 2005
  • Gondek, David, Vaithyanathan, Shivakumar, & Garg, Ashutosh.: Clustering with Model-level Constraints. In: SDM 2005
  • Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B.: Measuring statistical dependence with hilbertschmidt norms. In: Algorithmic Learning Theory 2005
  • Günnemann, S., Färber, I., & Seidl, T.: Multi-View Clustering Using Mixture Models in Subspace Projections. In: SIGKDD 2012
  • Günnemann, S., Müller, E., Färber, I., & Seidl, T.: Detection of Orthogonal Concepts in Subspaces of High Dimensional Data. In: CIKM 2009
  • Günnemann, S., Färber, I., Müller, E., & Seidl, T.: ASCLU: Alternative Subspace Clustering. In: MultiClust Workshop at SIGKDD 2010
  • Hossain, M. Shahriar, Tadepalli, Satish, Watson, Layne T., Davidson, Ian, Helm, Richard F., & Ramakrishnan, Naren.: Unifying dependent clustering and disparate clustering for non-homogeneous data. In: SIGKDD 2010
  • Jain, Prateek, Meka, Raghu, & Dhillon, Inderjit S.: Simultaneous Unsupervised Learning of Disparate Clusterings. In: SDM 2008
  • Januzaj, Eshref, Kriegel, Hans-Peter, & Pfeifle, Martin.: Scalable Density-Based Distributed Clustering. In: PKDD 2004
  • Kailing, K., Kriegel, H.-P., Kröger, P., & Wanka, S.: Ranking interesting subspaces for clustering high dimensional data. In: PKDD 2003
  • Kailing, K., Kriegel, H.-P., Pryakhin, A., & Schubert, M.: Clustering Multi-Represented Objects with Noise. In: PAKDD 2004
  • Kailing, K., Kriegel, H.-P., & Kröger, P.: Density-Connected Subspace Clustering for High-Dimensional Data. In: SDM 2004
  • Kriegel, Hans-Peter, Kröger, Peer, Renz, Matthias, & Wurst, Sebastian.: A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data. In: ICDM 2005
  • Kriegel, Hans-Peter, Kröger, Peer, & Zimek, Arthur.: Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. TKDD, 3(1).
  • Long, Bo, Yu, Philip S., & Zhang, Zhongfei.: A General Model for Multiple View Unsupervised Learning. In: SDM 2008
  • Moise, Gabriela, & Sander, Jörg.: Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: SIGKDD 2008
  • Moise, Gabriela, Sander, Joerg, & Ester, Martin.: P3C: A Robust Projected Clustering Algorithm. In: ICDM 2006
  • Müller, E., Assent, I., Krieger, R., Günnemann, S., & Seidl, T.: DensEst: Density Estimation for Data Mining in High Dimensional Spaces. In: SDM 2009
  • Müller, E., Günnemann, S., Assent, I., & Seidl, T.: Evaluating Clustering in Subspace Projections of High Dimensional Data. In: VLDB 2009
  • Müller, E., Assent, I., Günnemann, S., Krieger, R., & Seidl, T.: Relevant Subspace Clustering: Mining the Most Interesting Non-Redundant Concepts in High Dimensional Data. In: ICDM 2009
  • Nagesh, H., Goil, S., & Choudhary, A.: Adaptive grids for clustering massive data sets. In: SDM 2001
  • Ng, A., Jordan, M., & Weiss, Y.: On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14.
  • Niu, Donglin, & Dy, Jennifer G.: Multiple Non-Redundant Spectral Clustering Views. In: ICML 2010
  • Parsons, Lance, Haque, Ehtesham, & Liu, Huan.: Subspace clustering for high dimensional data: a review. SIGKDD Explorations, 6(1).
  • Procopiuc, C. M., Jones, M., Agarwal, P. K., & Murali, T. M.: A Monte Carlo algorithm for fast projective clustering. In: SIGMOD 2002
  • Qi, Zijie, & Davidson, Ian.: A principled and flexible framework for finding alternative clusterings. In: SIGKDD 2009
  • Sequeira, K., & Zaki, M.: SCHISM: A New Approach for Interesting Subspace Mining. In: ICDM 2004
  • Strehl, Alexander, & Ghosh, Joydeep.: Cluster Ensembles — A Knowledge Reuse Framework for Combining Multiple Partitions. J. of Machine Learning Research, 3
  • Vinh, Nguyen Xuan, & Epps, Julien.: minCEntropy: a Novel Information Theoretic Approach for the Generation of Alternative Clusterings. In: ICDM 2010
  • Wiswedel, Bernd, Höppner, Frank, & Berthold, Michael R.: Learning in parallel universes. Data Min. Knowl. Discov., 21(1).
  • Yiu, M. L., & Mamoulis, N.: Frequent-pattern based iterative projected clustering. In: ICDM 2003
  • Zhou, D., & Burges, C. J. C.: Spectral clustering and transductive learning with multiple views. In: ICML 2007.