Projected Clustering for Huge Data Sets in MapReduce

Fast-growing data sets with a very high number of attributes have become common in social, industrial, and scientific domains. A meaningful analysis of these data sets requires sophisticated data mining techniques, such as projected clustering, that are able to deal with such complex data.
In this work, we investigate solutions for extending the state-of-the-art projected clustering algorithm P3C to large data sets in high-dimensional spaces. We show that the original model of the P3C algorithm is not suitable for huge data sets. Therefore, we propose the necessary changes to the underlying clustering model and then present an efficient MapReduce-based implementation, our novel P3C+-MR algorithm. The effectiveness of the proposed changes on large data sets and the efficiency of the P3C+-MR algorithm are comprehensively evaluated on synthetic and real-world data sets. Additionally, we propose the P3C+-MR-Light algorithm, a simplified version of P3C+-MR that shows extraordinarily good results in terms of runtime and result quality on large data sets. Finally, we compare our solutions to existing approaches.

Authors: Fries S., Wels S., Seidl T.
Published in: International Conference on Extending Database Technology (EDBT 2014), Athens, Greece
Publisher: OpenProceedings.org
Language: EN
Year: 2014
Pages: 49-60
Conference: EDBT
DOI: http://dx.doi.org/10.5441/002/edbt.2014.06
Url: EDBT 2014
Type: Conference papers (peer reviewed)
Research topic: Data Analysis and Knowledge Extraction