Efficient Streaming Detection of Hidden Clusters in Big Data Using Subspace Stream Clustering

Recently, many data mining techniques have been revisited to cope with new big data challenges. Nearly all of these algorithms focus on efficiency, so that the mining algorithm can keep up with the increasing size of the data. However, as the dimensionality of the data increases, not only the efficiency but also the effectiveness of traditional mining algorithms is compromised. For instance, clusters hidden in some subspaces become hard to detect with traditional clustering algorithms.
In this paper, we address both the huge size and the high dimensionality of big data with a novel three-phase model for subspace stream clustering algorithms. In its first phase, the model copes with the size of the data by continuously processing the incoming data objects as a stream and summarizing them into micro-clusters. Then, after a certain batch of data or upon a user request, the second phase is applied to the micro-cluster summaries to reconstruct the current distribution of the data.
In the third phase, a subspace clustering algorithm is applied to overcome the high dimensionality of the data and to find clusters hidden within some subspaces. We perform an extensive evaluation study of different scenarios that follow our model on a big data set.
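The first two phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names, the distance threshold `max_dist`, and the choice to regenerate points by sampling an axis-aligned Gaussian around each micro-cluster are all assumptions made for illustration. The third phase would pass the regenerated points to a subspace clustering algorithm of choice and is not shown.

```python
import math
import random


class MicroCluster:
    """Hypothetical summary: count, per-dimension linear sum and squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]

    def add(self, point):
        # Update the summary incrementally; the raw point is not stored.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

    def variance(self):
        # Per-dimension variance, clamped at 0 to guard against rounding error.
        return [max(q / self.n - (s / self.n) ** 2, 0.0)
                for s, q in zip(self.ls, self.ss)]


def summarize(stream, max_dist):
    """Phase 1 (assumed rule): absorb a point into the nearest micro-cluster
    if it lies within max_dist of the centroid, otherwise open a new one."""
    clusters = []
    for point in stream:
        best, best_d = None, float("inf")
        for mc in clusters:
            d = math.dist(point, mc.centroid())
            if d < best_d:
                best, best_d = mc, d
        if best is not None and best_d <= max_dist:
            best.add(point)
        else:
            clusters.append(MicroCluster(point))
    return clusters


def reconstruct(clusters, rng):
    """Phase 2 (assumed method): draw n synthetic points per micro-cluster
    from a Gaussian with the summarized mean and per-dimension variance."""
    points = []
    for mc in clusters:
        c, v = mc.centroid(), mc.variance()
        for _ in range(mc.n):
            points.append([rng.gauss(m, math.sqrt(s)) for m, s in zip(c, v)])
    return points


# Toy stream with two well-separated groups.
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
clusters = summarize(stream, max_dist=1.0)
synthetic = reconstruct(clusters, random.Random(0))
```

Because each micro-cluster keeps only a count and two sums per dimension, memory use depends on the number of micro-clusters rather than on the length of the stream.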

Authors: Hassani M., Seidl T.
Published in: Workshop on Big Data Management and Analytics (BDMA'14), held in conjunction with DASFAA'14 conference (19th International Conference on Database Systems for Advanced Applications, Bali, Indonesia)
Publisher: Springer
Language: EN
Year: 2014
Pages: 146-160
ISBN: 978-3-662-43984-5
ISSN: 0302-9743
Conference: DASFAA
Type: Conference papers (peer reviewed)