Efficient Clustering of Big Data Streams
Recent advances in data collection devices and data storage systems continue to offer cheaper ways of gathering and storing ever larger volumes of data. Similar improvements in processing power and databases have made a large variety of complex data accessible. Data mining is the task of extracting useful patterns and previously unknown knowledge from this voluminous, varied data. This thesis focuses on the data mining task of clustering, i.e. grouping objects into clusters such that similar objects are assigned to the same cluster while dissimilar ones are assigned to different clusters. While traditional clustering algorithms considered only static data, today's applications and research issues in data mining have to deal with continuous, possibly infinite streams of data arriving at high velocity. Web traffic data, click streams, surveillance data, sensor measurements, customer profile data and stock trading are only some examples of these steadily growing applications.
Since the growth in data size is accompanied by a similar rise in dimensionality, clusters cannot be expected to appear completely when all attributes are considered together. Subspace clustering is a general approach that addresses this issue by automatically finding the clusters hidden within different subsets of the attributes.
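The effect described above can be illustrated with a small toy experiment. The following is only a minimal sketch under stated assumptions, not one of the algorithms of this thesis: the synthetic data, the helper names (`mean_distance`, `contrast`, `make_point`) and the contrast measure are all hypothetical and chosen purely for illustration. Two groups of points differ in a single attribute, while 20 further attributes contain only noise.

```python
import math
import random


def mean_distance(points_a, points_b, dims):
    """Average Euclidean distance over all pairs (a, b) with a is not b,
    computed only on the attributes listed in dims."""
    total, count = 0.0, 0
    for a in points_a:
        for b in points_b:
            if a is b:
                continue  # skip comparing a point with itself
            total += math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))
            count += 1
    return total / count


def contrast(group_a, group_b, dims):
    """Between-group distance divided by within-group distance.
    Values close to 1 mean the two groups are not visibly separated."""
    within = (mean_distance(group_a, group_a, dims)
              + mean_distance(group_b, group_b, dims)) / 2
    between = mean_distance(group_a, group_b, dims)
    return between / within


rng = random.Random(0)  # fixed seed so the toy data is reproducible
N_NOISE = 20            # number of irrelevant attributes (assumption)


def make_point(center):
    # Attribute 0 carries the cluster structure; the rest is uniform noise.
    relevant = [center + rng.uniform(-0.05, 0.05)]
    noise = [rng.uniform(0, 10) for _ in range(N_NOISE)]
    return relevant + noise


group_a = [make_point(0.0) for _ in range(30)]
group_b = [make_point(1.0) for _ in range(30)]

contrast_subspace = contrast(group_a, group_b, dims=[0])
contrast_full = contrast(group_a, group_b, dims=range(N_NOISE + 1))

print(f"contrast in subspace {{0}}: {contrast_subspace:.2f}")
print(f"contrast in full space:   {contrast_full:.2f}")
```

In the one-dimensional subspace the two groups are clearly separated, whereas in the full space the noise attributes dominate the distances and the contrast falls close to 1. A subspace clustering method aims to find such relevant attribute subsets automatically instead of having them supplied, as they are in this sketch.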
In this thesis, novel methods for efficient subspace clustering of high-dimensional data streams are presented and thoroughly evaluated. Approaches that efficiently combine the anytime clustering concept with the stream subspace clustering paradigm are studied in depth. Additionally, efficient and adaptive density-based clustering algorithms are presented for high-dimensional data streams. New algorithmic solutions for energy-efficient in-sensor-network aggregation and lightweight clustering are presented for sensor streaming data. A novel open-source assessment framework and new evaluation measures are presented for subspace stream clustering.
Above all, this thesis contributes, for the first time, efficient models of advanced and complex clustering tasks for data streams.
|Published in:||Dissertation, Fakultät für Mathematik, Informatik und Naturwissenschaften, RWTH Aachen University|
|Publisher:||Apprimus-Verlag - Aachen|
Date of the oral exam: 26.01.2015. URN: urn:nbn:de:hbz:82-RWTH-2015-02790
|Research area:||Data Analysis and Knowledge Extraction|