Anytime Stream Mining

Research topic: Data Analysis and Knowledge Extraction

The management of data streams plays an important role, especially for data mining tasks such as clustering, classification, aggregation, prediction and the identification of relevant data. Because of the increasing data volume, it is no longer possible to buffer a stream and process it in multiple passes. The underlying algorithms for mining data streams therefore have to be designed so that each data item is accessed at most once. Some situations, such as peak loads, require results very quickly. In others there is no such pressure, and the additional time can be used to raise the quality of the result, up to the best possible one. Under greatly varying time constraints caused by a priori unknown stream inter-arrival rates, anytime algorithms provide the best result achievable up to the point of interruption, which is dictated by the arrival of the next stream element. For many mining tasks, traditional algorithms are known that provide good results yet cannot be interrupted in a meaningful manner. We therefore focus on adaptive stream mining techniques that can be interrupted at any time and that improve the quality of their results as more execution time becomes available.

Data streams naturally have a temporal component and usually change over time. Mining algorithms have to be optimized for this case so that they remain aware of how the data evolves over the course of the stream. The evolution of the underlying data distribution model is referred to as concept drift and novelty. Algorithms that try to find a model for the distribution of a given data set often need a considerable amount of time. To deal with concept drift and novelty in very fast data streams, we therefore examine algorithms for modeling stream data distributions that support incremental learning. Other mining tasks, such as ranking and top-k queries, search for the most interesting data or the most relevant dimensions based on characteristic measures. However, as the data stream proceeds, previous results may become invalid with respect to recently arrived data items. Maintaining a correct result in a data stream environment, e.g. for a top-k query, therefore requires efficient continuous query processing and incremental algorithms.
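
As an illustration of incremental learning under concept drift, the following minimal sketch maintains a one-dimensional Gaussian model that is updated once per stream item and gradually forgets old data. It is not one of the algorithms studied here; the class name `DecayedGaussian` and the forgetting factor `decay` are assumptions made purely for this example.

```python
from dataclasses import dataclass


@dataclass
class DecayedGaussian:
    """One-dimensional Gaussian estimate, updated once per stream item.

    `decay` is a hypothetical forgetting factor in (0, 1];
    1.0 means that nothing is ever forgotten.
    """
    decay: float = 0.99
    weight: float = 0.0  # decayed number of items seen so far
    mean: float = 0.0
    m2: float = 0.0      # decayed sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Age the old statistics, then fold in the new item
        # (weighted form of Welford's single-pass update).
        self.weight = self.decay * self.weight + 1.0
        self.m2 *= self.decay
        delta = x - self.mean
        self.mean += delta / self.weight
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.weight if self.weight > 0 else 0.0


# Simulated drift: the stream jumps from values around 0 to values around 10.
model = DecayedGaussian(decay=0.5)
for value in [0.0] * 50 + [10.0] * 50:
    model.update(value)
print(model.mean, model.variance)
```

Each item is touched exactly once and the update cost is constant, so the model can follow a drifting distribution without buffering the stream. A smaller `decay` adapts faster to drift at the price of a noisier estimate.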

Anytime algorithms are capable of dealing with the varying time constraints and high data volumes described above. Their advantages can be summarized as flexibility (exploiting all available time), interruptibility (providing a decision whenever they are interrupted) and incremental improvement (continuing to improve from the current state without a restart).
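
The three properties can be sketched with a deliberately simple example: a nearest-neighbor lookup that reports its best answer so far after every refinement step. This is only an illustration under stated assumptions, not one of the techniques developed in this research topic; the function names `anytime_nearest_label` and `run_until` and the toy data are hypothetical.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Sequence, Tuple

Example = Tuple[Sequence[float], str]
Answer = Tuple[Optional[str], float]


def anytime_nearest_label(query: Sequence[float],
                          examples: Iterable[Example]) -> Iterator[Answer]:
    """Yield (label, squared distance) of the best match found so far."""
    best_label, best_dist = None, float("inf")
    for point, label in examples:
        dist = sum((q - p) ** 2 for q, p in zip(query, point))
        if dist < best_dist:
            best_label, best_dist = label, dist
        # A usable answer is available after every single step.
        yield best_label, best_dist


def run_until(refinements: Iterator[Answer],
              should_stop: Callable[[], bool]) -> Optional[Answer]:
    """Consume refinement steps until interrupted; return the latest answer."""
    result = None
    for result in refinements:
        if should_stop():
            break
    return result


examples = [((0.0, 0.0), "a"), ((5.0, 5.0), "b"), ((1.0, 1.0), "c")]
steps = anytime_nearest_label((0.9, 1.2), examples)

# Interruption is modeled as a deadline, e.g. the arrival of the next item.
deadline = time.monotonic() + 0.001
print(run_until(steps, lambda: time.monotonic() >= deadline))
```

Because the generator keeps its state, calling `run_until(steps, ...)` again resumes the search where it stopped instead of restarting, which corresponds to incremental improvement. The loop uses whatever time the deadline allows (flexibility) and always returns the latest answer when stopped (interruptibility).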