Ever-increasing amounts of data are collected in real-world applications in the life sciences, engineering, telecommunications, business transactions and many other domains. For the management and analysis of these data, many techniques and algorithms have been developed, ranging from basic database operations to high-level data mining approaches such as clustering, classification, and outlier detection. Processing huge data sets with millions or billions of records exceeds the capabilities of a single computing node due to limitations of disk space and/or main memory. It is therefore indispensable to develop distributed approaches that run in parallel on clusters of several computers.
For the development of distributed algorithms, a variety of structured programming models exists. Besides classic parallel programming, the MapReduce model proposed by Google and its open-source implementation Hadoop have found widespread attention and usage.
In MapReduce, the data is given as a list of records that are represented as (key, value) pairs. A MapReduce program essentially consists of two phases: in the "Map" phase, the records are arbitrarily distributed to different computing nodes (called "mappers"), and each record is processed separately, independently of the other data items. The map phase outputs intermediate (key, value) pairs. In the "Reduce" phase, records having the same key are grouped together and processed on the same computing node ("reducer"). The reducers thus combine the information of different records having the same key and aggregate the intermediate results of the mappers. The results are stored back to the distributed file system.
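The two phases described above can be illustrated with a small local simulation of the model (this is a hedged sketch of the programming paradigm, not the actual Hadoop API), using the classic word-count task: the mappers emit one intermediate pair per word, the framework groups the pairs by key, and the reducers sum the counts per word.

```python
from collections import defaultdict

def map_phase(records):
    """Map: each (key, value) record is processed independently,
    emitting intermediate (key, value) pairs."""
    for _, line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does
    before handing them to the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    """Reduce: aggregate all intermediate values sharing the same key."""
    for key, values in grouped:
        yield (key, sum(values))

# Input records as (key, value) pairs, e.g. (line number, line text).
records = [(0, "map reduce"), (1, "map map")]
result = dict(reduce_phase(shuffle(map_phase(records))))
# result == {"map": 3, "reduce": 1}
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across arbitrarily many nodes, which is the source of the model's scalability.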
Beyond this programming model, Hadoop and other implementations of the MapReduce framework offer several non-functional advantages: they scale to clusters of many computing nodes, which can easily be extended with new nodes. They are fault-tolerant: if one of the computing nodes fails during the execution of a program, the work of the other nodes is neither affected nor discarded; only the records that were being processed on the failing node have to be processed again by another node. This fault tolerance makes Hadoop particularly well suited to commodity hardware. For example, organizations often have tens or hundreds of desktop computers that are used only at certain times of day and, for the most part, only for office applications. These computers typically have much unused capacity in terms of processor time and disk space, which Hadoop can easily harness for distributed computing.
The goal of this research is the development of highly parallelizable data mining techniques within the MapReduce framework.