SOREX: Subspace Outlier Ranking Exploration Toolkit

An Open Source Framework for Evaluation and Exploration of Subspace Outlier Ranking Algorithms in WEKA

 

Emmanuel Müller, Matthias Schiffer, Patrick Gerwert, Matthias Hannen, Timm Jansen and Thomas Seidl

 

For easy exploration of subspace outliers, we integrated several recent subspace outlier ranking approaches into the popular WEKA framework. With our framework, one can run all competing approaches on arbitrary data sets given in the ARFF data format. Because SOREX builds on WEKA, the general pre-processing already available in WEKA can be used while exploring outliers in SOREX.
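As an illustration of the ARFF input mentioned above, the following Python sketch writes a small numeric data set in ARFF syntax. The function name `to_arff`, the attribute names, and the use of a nominal `class` attribute with values `{0,1}` to mark known outliers are assumptions made for this example; they are not taken from the SOREX documentation, so check the test data sets shipped with SOREX for the layout it actually expects.

```python
def to_arff(relation, attribute_names, rows, labels=None):
    """Render numeric rows as an ARFF document (hypothetical helper).

    relation        -- name written after @relation
    attribute_names -- one name per numeric column
    rows            -- list of equally long numeric sequences
    labels          -- optional 0/1 flags; written as a nominal 'class'
                       attribute (an assumption, not a SOREX requirement)
    """
    lines = ["@relation " + relation, ""]
    for name in attribute_names:
        lines.append("@attribute " + name + " numeric")
    if labels is not None:
        lines.append("@attribute class {0,1}")
    lines.append("")
    lines.append("@data")
    for i, row in enumerate(rows):
        if len(row) != len(attribute_names):
            raise ValueError("row %d has the wrong number of values" % i)
        values = [repr(float(v)) for v in row]
        if labels is not None:
            values.append(str(labels[i]))
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy data: the last object is flagged as an outlier.
    text = to_arff("toy", ["x", "y"], [[1, 2], [1.5, 2.5], [9, 0]], [0, 0, 1])
    print(text)
```

The resulting text can be saved with an `.arff` extension and opened in WEKA like any other ARFF file.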

 

In previous work, we extended the WEKA framework to subspace and projected clustering; SOREX builds on this underlying project. A short description of this extension and how to use it can be found on our OpenSubspace project website. Initially, this project was designed for interactive exploration of clustering results [Morpheus: Interactive exploration of subspace clustering, presented at KDD 2008]. The framework was also used as the baseline in our recent evaluation study, ensuring comparability and repeatability of experiments [Evaluating Clustering in Subspace Projections of High Dimensional Data, VLDB 2009].

 

For our work on subspace outlier ranking, we extended the OpenSubspace framework with novel abstract classes and interfaces that support the requirements of recent subspace outlier ranking algorithms. SOREX provides a new outlier tab, as depicted in the following screenshots. In this tab, outlier mining algorithms can be selected from our algorithm repository, which covers subspace outlier ranking and some traditional (full space) outlier ranking approaches.

 

 

 

A simple text output gives basic results for each algorithm run. It includes evaluation measures such as the AUC (area under the ROC curve) and further objective evaluation measures used in the outlier ranking literature. Our novel descriptive components can be loaded by right-clicking on the algorithm run. Users may choose among several visualizations. The ranking plot gives an overview of the result. Starting from this plot, more detailed information about the properties of individual outliers can be selected. Additional plots show the relevant subspace projections, which provide the reasons for outlierness.
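To make the AUC measure concrete, here is a minimal Python sketch that computes the area under the ROC curve for an outlier ranking using the rank-sum (Mann-Whitney) formulation. It is not the SOREX implementation; the function name `roc_auc` and the conventions that higher scores mean "more outlying" and that label 1 marks a true outlier are assumptions made for this example.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve of an outlier ranking (illustrative sketch).

    scores -- outlier scores, higher means more outlying (assumed convention)
    labels -- ground truth, 1 for a true outlier and 0 for an inlier

    The AUC equals the probability that a randomly chosen outlier is scored
    higher than a randomly chosen inlier, with ties counted as one half.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)

    # Assign 1-based ranks in ascending score order; tied scores share
    # the average of the ranks they occupy.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_out = sum(1 for _, label in pairs if label == 1)
    n_in = n - n_out
    if n_out == 0 or n_in == 0:
        raise ValueError("need at least one outlier and one inlier")

    rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum - n_out * (n_out + 1) / 2.0) / (n_out * n_in)


if __name__ == "__main__":
    # Five objects, two of them true outliers.
    print(roc_auc([0.9, 0.8, 0.3, 0.2, 0.1], [1, 0, 1, 0, 0]))
```

A value of 1.0 means every true outlier is ranked above every inlier, while 0.5 corresponds to a ranking no better than random.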

 

Overall, the visualization and exploration can be used not only to verify the subspace outliers found but also to evaluate the outlier ranking results. Closing the loop of the KDD cycle, one can try better parameter settings or even compare several parametrizations step by step in a bracketing procedure.

 

Resources

 

Executables and test data sets for subspace outlier exploration: SOREX.zip

For testing, we recommend the OUTRANK_PROCLUS algorithm, which has low runtimes even for the large pendigits data set.

 

We encourage researchers in this area to use the proposed exploration toolkit for their own publications, to explore the results of competing approaches, or to implement novel subspace outlier ranking methods in this framework.

Citation Information

If you publish material based on databases, algorithms or evaluation measures obtained from this repository, then, in your acknowledgments, please note the assistance you received by using this repository. This will help others to obtain the same data sets, algorithms and evaluation measures and replicate your experiments. We suggest the following reference format for referring to this project:

Müller E., Schiffer M., Gerwert P., Hannen M., Jansen T., Seidl T.:
SOREX: Subspace Outlier Ranking Exploration Toolkit

http://dme.rwth-aachen.de/OpenSubspace/SOREX

In Proc. (ECML PKDD 2010), Barcelona, Spain.