Large Scale Learning
Much of my current research focuses on kernel methods such as Support Vector Machines (SVMs) for sequence analysis problems appearing in bioinformatics. For example, I co-organized the PASCAL Large Scale Learning Challenge. Currently, we are preparing JMLR special topic on Large Scale Learning (soon to be published). In addition, I am working on the design of new efficient string kernels as well as faster algorithms to train and evaluate SVMs on sequences. Before I began to work on this topic, it had been almost unthinkable to train SVMs using sophisticated string kernels on more than a few hundred thousand examples. Using the newly developed methods we can now solve learning tasks involving up to 50 million training examples. Requiring reasonable amounts of computing time, we can now apply the resulting classifier to the whole human genome with as much as 6 billion examples.
Genomic Sequence Analysis
I have been working to employ these methods to splice site recognition in several organisms (link). Together with my collaborators, I was able to show that our methods drastically outperform all other methods, which is pivotal for the high accuracy of a novel splice form prediction tool, mSplicer, and the success of a related gene finding system, mGene, in the nGASP competition. Additionally, we have developed a promoter detection system "ARTS" , that detects transcription start sites on the whole human genome. Our approach works with much higher accuracy than previous state of the art methods and by using the developed large scale learning techniques, the SVMs could be trained in only a few hours and applied genome wide.
Interpretability
SVMs find a discrimination in a high dimensional kernel feature space and as such often have to be treated as a black box. This implies that analyses or visualization of the learning result is inherently difficult. It poses a problem for applications in bioinformatics as it is often very important to understand which features are used for learning and why the accuracy is high. I have developed a novel approach based on Multiple Kernel learning that can be used for discovering discriminative features of the underlying biological problem. An extended approach --- the so called Positional Oligomer Importance Matrices (POIMs) --- allows us to pin-point motifs, is very efficient and can be directly applied to the learned SVM classifier.