Large Scale Learning
Much of my current research focuses on kernel methods such as Support Vector Machines (SVMs) for sequence analysis problems arising in bioinformatics. As part of this work, I co-organized the PASCAL Large Scale Learning Challenge, and we are currently preparing a JMLR special topic on Large Scale Learning (soon to be published).
In addition, I am working on the design of new, efficient string kernels as well as faster algorithms to train and evaluate SVMs on sequences. Before I began working on this topic, training SVMs with sophisticated string kernels on more than a few hundred thousand examples was almost unthinkable. With the newly developed methods we can now solve learning tasks involving up to 50 million training examples, and we can apply the resulting classifier to the whole human genome, with as many as 6 billion examples, in a reasonable amount of computing time.
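To make the notion of a string kernel concrete, here is a minimal sketch of one of the simplest examples, the spectrum kernel, which compares two sequences by the k-mers they share. It is an illustration only, not the optimized kernels or training algorithms described above; the function names `kmer_counts` and `spectrum_kernel` and the default `k=3` are assumptions chosen for this example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of sequences a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Iterate over the smaller count table; a k-mer missing from
    # either sequence contributes nothing to the sum.
    if len(ca) > len(cb):
        ca, cb = cb, ca
    return sum(n * cb[kmer] for kmer, n in ca.items())


if __name__ == "__main__":
    print(spectrum_kernel("ACGTACGT", "TACGTTAC", k=3))
```

Evaluating such a kernel naively for every pair of training sequences scales quadratically with the number of examples, which is why faster training and evaluation algorithms matter at the scales mentioned above.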
Genomic Sequence Analysis
I have been working to apply these methods to splice site recognition
in several organisms. Together with my collaborators, I was able to show
that our methods drastically outperform all other methods, which is pivotal for the high accuracy of a novel
splice form prediction tool, mSplicer, and the
success of a related gene finding system, mGene, in the
nGASP competition. Additionally, we have developed a promoter detection system,
"ARTS", that detects
transcription start sites on the whole human genome. Our approach works
with much higher accuracy than previous state-of-the-art methods. By
using the large scale learning techniques we developed, the SVMs could be
trained in only a few hours and applied genome-wide.
Interpretability
SVMs learn a discriminant in a high-dimensional kernel feature space
and as such often have to be treated as a black box. This implies that
analysis or visualization of the learning result is inherently
difficult. This poses a problem for applications in bioinformatics, where it
is often very important to understand which features are used for
learning and why the accuracy is high. I have developed a novel approach
based on Multiple
Kernel Learning that
can be used for discovering discriminative features of the
underlying biological problem. An extended approach, the so-called
Positional Oligomer Importance Matrices (POIMs),
allows us to pinpoint motifs, is very efficient, and can be
applied directly to the learned SVM classifier.
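As a rough illustration of how feature importance can be read off a trained sequence classifier, the sketch below sums the signed coefficients of the support sequences for each (position, k-mer) pair. This is a simplified stand-in and not the POIM algorithm itself; the function names, the fixed k-mer length, and the toy coefficients (each standing for alpha_i * y_i) are assumptions made for the example.

```python
from collections import defaultdict


def positional_kmer_weights(support_seqs, coeffs, k=2):
    """Map (position, k-mer) to the sum of coefficients of the
    support sequences that contain that k-mer at that position."""
    weights = defaultdict(float)
    for seq, c in zip(support_seqs, coeffs):
        for pos in range(len(seq) - k + 1):
            weights[(pos, seq[pos:pos + k])] += c
    return dict(weights)


def top_features(weights, n=3):
    """Return the n (feature, weight) pairs with the largest absolute weight."""
    return sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical support sequences and signed coefficients.
    seqs = ["ACGT", "ACGA", "TTGT"]
    coeffs = [0.5, 0.5, -1.0]
    for feature, w in top_features(positional_kmer_weights(seqs, coeffs)):
        print(feature, w)
```

Large positive weights mark positional k-mers that push a sequence towards the positive class, large negative weights towards the negative class; a table of this kind is the sort of object one would visualize to locate candidate motifs.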