RSS Feed: Software

ASP - Accurate Splice Site Predictor

Splicing For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks. In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel which turns out well suited for this task, as we will illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including Markov Chains, GeneSplicer and SpliceMachine.

Software available from here.

ARTS - Accurate Transcription Start Site Prediction

ARTS We develop new methods for finding transcription start sites (TSS) of RNA Polymerase II binding genes in genomic DNA sequences. Employing Support Vector Machines with advanced sequence kernels, we achieve drastically higher prediction accuracies than state-of-the-art methods.

Software available from here.

POIMs - Positional Oligomer Importance Matrices

SVMs find a discrimination in a high dimensional kernel feature space and as such often have to be treated as a black box. This implies that POIManalyses or visualization of the learning result is inherently difficult. It poses a problem for applications in bioinformatics as it is often very important to understand which features are used for learning and why the accuracy is high. We have developed the concept of Positional Oligomer Importance Matrices (POIMs) --- that allows us to pin-point and visualize the most discriminative motifs. Computing poims is very efficient and can be directly applied to the learned SVM classifier.

mSplicer - Splice Form Prediction

mSplicer For modern biology, precise genome annotations are of prime importance as they allow the accurate definition of genic regions. We employ state of the art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA utilizing recent advances in support vector machines and label sequence learning.

This software is available from and its project page.

mGene - Accurate Genefinding

mGene mGene is a gene finding system that was developed at the Friedrich Miescher Lab in Tübingen. It tackles the gene prediction problem using a two-layered approach. In a first step (layer 1) state-of-the-art kernel machines are employed to detect signal sequences in genomic DNA. In a second step (layer 2) their outputs are combined by a Hidden Semi Markov machine learning algorithm, to predict whole gene structures. Major algorithms, which were used, are implemented in the SHOGUN toolbox and combined and complemented with Octave/Matlab scripts.

mGene is described at and web-interface that lets practitioners apply and re-train mGene is available via Galaxy.

About Me

Sören Sonnenburg Dr. Sören Sonnenburg
Machine Learning / Bioinformatics Research Scientist
Berlin, Germany