2011
-
Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Lp-norm Multiple Kernel Learning. Journal of Machine Learning Research, 12:953-997, Mar 2011.
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this 1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, like p -norms with p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the commonly used wrapper approaches. An experiment on controlled artificial data experiment sheds light on the appropriateness of sparse, non-sparse and infinity norm MKL in various scenarios. Application of p-norm MKL to three hard real-world problems from computational biology show that non-sparse MKL achieves accuraciesthat go beyond the state-of-the-art. We conclude that our improvements finally made MKL fit for deployment to practical applications: MKL now has a good chance of improving the accuracy (over a plain sum kernel) at an affordable computational cost.
Further Information: [ BibTeX | Pdf | Project Page | Dataset ]
-
Vojtech Franc, Soeren Sonnenburg, and Tomas Werner. Cutting Plane Methods in Machine Learning. In Suvrit Sra and Sebastian Nowozin and Stephen J. Wright, editor, In Optimization for Machine Learning, MIT Press, Cambridge, MA. 2011.
Abstract: Cutting plane methods are optimization techniques that incrementally construct an approximation of a feasible set or an objective function by linear inequalities, called cutting planes. Numerous variants of this basic idea are among standard tools used in convex nonsmooth optimization and integer linear programing. Recently, cutting plane methods have seen growing interest in the field of machine learning. In this chapter, we describe the basic theory behind these methods and we show three of their successful applications to solving machine learning problems: regularized risk minimization, multiple kernel learning, and MAP inference in graphical models.
-
Konrad Rieck, Sören Sonnenburg, Sebastian Mika, Christian Schäfer, Pavel Laskov, David Tax, and Klaus-Robert Müller. Support Vector Machines. In Handbook of Computational Stastistics, Springer, 2011. to appear
Abstract:
Further Information: [ BibTeX ]
2010
-
Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Non-Sparse Regularization and Efficient Training with Multiple Kernels. Technical Report CoRR abs/1003.0079 and Technical Report UCB/EECS-2010-21, University of California, Berkeley, 2010.
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this 1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, like p -norms with p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the commonly used wrapper approaches. An experiment on controlled artificial data experiment sheds light on the appropriateness of sparse, non-sparse and infinity norm MKL in various scenarios. Application of p-norm MKL to three hard real-world problems from computational biology show that non-sparse MKL achieves accuraciesthat go beyond the state-of-the-art. We conclude that our improvements finally made MKL fit for deployment to practical applications: MKL now has a good chance of improving the accuracy (over a plain sum kernel) at an affordable computational cost.
-
Sören Sonnenburg, Gunnar Rätsch, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, and Vojtech Franc. The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, 11:1799-1802, June 2010.
Abstract: We have developed a machine learning toolbox, called SHOGUN, which is designed for unified large-scale learning for a broad range of feature types and learning settings. It offers a considerable number of machine learning models such as support vector machines for classification and regression, hidden Markov models, multiple kernel learning, linear discriminant analysis, linear programming machines, and perceptrons. Most of the specific algorithms are able to deal with several different data classes, including dense and sparse vectors and sequences using floating point or discrete data types. We have used this toolbox in several applications from computational biology, some of them coming with no less than 10 million training examples and others with 7 billion test examples. With more than a thousand installations worldwide, SHOGUN is already widely adopted in the machine learning community and beyond. SHOGUN is implemented in C++ and interfaces to MATLAB, R, Octave, Python, and has a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org.
Further Information: [ BibTeX | Pdf | Project Page ]
-
Sören Sonnenburg and Vojtech Franc. COFFIN: A Computational Framework for Linear SVMs. Proceedings of the 27nd International Machine Learning Conference, 2010.
Abstract: In a variety of applications, kernel machines such as Support Vector Machines (SVMs) have been used with great success often delivering state-of-the-art results. Using the kernel trick, they work on several domains and even enable heterogeneous data fusion by concatenating feature spaces or multiple kernel learning. Unfortunately, they are not suited for truly large-scale applications since they suffer from the curse of supporting vectors, e.g., the speed of applying SVMs decays linearly with the number of support vectors. In this paper we develop COFFIN --- a new training strategy for linear SVMs that effectively allows the use of on demand computed kernel feature spaces and virtual examples in the primal. With linear training and prediction effort this framework leverages SVM applications to truly large-scale problems: As an example, we train SVMs for human splice site recognition involving 50 million examples and sophisticated string kernels. Additionally, we learn an SVM based gender detector on 5 million examples on low-tech hardware and achieve beyond the state-of-the-art accuracies on both tasks.
2009
-
Robert Jenssen, Marius Kloft, Alexander Zien, Sören Sonnenburg, and Klaus-Robert Müller. A Multi-Class Support Vector Machine Based on Scatter Criteria. Technical Report 014-2009, Technische Universität Berlin, July 2009.
Abstract: We re-visit Support Vector Machines (SVMs) and provide a novel interpretation thereof in terms of weighted class means and scatter theory. The gained theoretical insight can be translated into a highly existingcient extension to multi-class SVMs: mScatter-SVMs. Numerical simulations reveal that more than an order of magnitude speed-up can be gained while the classification performance remains largely unchanged at the level of the classical one vs. rest and one vs. one implementation of multi-class SVMs.
-
Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Pavel Laskov, Klaus-Robert Müller, and Alexander Zien. Efficient and Accurate Lp-Norm Multiple Kernel Learning. In Y. Bengio and D. Schuurmans and J. Lafferty and C. K. I. Williams and A. Culotta, editor, In Advances in Neural Information Processing Systems 22, pages 997-1005, MIT Press, 2009.
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability. Unfortunately, l1-norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary lp-norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply lp-norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art.
-
Alexander Zien, Nicole Krämer, Sören Sonnenburg, and Gunnar Rätsch. The Feature Importance Ranking Measure. In W. Buntine and M. Grobelnik and D. Mladenic and J. Shawe-Taylor, editor, In Proceedings of the European Conference on Machine Learning, pages 694-709, Springer Berlin / Heidelberg, 2009.
Abstract: Most accurate predictions are typically obtained by learning machines with complex feature spaces (e.g., as induced by kernels). Unfortunately, such decision rules are hardly accessible to humans and cannot easily be used to gain insights about the application domain. Therefore, one often resorts to linear models in combination with variable selection, thereby sacrificing some predictive power for presump- tive interpretability. Here, we introduce the Feature Importance Ranking Measure (FIRM), which by retrospective analysis of arbitrary learning machines allows to achieve both excellent predictive performance and superior interpretation. In contrast to standard raw feature weighting, FIRM takes the underlying correlation structure of the features into account. Thereby, it is able to discover the most relevant features, even if their appearance in the training data is entirely prevented by noise. The desirable properties of FIRM are investigated analytically and illustrated in a few simulations.
-
Gabriele Schweikert, Johnas Behr, Alexander Zien, Georg Zeller, Sören Sonnenburg, and Gunnar Rätsch. mGene.web: a web service for accurate computational gene finding. Nucleic Acids Research:gkp479, 2009.
Abstract: We describe mGene.web, a web service for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures including untranslated regions in an increasing number of organisms. With mGene.web, users have the additional possibility to train the system with their own data for other organisms on the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes. The system is built in a highly modular way, such that individual components of the framework, like the promoter prediction tool or the splice site predictor, can be used autonomously. The underlying gene finding system mGene is based on discriminative machine learning techniques and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is available at http://mgene.org/web, it is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).
Further Information: [ BibTeX | Pdf | Project Page ]
-
Vojtech Franc and Sören Sonnenburg. Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization. In M.Sebag, editor, Journal of Machine Learning Research, 10:2157-2192, October 2009.
Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a epsilon-precise solution is approximately linear in the sample size. We derive OCAS, an OCA based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state of the art SVM solvers, like SVMlight, SVMperf and BMRM, achieving speedups of over 1,200 on some datasets over SVMlight and 29 over SVMperf, while obtaining the same precise Support Vector solution. OCAS even in the early optimization steps shows often faster convergence than the so far in this domain prevailing approximative methods SGD and Pegasos. In addition, our proposed linear multi-class SVM solver OCAM achieves speedups of up to 10 compared to SVMmulti-class . Finally, we apply OCAS and OCAM to two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS we achieve state-of-the art results on the acceptor splice site recognition problem only by being able to learn on all the available 50 million examples in a 12 million dimensional feature space.
-
Gabriele Schweikert, Georg Zeller, Alexander Zien, Jonas Behr, Cheng Soon Ong, Petra Philips, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mGene: A Novel Discriminative Gene Finding System. Genome Research, 19(11):2133-2143, 2009.
Abstract: We present a highly accurate gene prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models with the predictive power of modern machine learning meth- ods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans (Coghlan et al., 2008). Considering the average of sensitivity and specificity the developmental version of mGene exhibited the best prediction performance on nucleotide, exon, and transcript level for ab initio and multiple-genome gene pre- diction tasks. The fully developed version shows superior performance in ten out of twelve evaluation criteria compared to the other participating gene finders, including Fgenesh++ (Salamov and Solovyev, 2000) and Augustus (Stanke et al., 2006). An in-depth analysis of mGene's genome-wide predictions revealed that $\ 2, 200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42\ of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8\ These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica and C. remanei (Stein et al., 2003; Sternberg et al., 2003). Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate. Availability: mGene is available as source code under Gnu Public License from the project website http://mgene.org and as a Galaxy-based webserver at http://mgene.org/web. Moreover, the gene predictions have been included in the Worm- base annotation available at http://wormbase.org and the project website.
Further Information: [ BibTeX | Pdf | Project Page ]
-
Sören Sonnenburg. Maschinelles Lernen zur Genom-Sequenzanalyse. In Dorothea Wagner et al. editor, Ausgezeichnete Informatikdissertationen 2008, pages 281-290, 2009.
Abstract: Die Entwicklung neuer Sequenziertechnologien ebnete den Weg für kosteneffiziente Genomsequenzierung. Allein im Jahr 2008 wurden etwa 250 neue Genome sequenziert. Es ist offensichtlich, dass diese gewaltigen Mengen an Daten effektive und genaue computer-gestützte Methoden zur Sequenzanalyse erfordern. Diese werden benötigt, um eines der wichtigsten Probleme der Bioinformatik zu lösen: die akkurate Lokalisation von Genen auf der DNA. In meiner Doktorarbeit habe ich auf Basis von Support Vector Machines (SVMs) genaueste genomische Signalerkenner entwickelt, die in Gensuchmaschinen verwendet werden können. Dazu wurden Datenstrukturen und Algorithmen entwickelt, die besonders effizient mit DNA-Sequenzen umgehen können. Durch Parallelisierung der Algorithmen wurde eine weitere Beschleunigung erreicht, die eine Anwendung auf gesamten Genomen ermöglicht. Einer der von mir entwickelten Genom-Signalerkenner ist Sieger in einem demnächst erscheinenden unabhängigen Vergleich von 17 Erkennern. In der Gensuchmaschine \mGene werden nun die auf meinen Methoden basierenden Signalerkenner eingesetzt. \mGene gewann in der Kategorie ab-initio Gensuche kürzlich einen internationalen Wettbewerb.
Further Information: [ BibTeX ]
-
Sören Sonnenburg and Vojtech Franc. COFFIN: A Computational Framework for Linear SVMs. Research Report CTU-CMP-2009-23, Center for Machine Perception, K13133 FEE Czech Technical University, page 14, Prague, Czech Republic, December 2009.
Abstract: In a variety of applications, kernel machines such as Support Vector Machines (SVMs) have been used with great success often delivering state-of-the-art results. Using the kernel trick, they work on several domains and even enable heterogeneous data fusion by concatenating feature spaces or multiple kernel learning. Unfortunately, they are not suited for truly large-scale applications since they suffer from the curse of supporting vectors, e.g., the speed of applying SVMs decays linearly with the number of support vectors. In this paper we develop COFFIN --- a new training strategy for linear SVMs that effectively allows the use of on demand computed kernel feature spaces and virtual examples in the primal. With linear training and prediction effort this framework leverages SVM applications to truly large-scale problems: As an example, we train SVMs for human splice site recognition involving 50 million examples and sophisticated string kernels. Additionally, we learn an SVM based gender detector on 5 million examples on low-tech hardware and achieve beyond the state-of-the-art accuracies on both tasks.
2008
-
Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, and Gunnar Rätsch. Support Vector Machines and Kernels for Computational Biology. PLoS Comput Biology, 4(10):e1000173, Public Library of Science, October 2008.
Abstract: This is a tutorial on Support Vector Machines and Kernels for Computational Biology, which takes the reader through the basics of machine learning, support vector machines (SVMs) and kernels for real-valued and sequence data. The example of splice site prediction is used to illustrate the main ideas.
Further Information: [ BibTeX | Project Page | Dataset ]
-
Sören Sonnenburg, Alexander Zien, Petra Philips, and Gunnar Rätsch. POIMs: Positional Oligomer Importance Matrices -- Understanding Support Vector Machine Based Signal Detectors. Bioinformatics, July 2008. (received the Outstanding Student Paper Award at ISMB 2008)
Abstract: Motivation: At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard to understand for humans and cannot easily be related to biological facts. Results: To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take the underlying correlation structure of k-mer features induced by overlaps of related k-mers into account. POIMs can be seen as a powerful generalization of sequence logos: they allow to capture and visualize sequence patterns that are relevant for the investigated biological phenomena. Availability: All source code, data sets, tables and figures will be made availabe at \http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM.
Further Information: [ BibTeX | Pdf | Project Page | Dataset ]
-
Vojtech Franc and Sören Sonnenburg. OCAS Optimized cutting plane algorithm for support vector machines. Proceedings of the 25nd International Machine Learning Conference, ACM Press, June 2008.
Abstract: We have developed a new Linear Support Vector Machine (SVM) training algorithm called OCAS. Its computational effort scales linearly with the sample size. In an extensive empirical evaluation OCAS significantly outperforms current state of the art SVM solvers, like SVMLight, SVMPerf and BMRM, achieving speedups of over 1,000 on some datasets over SVMLight and 20 over SVMPerf, while obtaining the same precise Support Vector solution. OCAS even in the early optimization steps shows often faster convergence than the so far in this domain prevailing approximative methods SGD and Pegasos. Effectively parallelizing OCAS we were able to train on a dataset of size 15 million examples (itself about 32GB in size) in just 671 seconds --- a competing string kernel SVM required 97,484 seconds to train on 10 million examples sub-sampled from this dataset.
Further Information: [ BibTeX | Pdf | Project Page | Dataset ]
-
Sören Sonnenburg. Machine Learning for Genomic Sequence Analysis. PhD thesis, Fraunhofer Institute FIRST, December 2008. supervised by K.-R. Müller and G.Rätsch
Abstract: With the development of novel sequencing technologies, the way has been paved for cost efficient, high-throughput whole genome sequencing. In the year 2008 alone, about 250 genomes will have been sequenced. It is self-evident that the handling of this wealth of data requires efficient and accurate computational methods for sequence analysis. They are needed to tackle one of the most important problems in computational biology: the localisation of genes on DNA. In this thesis, I describe the development of state-of-the-art genomic signal detectors based on Support Vector Machines (SVM) that can be used in gene finding systems. The main contributions of this work are computationally efficient and accurate String Kernels, Large Scale Learning Methods, Methods to interprete SVM Classifiers.
2007
-
Alexander Zien, Petra Philips, and Sören Sonnenburg. Computing Positional Oligomer Importance Matrices (POIMs). Research Report; Electronic Publication 2, Fraunhofer Institute FIRST, December 2007.
Abstract: We show how to efficiently compute Positional Oligomer Importance Matrices (POIMs) which are a novel and powerful way to extract, rank, and visualize higher order (i.e. oligo-nucleotide) compositional information for nucleotide sequences. Given a scoring function for nucleotide sequences which is linear w.r.t. positionwise occurrences of oligomers, POIMs quantify the increase (or decrease) of the expected score caused by information about each k-mer at each position. We demonstrate how to obtain a recursive algorithm which enables us to efficiently compute POIMs by using string index data structures. This is especially useful for scoring functions whose linear weighting is sparse, as is the case for the scoring function produced by string kernel classifiers.
-
Vojtech Franc and Sören Sonnenburg. Optimized cutting plane algorithm for support vector machines. Research Report; Electronic Publication 1, Fraunhofer Institute FIRST, December 2007.
Abstract: We have developed a new Linear Support Vector Machine (SVM) training algorithm called OCAS, which itself significantly outperforms current state of the art SVM solvers, like SVMLight, SVMPerf and BMRM, achieving speedups of over 1,200 on some datasets over SVMLight and 29 over SVMPerf, while obtaining the same precise Support Vector solution. Using OCAS we were able to train on a dataset of size 15 million examples (itself about 32GB in size) in just 671 seconds - a competing string kernel SVM required 97,484 seconds to train on 10 million examples sub-sampled from this dataset. This was possible due to effective parallelizing of the core parts of OCAS which lead to additional speedups of factors up to 4.6 on a multi-core multiprocessor machine.
-
Sören Sonnenburg , Gabriele Schweikert, Petra Philips, Jonas Behr, and Gunnar Rätsch. Accurate Splice Site Prediction. BMC Bioinformatics, Special Issue from NIPS workshop on New Problems and Methods in Computational Biology Whistler, Canada, 18 December 2006, 8:(Suppl. 10):S7, December 2007.
Abstract: Background: For splice site recognition, one has to solve two classification problems: discriminating true from decoy splice sites for both acceptor and donor sites. Gene finding systems typically rely on Markov Chains to solve these tasks. Results: In this work we consider Support Vector Machines for splice site recognition. We employ the so-called weighted degree kernel which turns out well suited for this task, as we will illustrate in several experiments where we compare its prediction accuracy with that of recently proposed systems. We apply our method to the genome-wide recognition of splice sites in Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, and Homo sapiens. Our performance estimates indicate that splice sites can be recognized very accurately in these genomes and that our method outperforms many other methods including Markov Chains, GeneSplicer and SpliceMachine. We provide genome-wide predictions of splice sites and a stand-alone prediction tool ready to be used for incorporation in a gene finder.
-
Sören Sonnenburg, Mikio Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes and Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch and Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, and Robert Williamson. The Need for Open Source Software in Machine Learning. In David Cohn, editor, Journal of Machine Learning Research, 8:2443-2466, September 2007.
Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.
-
Sören Sonnenburg, Gunnar Rätsch, and Konrad Rieck. Large Scale Learning with String Kernels. In Bottou, Leon and Chapelle, Olivier and DeCoste, Dennis and Weston, Jason, editor, In Large Scale Kernel Machines, pages 73-103, MIT Press, Cambridge, MA. 2007.
Abstract: In genomic sequence analysis tasks like splice site recognition or promoter identification, large amounts of training sequences are available, and indeed needed to achieve sufficiently high classification performances. In this chapter we study string kernels that can be computed in linear time w.r.t. the length of the input sequences. In particular the recently proposed Spectrum kernel, the Weighted Degree kernel (WD) and the Weighted Degree kernel with shifts, which have been successfully used for various sequence analysis tasks. We discuss extensions using data structures such as tries and suffix trees as well as modifications of a SVM chunking algorithm in order to significantly accelerate SVM training and their evaluation on test sequences. Our simulations using the WD kernel and Spectrum kernel show that large scale SVM training can be accelerated by factors of 7 and 60 times, respectively, while requiring considerably less memory. We demonstrate that these algorithms can be effectively parallelized for further acceleration. Our method allows us to train SVMs on sets as large as 10 million sequences and solve Multiple Kernel Learning problems with 1 million sequences. Moreover, using these techniques the evaluation on new sequences is often several thousand times faster, allowing us to apply the classifiers on genome-sized data sets with seven billion test examples. We finally demonstrate how the proposed data structures can be used to understand the SVM classifiers decision function. All presented algorithms are implemented in our Machine Learning toolbox \SHOGUN for which the source code is publicly available at \http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun.
-
Gunnar Rätsch and Sören Sonnenburg. Large Scale Hidden Semi-Markov SVMs. In B. Schölkopf and J. Platt and T. Hoffman, editor, In Advances in Neural Information Processing Systems 19, pages 1161-1168, MIT Press, Cambridge, MA, 2007.
Abstract: We describe Hidden Semi-Markov Support Vector Machines (SHM SVMs), an extension of HM SVMs to semi-Markov chains. This allows us to predict seg- mentations of sequences based on segment-based features measuring properties such as the length of the segment. We propose a novel technique to partition the problem into sub-problems. The independently obtained partial solutions can then be recombined in an efficient way, which allows us to solve label sequence learn- ing problems with several thousands of labeled sequences. We have tested our algorithm for predicting gene structures, an important problem in computational biology. Results on a well-known model organism illustrate the great potential of SHM SVMs in computational biology.
-
Konrad Rieck, Pavel Laskov, and Sören Sonnenburg. Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees. In B. Schölkopf and J. Platt and T. Hoffman, editor, In Advances in Neural Information Processing Systems 19, pages 1177-1184, MIT Press, Cambridge, MA, 2007.
Abstract: We propose a generic algorithm for computation of similarity measures for sequential data. The algorithm uses generalized suffix trees for efficient calculation of various kernel, distance and non-metric similarity functions. Its worst-case run-time is linear in the length of sequences and independent of the underlying embedding language, which can cover words, k-grams or all contained subsequences. Experiments with network intrusion detection, DNA analysis and text processing applications demonstrate the utility of distances and similarity coefficients for sequences as alternatives to classical kernel functions.
-
Gunnar Rätsch, Sören Sonnenburg, Jagan Srinivasan, Hanh Witte, Klaus-Robert Müller, Ralf Sommer, and Bernhard Schölkopf. Improving the C. elegans genome annotation using machine learning. PLoS Computational Biology, 3:e20, February 2007.
Abstract: For modern biology, precise genome annotations are of prime importance as they allow the accurate definition of genic regions. We employ state of the art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA utilizing recent advances in support vector machines and label sequence learning. In 87\ (coding and untranslated regions) and 95\ (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37\ and 50\ respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18\ of the considered cases, while our predictions deviate from the truth only in 10-13\ We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75\ of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.
2006
-
Sören Sonnenburg, Alexander Zien, and Gunnar Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472-480, 2006.
Abstract: We develop new methods for finding transcription start sites (TSS) of RNA Polymerase II binding genes in genomic DNA sequences. Employing Support Vector Machines with advanced sequence kernels, we achieve drastically higher prediction accuracies than state-of-the-art methods. Motivation: One of the most important features of genomic DNA are the protein-coding genes. While it is of great value to identify those genes and the encoded proteins, it is also crucial to understand how their transcription is regulated. To this end one has to identify the corresponding promoters and the contained transcription factor binding sites. TSS finders can be used to locate potential promoters. They may also be used in combination with other signal and content detectors to resolve entire gene structures. Results: We have developed a novel kernel based method – called ARTS – that accurately recognizes transcription start sites in human. The application of otherwise too computationally expensive Support Vector Machines was made possible due to the use of efficient training and evaluation techniques using suffix tries. In a carefully designed experimental study, we compare our TSS finder to state-of-the-art methods from the literature: McPromoter, Eponine and FirstEF. For given false positive rates within a reasonable range, we consistently achieve considerably higher true positive rates. For instance, ARTS finds about 35\ true positives at a false positive rate of 1/1000, where the other methods find about a half (18\
-
Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large Scale Multiple Kernel Learning. In K.Bennett and E.P.-Hernandez, editor, Journal of Machine Learning Research, 7:1531-1565, July 2006.
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lankriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice dataset from computational biology. We integrated Multiple Kernel Learning in our Machine Learning toolbox \SHOGUN for which the source code is publicly available at \http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun.
-
Sören Sonnenburg, Gunnar Rätsch, and Christin Schäfer. A General and Efficient Multiple Kernel Learning Algorithm. In Y. Weiss and B. Schölkopf and J. Platt, editor, Advances in Neural Information Processing Systems 18, pages 1273-1280, MIT Press, Cambridge, MA, 2006.
Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lankriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constraint quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm helps for automatic model selection, improving the interpretability of the learning result and works for hundred thousands of examples or hundreds of kernels to be combined.
-
Gunnar Rätsch, Sören Sonnenburg, and Christin Schäfer. Learning Interpretable SVMs for Biological Sequence Classification. BMC Bioinformatics, Special Issue from NIPS workshop on New Problems and Methods in Computational Biology Whistler, Canada, 18 December 2004, 7:(Suppl. 1):S9, March 2006.
Abstract: Background: Support Vector Machines (SVMs) -- using a variety of string kernels -- have been successfully applied to biological sequence classification problems. While SVMs achieve high classification accuracy they lack interpretability. In many applications, it does not suffice that an algorithm just detects a biological signal in the sequence, but it should also provide means to interpret its solution in order to gain biological insight. Results: We propose novel and efficient algorithms for solving the so-called Support Vector Multiple Kernel Learning problem. The developed techniques can be used to understand the obtained support vector decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We apply the proposed methods to the task of acceptor splice site prediction and to the problem of recognizing alternatively spliced exons. Our algorithms compute sparse weightings of substring locations, highlighting which parts of the sequence are important for discrimination. Conclusion: The proposed method is able to deal with thousands of examples while combining hundreds of kernels within reasonable time, and reliably identifies a few statistically significant positions.
2005
-
Gunnar Rätsch, Sören Sonnenburg, and Bernhard Schölkopf. RASE: Recognition of Alternatively Spliced Exons in C. elegans. Proceedings of the Annual International conference on Intelligent Systems for Molecular Biology (ISMB), Bioinformatics, 21:i369-i377, June 2005.
Abstract: Eukaryotic pre-mRNAs are spliced to form mature mRNA. Pre-mRNA alternative splicing greatly increases the complexity of gene expression. Estimates show that more than half of the human genes and at least one-third of the genes of less complex organisms, such as nematodes or flies, are alternatively spliced. In this work, we consider one major form of alternative splicing, namely the exclusion of exons from the transcript. It has been shown that alternatively spliced exons have certain properties that distinguish them from constitutively spliced exons. Although most recent computational studies on alternative splicing apply only to exons which are conserved among two species, our method only uses information that is available to the splicing machinery, i.e. the DNA sequence itself. We employ advanced machine learning techniques in order to answer the following two questions: (1) Is a certain exon alternatively spliced? (2) How can we identify yet unidentified exons within known introns? We designed a support vector machine (SVM) kernel well suited for the task of classifying sequences with motifs having positional preferences. In order to solve the task (1), we combine the kernel with additional local sequence information, such as lengths of the exon and the flanking introns. The resulting SVM-based classifier achieves a true positive rate of 48.5\ at a false positive rate of 1\ By scanning over single EST confirmed exons we identified 215 potential alternatively spliced exons. For 10 randomly selected such exons we successfully performed biological verification experiments and confirmed three novel alternatively spliced exons. To answer question (2), we additionally used SVM-based predictions to recognize acceptor and donor splice sites. Combined with the above mentioned features we were able to identify 85.2\ of skipped exons within known introns at a false positive rate of 1\
-
Sören Sonnenburg, Gunnar Rätsch, and Bernhard Schölkopf. Large Scale Genomic Sequence SVM Classifiers. Proceedings of the 22nd International Machine Learning Conference, ACM Press, 2005.
Abstract: In genomic sequence analysis tasks like splice site recognition or promoter identification, large amounts of training sequences are available, and indeed needed to achieve sufficiently high classification performances. In this work we study two recently proposed and successfully used kernels, namely the Spectrum kernel and the Weighted Degree kernel (WD). In particular, we suggest several extensions using Suffix Trees and modifications of an SMO-like SVM training algorithm in order to accelerate the training of the SVMs and their evaluation on test sequences. Our simulations show that for the spectrum kernel and WD kernel, large scale SVM training can be accelerated by factors of 20 and 4 times, respectively, while using much less memory (e.g. no kernel caching). The evaluation on new sequences is often several thousand times faster using the new techniques (depending on the number of Support Vectors). Our method allows us to train on sets as large as one million sequences.
-
Sören Sonnenburg, Gunnar Rätsch, and Christin Schäfer. Learning Interpretable SVMs for Biological Sequence Classification. RECOMB 2005, LNBI 3500, pages 389-407, Springer-Verlag Berlin Heidelberg, 2005.
Abstract: Abstract. We propose novel algorithms for solving the so-called Support Vector Multiple Kernel Learning prob- lem and show how they can be used to understand the resulting support vector decision function. While classical kernel-based algorithms (such as SVMs) are based on a single kernel, in Multiple Kernel Learning a quadratically- constraint quadratic program is solved in order to find a sparse convex combination of a set of support vector kernels. We show how this problem can be cast into a semi-infinite linear optimization problem which can in turn be solved efficiently using a boosting-like iterative method in combination with standard SVM optimization algorithms. The proposed method is able to deal with thousands of examples while combining hundreds of kernels within reasonable time. In the second part we show how this technique can be used to understand the obtained decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We consider the problem of splice site identification and combine string kernels at different sequence positions and with various substring (oligomer) lengths. The proposed algorithm computes a sparse weighting over the length and the substring, high- lighting which substrings are important for discrimination. Finally, we propose a bootstrap scheme in order to reli- ably identify a few statistically significant positions, which can then be used for further analysis such as consensus finding.
-
Klaus-Robert Müller, Gunnar Rätsch, Sören Sonnenburg, Sebastian Mika, Michael Grimm, and Nikolaus Heinrich. Classifying 'Drug-likeness' with Kernel-Based Learning Methods. J. Chem. Inf. Model, 45:249-253, 2005.
Abstract: In this article we report about a successful application of modern machine learning technology, namely Support Vector Machines, to the problem of assessing the 'drug-likeness' of a chemical from a given set of descriptors of the substance. We were able to drastically improve the recent result by Byvatov et al. (2003) on this task and achieved an error rate of about 7\ on unseen compounds using Support Vector Machines. We see a very high potential of such machine learning techniques for a variety of computational chemistry problems that occur in the drug discovery and drug design process.
2004
-
Gunnar Rätsch and Sören Sonnenburg. Accurate Splice Site Prediction for Caenorhabditis Elegans. In Kernel Methods in Computational Biology, page 277-298, MIT Press, 2004.
Abstract: We propose a new system for predicting the splice form of Caenorhabditis elegans genes. As a first step we generate a clean set of genes from available exressed sequence tags (EST) and complete complementary (cDNA) sequences. From all such genes we then generate potential acceptor and donor sites as they would be required by any gene finder. This leads to a clean set of true and decoy splice sites. In a second step we use support vector machines (SVMs) with appropriately designed kernels to learn to distinguish between true and decoy sites. Using the newly generated data and the novel kernels we could considerably improve our previous results on the same task. In the last step we design and test a new splice finder system that combines the SVM predictions with additional statistical information about splicing. Using this system we are able to predict the exon-intron structure of a given gene with known translation initiation and stop co don site. The system has been tested successfully on a newly generated set of genes and compared with GenScan. We found that our system predicts the correct splice form for more than 92\ of these genes, whereas GenScan only achieves 77.5\ accuracy.
2002
-
Sören Sonnenburg. New Methods for Splice Site Recognition. 2002. supervised by K.-R. Müller H.-D. Burkhard and G.Rätsch
Abstract: Modelling \splice sites is considered a difficult task, and as of this writing, the procedure of splicing is still not well understood. We combine successful \discriminative learners like Support Vector Machines (SVM) and \descriptive learners like Hidden Markov Models (HMMs) to separate true splice sites from decoys. Recently developed kernel functions like the TOP- and Fisher Kernel (FK) that are derived from generative models are used to combine SVMs and HMMs. Furthermore, results for the well known Locality Improved Kernel are presented and its connection to the FK, derived from a special HMM is shown. Finally we provide an experimental analysis of splice sites and investigate the classification performance using a variety of learning machines.
-
Sören Sonnenburg, Gunnar Rätsch, Arun Jagota, and Klaus-Robert Müller. New Methods for Splice-Site Recognition. In Proceedings of the International Conference on Artifical Neural Networks., pages 329-336, 2002. Copyright by Springer
Abstract: Splice sites are locations in DNA which separate protein-coding regions (exons) from noncoding regions (introns). Accurate splice site detectors thus form important components of computational gene finders. We pose splice site recognition as a classification problem with the classifier learnt from a labeled data set consisting of only local information around the potential splice site. Note that finding the correct position of splice sites without using global information is a rather hard task. We analyze the genomes of the nematode Caenorhabditis elegans and of humans using specially designed support vector kernels. One of the kernels is adapted from our previous work on detecting translation initiation sites in vertebrates and another uses an extension to the well-known Fisher-kernel. We find excellent performance on both data sets.
-
Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, and Klaus-Robert Müller. A New Discriminative Kernel from Probabilistic Models. Neural Computation, 14:2397-2414, 2002.
Abstract: Recently, Jaakkola and Haussler proposed a method for constructing ker- nel functions from probabilistic models. Their so called “Fisher kernel” has been combined with discriminative classifiers such as SVM and applied successfully in e.g. DNA and protein analysis. Whereas the Fisher kernel (FK) is calculated from the marginal log-likelihood, we propose the TOP kernel derived from Tangent vectors Of Posterior log-odds. Furthermore we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing FK and TOP. In experiments our new discriminative TOP kernel compares favorably to the Fisher kernel.
-
Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, and Klaus-Robert Müller. A New Discriminative Kernel from Probabilistic Models. In T.G.Dietterich and S.Becker and Z.Ghahramani, editor, Advances in Neural information processings systems, pages 977-984, MIT Press, Cambridge, MA, 2002.
Abstract: Recently, Jaakkola and Haussler proposed a method for constructing ker- nel functions from probabilistic models. Their so called “Fisher kernel” has been combined with discriminative classifiers such as SVM and applied successfully in e.g. DNA and protein analysis. Whereas the Fisher kernel (FK) is calculated from the marginal log-likelihood, we propose the TOP kernel derived from Tangent vectors Of Posterior log-odds. Furthermore we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing FK and TOP. In experiments our new discriminative TOP kernel compares favorably to the Fisher kernel.
2001
-
Sören Sonnenburg. Hidden Markov Model for Genome Analysis. 2001.
Abstract: A Hidden Markov Model (HMM) is a Markov chain, where each state generates an observation. One only sees the observations, and the goal is to infer the \hidden state sequence. HMMs have been applied successfully e.g.to Speech Recognition tasks and to biological sequence analysis. They have recently been used for gene finding and for identifying homologous sequences. The aim of our work was to create a reliable, flexible and efficient tool that can be used for \genome analysis. Our implementation exhibits some particularly interesting features, such as Viterbi and Baum-Welch training algorithms, customizable model structures and support for higher order Hidden Markov models. It has been particularly tuned for large numbers of observations. To speed-up the computations it also supports multi-processor systems. By this work we have created the basis for developing more sophisticated algorithms for genome analysis in subsequent projects.