Supplementary Material for the RECOMB 2005 and BMC Bioinformatics submissions
Learning Interpretable SVMs for Biological Sequence Classification
This page contains additional material for the above-mentioned paper. We tried to document exactly
- which data sets were used and
- what results were achieved.
In Section 1 we provide the toy data
sets for the different noise levels and the C. elegans and Drosophila
melanogaster acceptor splice data sets.
Larger versions of the toy data result images
that appear in the paper can be found in
Section 2. We extended the
experiments for C. elegans and Drosophila
melanogaster to 100 bootstrap trials and also show the ROC score achieved in each trial. These additional figures can also be found in
Section 2.
A downloadable version of the software will be made available soon.
Download Datasets
The datasets are in the format
-1 TTCTGAAGAAGACGATGACGAAGACGAAGGAGAAGCCGTTGCAGAACTTGTCACAAAGTG
-1 CCAACCTAATCGTTATACATATGTATTTACAGTCGCAAATGACAATTGAACAAATAAATG
....
+1 AATGTTTCAATTATAAAAATTGTTAATTACAGGGGGACACCTGTATCAGTGTGACATTTC
....
Here -1 marks a randomly generated site (toy data) or a non-splice site (splice data),
while +1 marks a site with the custom motif (toy data) or a true splice site (splice data). The sequence follows the label after a space.
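The format described above can be read with a few lines of Python. The following is only a minimal sketch: the function name `parse_labeled_sequences` is hypothetical, and it assumes one example per line with the label and the sequence separated by whitespace.

```python
from typing import Iterable, List, Tuple


def parse_labeled_sequences(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Parse lines of the form '<label> <sequence>' into (label, sequence) pairs."""
    examples = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        label_text, sequence = line.split(None, 1)
        label = int(label_text)  # int("+1") == 1 and int("-1") == -1
        if label not in (-1, 1):
            raise ValueError(f"unexpected label: {label_text!r}")
        examples.append((label, sequence.strip()))
    return examples


# Two lines taken from the format example above.
sample = [
    "-1 TTCTGAAGAAGACGATGACGAAGACGAAGGAGAAGCCGTTGCAGAACTTGTCACAAAGTG",
    "+1 AATGTTTCAATTATAAAAATTGTTAATTACAGGGGGACACCTGTATCAGTGTGACATTTC",
]
examples = parse_labeled_sequences(sample)
positives = sum(1 for label, _ in examples if label == 1)
print(len(examples), positives)  # prints: 2 1
```

For a downloaded file, the same function could be fed an open file handle instead of the `sample` list.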
- Toy Data
The supplied 11500 sequences are all 50bp long.
- no noise
- 2 symbols replaced in each motif
- 4 symbols replaced in each motif
- 5 symbols replaced in each motif
- C. elegans acceptor sites
The supplied 262421 sequences (15507 true splice
sites, i.e. 5.9%) are all 141bp long and centered
around the true acceptor site.
- Drosophila melanogaster acceptor sites
The supplied 98367 sequences (1583 true splice
sites, i.e. 1.6%) are all 141bp long and
centered around the true acceptor site.
Results for Weighted Degree Kernel
Each result file starts with a line giving the validation and test error,
followed by the classifier outputs, one per line.
validation error=0.014181 test error=0.01214
-12.143139
-10.286769
...
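A result file in this layout could be read as sketched below. This is an illustration only: `parse_result_file` is a hypothetical helper, and it assumes the header line comes first and every remaining non-empty line holds exactly one number.

```python
import re
from typing import Iterable, List, Tuple

# Header layout as shown in the example: "validation error=<x> test error=<y>"
HEADER = re.compile(r"validation error=(\S+)\s+test error=(\S+)")


def parse_result_file(lines: Iterable[str]) -> Tuple[float, float, List[float]]:
    """Return (validation error, test error, classifier outputs)."""
    iterator = iter(lines)
    header = next(iterator)
    match = HEADER.search(header)
    if match is None:
        raise ValueError(f"unexpected header line: {header!r}")
    validation_error = float(match.group(1))
    test_error = float(match.group(2))
    # Every remaining non-empty line is one classifier output.
    outputs = [float(line) for line in iterator if line.strip()]
    return validation_error, test_error, outputs


sample_result = [
    "validation error=0.014181 test error=0.01214",
    "-12.143139",
    "-10.286769",
]
validation_error, test_error, outputs = parse_result_file(sample_result)
print(validation_error, test_error, len(outputs))
```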
SVMs including kernel weights are saved in the following
format:
b=-3.577909
alphas=[
2 -1.000000
13 +0.373805
57 +1.000000
68 -0.332549
85 -1.000000
...
]
betas=[
[+0.373805 +0.373805 +0.373805 +0.373805 +0.373805 +0.373805 +0.373805+0.373805 +0.373805 +0.373805 +0.373805];
[+0.373805 +0.373805 +0.373805 +0.373805 +0.373805 +0.373805 +0.373805+0.373805 +0.373805 +0.373805 +0.373805];
[+0.373805 +0.373805 +0.373805 +0.373805 +0.373805 +0.373805 +0.373805+0.373805 +0.373805 +0.373805 +0.373805];
...
]
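The layout above can be parsed with regular expressions, as in the following sketch. The helper `parse_svm_text` is hypothetical. The meaning of the alpha indices and of the rows in `betas` is not stated here, so the code only recovers the numbers and makes no assumption about their interpretation. Numbers are matched individually, so values written without a separating space are still split correctly.

```python
import re
from typing import Dict, List, Tuple

FLOAT = r"[+-]?\d+\.\d+"  # signed decimal number as used in the file


def parse_svm_text(text: str) -> Tuple[float, Dict[int, float], List[List[float]]]:
    """Parse the b / alphas / betas layout into (b, alphas, betas)."""
    b_match = re.search(r"^b=(" + FLOAT + r")", text, re.MULTILINE)
    alphas_match = re.search(r"alphas=\[(.*?)\]", text, re.DOTALL)
    betas_match = re.search(r"betas=\[(.*)\]", text, re.DOTALL)
    if not (b_match and alphas_match and betas_match):
        raise ValueError("text does not follow the expected layout")

    alphas = {}
    for line in alphas_match.group(1).splitlines():
        parts = line.split()
        if len(parts) == 2:  # skips blank lines and "..." placeholders
            alphas[int(parts[0])] = float(parts[1])

    betas = []
    for row in re.findall(r"\[([^\[\]]*)\]", betas_match.group(1)):
        betas.append([float(value) for value in re.findall(FLOAT, row)])

    return float(b_match.group(1)), alphas, betas


sample_svm = """b=-3.577909
alphas=[
2 -1.000000
13 +0.373805
]
betas=[
[+0.373805 +0.373805+0.373805];
[+0.373805 +0.373805 +0.373805];
]
"""
b, alphas, betas = parse_svm_text(sample_svm)
print(b, len(alphas), len(betas))
```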
Weights obtained in bootstrap trials are saved as shown here:
betas of trial 001 = [
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +1.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
]
betas of trial 002 = [
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +1.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
+0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000 +0.000000
]
...
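The per-trial weight blocks shown above could be loaded and averaged as in this sketch. Both `parse_bootstrap_betas` and `mean_weights` are hypothetical helpers, and the code assumes every trial block has the same number of rows and columns.

```python
import re
from typing import Dict, List

# One block per trial: "betas of trial NNN = [ ...rows of numbers... ]"
TRIAL = re.compile(r"betas of trial (\d+) = \[(.*?)\]", re.DOTALL)


def parse_bootstrap_betas(text: str) -> Dict[int, List[List[float]]]:
    """Map each trial number to its weight matrix (list of rows)."""
    trials = {}
    for match in TRIAL.finditer(text):
        rows = []
        for line in match.group(2).splitlines():
            values = [float(value) for value in line.split()]
            if values:
                rows.append(values)
        trials[int(match.group(1))] = rows
    return trials


def mean_weights(trials: Dict[int, List[List[float]]]) -> List[List[float]]:
    """Average the weight matrices cell by cell over all trials."""
    matrices = list(trials.values())
    first = matrices[0]
    return [
        [sum(m[r][c] for m in matrices) / len(matrices) for c in range(len(first[r]))]
        for r in range(len(first))
    ]


sample_betas = """betas of trial 001 = [
+0.000000 +0.000000 +0.000000
+0.000000 +1.000000 +0.000000
]
betas of trial 002 = [
+0.000000 +0.000000 +1.000000
+0.000000 +0.000000 +0.000000
]
"""
trials = parse_bootstrap_betas(sample_betas)
average = mean_weights(trials)
print(sorted(trials), average[1][1])
```

For the gzip-compressed weight files, the text could be obtained with `gzip.open(path, "rt").read()` before calling `parse_bootstrap_betas`.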
Comparison with original WD Kernel
The C. elegans dataset was split as
follows: the first 100,000 examples for training, the next
100,000 examples for validation, and the remaining examples
for testing.
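The sequential split can be written down directly. The sketch below uses a hypothetical helper `split_sequential` and applies it to the 262421 example indices of the C. elegans set, so the size of the test portion follows from the numbers given above.

```python
from typing import Sequence, Tuple


def split_sequential(
    examples: Sequence, n_train: int = 100_000, n_validation: int = 100_000
) -> Tuple[Sequence, Sequence, Sequence]:
    """Split in file order: first block train, next block validation, rest test."""
    train = examples[:n_train]
    validation = examples[n_train:n_train + n_validation]
    test = examples[n_train + n_validation:]
    return train, validation, test


# Example indices stand in for the 262421 C. elegans sequences.
train, validation, test = split_sequential(range(262_421))
print(len(train), len(validation), len(test))  # prints: 100000 100000 62421
```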
Model selection result for the original WD kernel. Elements in the table denote the ROC score achieved on the validation set.
| M\C | 0.5 | 2 | 5 | 10 |
| 10 | 99.65% | 99.65% | 99.65% | 99.65% |
| 12 | 99.65% | 99.65% | 99.65% | 99.65% |
| 15 | 99.66% | 99.66% | 99.66% | 99.66% |
| 17 | 99.66% | 99.66% | 99.66% | 99.66% |
| 20 | 99.66% | --.--% | --.--% | 99.66% |

Model selection result for the proposed WD kernel (weights are also learned). Elements in the table denote the ROC score achieved on the validation set.
| M\C | 0.5 | 2 | 5 | 10 |
| 10 | 99.66% | 99.66% | 99.66% | 99.65% |
| 12 | 99.66% | 99.66% | 99.66% | 99.66% |
| 15 | 99.66% | 99.66% | 99.65% | 99.65% |
| 17 | 99.65% | 99.66% | 99.65% | 99.65% |
| 20 | 99.65% | --.--% | --.--% | 99.66% |
Elements in the two tables differ only slightly (in the third and following decimal places). Test results: original WD-SVM 99.67%, proposed WD-SVM 99.66%.
- Relation to Positional Weight Matrices
Download WD-SVM Weights single_order_wd_betas.asc.gz
- Toy Data
From the training data in the toy dataset above
we randomly generated 100 bootstrap replicates.
In addition to what is shown in the paper, we
provide a plot with the ROC score the SVM achieved
on the validation set in each trial (bottommost row).
Important weights are shown in bright yellow
in the upper rows; weights that are statistically
significant at alpha=5% are shown in the
topmost row.
Columns correspond to different noise levels.
Note that the classification performance drops
drastically in the last column, where 5 of the 7
nucleotides in each motif were randomly
replaced.
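How the replicates were drawn is not specified here. The sketch below assumes the standard bootstrap, that is, sampling with replacement until each replicate has the same size as the training set. The function name, the seed and the stand-in data are illustrative only.

```python
import random
from typing import List, Sequence


def bootstrap_replicates(
    examples: Sequence, n_replicates: int = 100, seed: int = 0
) -> List[list]:
    """Draw n_replicates resamples of len(examples) items, with replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n = len(examples)
    return [
        [examples[rng.randrange(n)] for _ in range(n)]
        for _ in range(n_replicates)
    ]


# Small stand-in for a training set of (label, sequence) pairs.
training = [(-1, "ACGT"), (+1, "TTAG"), (-1, "GGCA"), (+1, "CAGT")]
replicates = bootstrap_replicates(training)
print(len(replicates), len(replicates[0]))  # prints: 100 4
```

Each replicate would then be used to train one SVM, giving one weight matrix and one ROC score per trial.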
- C. elegans acceptor
From the C. elegans dataset above we
randomly sampled 11500 examples and
used the first 1500 for training and the
remaining 10000 examples for testing. In
addition to what is shown in the paper, we provide
a plot with the ROC score achieved in each trial.
Note the high score of 97.5% on average.
We then repeated the experiments with more training examples (5000). As expected, higher degrees also become significant.
Download bootstrap WD weights celegans_bootstrap_betas.asc.gz
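The per-trial ROC score can be recomputed from classifier outputs and true labels. The sketch below assumes that "ROC score" means the area under the ROC curve and computes it from ranks (the Mann-Whitney formulation), giving tied outputs their average rank. The function name `roc_score` is hypothetical.

```python
from typing import Sequence


def roc_score(labels: Sequence[int], outputs: Sequence[float]) -> float:
    """Area under the ROC curve for labels in {-1, +1} and real-valued outputs."""
    order = sorted(range(len(outputs)), key=lambda i: outputs[i])
    ranks = [0.0] * len(outputs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of outputs tied with position i.
        while j + 1 < len(order) and outputs[order[j + 1]] == outputs[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_positive = sum(1 for y in labels if y == 1)
    n_negative = len(labels) - n_positive
    if n_positive == 0 or n_negative == 0:
        raise ValueError("both classes are needed to compute a ROC score")
    positive_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (positive_rank_sum - n_positive * (n_positive + 1) / 2) / (
        n_positive * n_negative
    )


perfect = roc_score([-1, -1, 1, 1], [-2.0, -1.0, 1.0, 2.0])
mixed = roc_score([1, -1, -1, 1], [0.1, 0.4, 0.35, 0.8])
print(perfect, mixed)  # prints: 1.0 0.5
```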
- Drosophila acceptor
From the Drosophila melanogaster
dataset above we randomly sampled 11500
examples and used the first 1500 for training
and the remaining 10000 examples for testing. In
addition to what is shown in the paper, we provide
a plot with the ROC score achieved in each trial.
Note the low ROC score of 90% on average
compared to the C. elegans results above.
This might be due to the ratio of positive to
negative examples: only 24 positive examples are
expected among the 1500 training examples,
whereas 88 are expected for C. elegans.
Download bootstrap WD weights drosophila_bootstrap_betas.asc.gz
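The expected numbers of positive training examples quoted above follow from the dataset sizes given in Section 1. The short check below assumes the 1500 training examples are drawn uniformly at random from the full dataset; `expected_positives` is a hypothetical helper.

```python
def expected_positives(n_train: int, n_positive: int, n_total: int) -> float:
    """Expected number of positives among n_train uniformly sampled examples."""
    return n_train * n_positive / n_total


# Dataset sizes as listed in the download section.
celegans = expected_positives(1500, 15507, 262421)
drosophila = expected_positives(1500, 1583, 98367)
print(round(celegans, 1), round(drosophila, 1))  # prints: 88.6 24.1
```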