Applying Support Vector Machines for Gene Ontology based gene function prediction.
Vinayagam A, König R, Moormann J, Schubert F, Eils R, Glatting KH, Suhai S.
Abstract
The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods has been designed to annotate sequences on a large scale. However, these methods either apply only to specific subsets, produce results that are not formalised, or do not provide precise confidence estimates for their predictions. We have developed a large-scale annotation system that tackles all of these shortcomings. In our approach, annotation was provided through Gene Ontology terms by applying multiple Support Vector Machines (SVMs) for the classification of correct and false predictions. The general performance of the system was benchmarked with a large dataset. An organism-wise cross-validation was performed to define confidence estimates, resulting in an average precision of 80% for 74% of all test sequences. The validation results show that the prediction performance was organism-independent and could reproduce the annotation of other automated systems as well as high-quality manual annotations. We applied our trained classification system to Xenopus laevis sequences, yielding functional annotation for more than half of the known expressed genome. Compared to the currently available annotation, we provided more than twice the number of contigs with good quality annotation, and we additionally assigned a confidence value to each predicted GO term. We present a complete automated annotation system that overcomes many of the usual problems by applying the controlled vocabulary of the Gene Ontology and an established classification method to large and well-described sequence data sets. In a case study, the function of Xenopus laevis contig sequences was predicted, and the results are publicly available at ftp://genome.dkfz-heidelberg.de/pub/agd/gene_association.agd_Xenopus.
Figure 1. A schematic representation of possible GO term relationships: A: GO1 is a "parent" of GO2 in a single path relationship. B: GO1 is a "parent" of GO2 in a multiple path relationship. C: GO1 is a "child" of GO2 in a single path relationship. D: GO1 is a "child" of GO2 in a multiple path relationship. E: GO1 and GO2 are "siblings" in a single path relationship. F: GO1 and GO2 are "siblings" in a multiple path relationship. MF denotes the molecular function node (root).
Figure 2. General prediction scheme: The training sequences (S1) with known function (GOx, GOy, GOz) were searched against the protein databases, yielding hits with molecular function GO terms (GO1, GO2, GO3, GO4, GO5, GO6) and their features (see methods), sketched as dots in a two-dimensional feature space. GO terms of the hits that matched GO terms of the query were classified as +1 (correct, green), and all others as -1 (false, red). The classifier (SVM) separated the classes by an optimal separating hyperplane (OSH). Unknown sequences (S2) were searched in the same manner and the GO terms (GOn, GOm, GOo) were extracted. Their features were calculated and mapped into the feature space. The corresponding labels were assigned (correct/false).
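The labelling step in this scheme can be illustrated with a minimal sketch. It assumes that a hit's GO term counts as correct only when it is identical to one of the query's known terms; the original system may use a different matching rule. The term names, feature values and the function `label_hit_terms` are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Set, Tuple

Sample = Tuple[str, List[float], int]  # (GO term, feature vector, label)


def label_hit_terms(query_terms: Set[str],
                    hit_features: Dict[str, List[float]]) -> List[Sample]:
    """Label each GO term found among the hits of one training sequence.

    A term receives +1 if it is part of the query's known annotation
    (exact match, an assumption of this sketch) and -1 otherwise.
    """
    samples = []
    for term in sorted(hit_features):
        label = 1 if term in query_terms else -1
        samples.append((term, hit_features[term], label))
    return samples


# Hypothetical training sequence with three known GO terms.
known_terms = {"GO1", "GO4", "GO7"}

# Hypothetical GO terms collected from the database hits, each with an
# illustrative two-dimensional feature vector.
hits = {
    "GO1": [0.9, 0.8],
    "GO2": [0.2, 0.1],
    "GO3": [0.3, 0.4],
    "GO4": [0.7, 0.9],
}

training_samples = label_hit_terms(known_terms, hits)
for term, features, label in training_samples:
    print(term, features, label)
```

The labelled feature vectors would then be passed to an SVM trainer; that part is omitted here.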
Figure 3. Accuracy and precision against the number of votes: The accuracy and precision values of the test data are plotted against the number of votes. Precision increased monotonically with the number of votes. Higher stringency yielded a slight lowering of the accuracy due to the rate of false negatives. The relation between the precision and the number of votes was used for assigning confidence values to new predictions.
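One way to turn the vote-precision relation into confidence values is sketched below. It assumes that a prediction is accepted when it receives at least a given number of votes from the classifiers, and that the precision measured on validated predictions at that threshold serves as the confidence. The data and both function names are hypothetical.

```python
from typing import Dict, List, Tuple


def precision_by_min_votes(validated: List[Tuple[int, bool]],
                           max_votes: int) -> Dict[int, float]:
    """Precision of validated predictions accepted with at least v votes."""
    table = {}
    for v in range(1, max_votes + 1):
        accepted = [correct for votes, correct in validated if votes >= v]
        if accepted:
            # sum() counts the True entries, i.e. the correct predictions.
            table[v] = sum(accepted) / len(accepted)
    return table


def confidence(votes: int, table: Dict[int, float]) -> float:
    """Confidence of a new prediction: the precision observed at the
    highest vote threshold that the prediction reaches (0.0 if none)."""
    eligible = [v for v in table if v <= votes]
    return table[max(eligible)] if eligible else 0.0


# Hypothetical validation results: (number of votes, prediction correct?)
validated = [(1, False), (1, True), (2, False),
             (2, True), (3, True), (3, True)]

table = precision_by_min_votes(validated, max_votes=3)
print(table)
print(confidence(2, table))
```

In this toy data set precision rises with the vote threshold, so predictions with more votes receive higher confidence values.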
Figure 4. ROC plots for the classifier's performance: ROC plots for the results of all organisms tested and the average of all test sequences. The classification performance for different classes of organisms (multi-cellular eukaryotes, single-cell eukaryotes and prokaryotes) was compared.
Figure 5. Precision against the sequence coverage: Average precision against sequence coverage for all 13 test organisms (circles). The red line denotes a fitting curve.
Figure 6. Comparison of GO slims between Xenopus, fly, yeast and mouse: Distributions of higher-level GO terms ("GO slim", see text) for Xenopus, fly, yeast and mouse. The sum of all high-level terms may exceed the total number of annotated terms, since some terms may have more than one high-level "parent" term due to multiple paths.
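The counting effect described in this caption can be reproduced with a small sketch. It assumes the ontology is stored as a mapping from each term to its direct parents and that every annotated term is counted once for each high-level term it can reach. The term names below are hypothetical placeholders, not real GO identifiers.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All terms reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def slim_counts(annotated: Iterable[str],
                parents: Dict[str, List[str]],
                slim: Set[str]) -> Counter:
    """Count annotated terms under each high-level (slim) term."""
    counts: Counter = Counter()
    for term in annotated:
        reachable = ancestors(term, parents) | {term}
        for high_level in reachable & slim:
            counts[high_level] += 1
    return counts


# Hypothetical toy ontology; "MF" is the molecular function root.
parents = {
    "binding": ["MF"],
    "catalytic_activity": ["MF"],
    "kinase_activity": ["catalytic_activity"],
    "nucleotide_binding": ["binding"],
    # Two parents: this term reaches both high-level terms.
    "atp_dependent_enzyme": ["catalytic_activity", "nucleotide_binding"],
}
slim = {"binding", "catalytic_activity"}
annotated = ["kinase_activity", "nucleotide_binding", "atp_dependent_enzyme"]

counts = slim_counts(annotated, parents, slim)
print(dict(counts), sum(counts.values()), len(annotated))
```

Because `atp_dependent_enzyme` lies on paths to both high-level terms, it is counted twice, and the sum over the slim terms is larger than the number of annotated terms.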