BMC Bioinformatics
2008 Aug 27;9:353. doi: 10.1186/1471-2105-9-353.
Identification and correction of abnormal, incomplete and mispredicted proteins in public databases.
Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L.
Abstract
Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes. Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries. 
MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.
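The five conflict types listed in the abstract can be read as simple rules over a protein's annotated features. The following is a minimal sketch of that idea and not the MisPred implementation. It assumes the domain localization labels, signal peptide call, transmembrane helix count and chromosome mapping have already been produced by upstream predictors. The class names, the function `find_conflicts`, and the 0.6 length threshold for domain integrity are all hypothetical, and the example proteins are toy data.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Domain:
    name: str
    localization: str     # assumed labels: "extracellular", "cytoplasmic", "nuclear"
    observed_length: int  # length of the domain as found in the sequence
    family_length: int    # typical length of the domain family


@dataclass
class Protein:
    identifier: str
    domains: List[Domain]
    has_signal_peptide: bool
    transmembrane_helices: int
    # Chromosomes to which parts of the sequence map (assumed to be precomputed).
    source_chromosomes: FrozenSet[str] = field(default_factory=frozenset)


# Assumed cutoff for "violation of domain integrity"; not taken from the paper.
MIN_DOMAIN_FRACTION = 0.6


def find_conflicts(protein: Protein) -> List[int]:
    """Return the numbers (1-5) of the conflict types the protein triggers."""
    locs = {d.localization for d in protein.domains}
    no_tm = protein.transmembrane_helices == 0
    conflicts = []
    # (i) extracellular domain without signal peptide or transmembrane helix
    if "extracellular" in locs and not protein.has_signal_peptide and no_tm:
        conflicts.append(1)
    # (ii) extracellular and cytoplasmic domains without a transmembrane segment
    if {"extracellular", "cytoplasmic"} <= locs and no_tm:
        conflicts.append(2)
    # (iii) co-occurrence of extracellular and nuclear domains
    if {"extracellular", "nuclear"} <= locs:
        conflicts.append(3)
    # (iv) a domain much shorter than is typical for its family
    if any(d.observed_length < MIN_DOMAIN_FRACTION * d.family_length
           for d in protein.domains):
        conflicts.append(4)
    # (v) parts of the sequence map to more than one chromosome
    if len(protein.source_chromosomes) > 1:
        conflicts.append(5)
    return conflicts


# Toy inputs with invented values, for illustration only.
fused = Protein(
    "toy_fusion",
    [Domain("ext_domain", "extracellular", 50, 50),
     Domain("nuc_domain", "nuclear", 57, 57)],
    has_signal_peptide=False,
    transmembrane_helices=0,
    source_chromosomes=frozenset({"X"}),
)
chimera = Protein(
    "toy_chimera",
    [Domain("ext_domain", "extracellular", 100, 100),
     Domain("cyt_domain", "cytoplasmic", 30, 60)],
    has_signal_peptide=True,
    transmembrane_helices=0,
    source_chromosomes=frozenset({"11", "2"}),
)

print(fused.identifier, find_conflicts(fused))
print(chimera.identifier, find_conflicts(chimera))
```

A single entry can trigger several rules at once, which is why the function returns a list rather than stopping at the first conflict.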
Figure 1. Error detected by MisPred routine for Conflict 1: the case of the Swiss-Prot entry LPLC4_HUMAN. The protein contains extracellular domains LBP_BPI_CETP and LBP_BPI_CETP_C but was found to lack both a signal peptide and transmembrane helices. The human sequence was corrected (LPLC4_HUMAN_corrected) by targeted search of the human genome with its mouse ortholog, CAM20161 [EMBL:CAM20161], which has a signal peptide. The alignment shows the N-terminal parts of LPLC4_HUMAN, CAM20161 and LPLC4_HUMAN_corrected. The predicted signal peptides of CAM20161 and LPLC4_HUMAN_corrected are in yellow and underlined.
Figure 2. Error detected by MisPred routine for Conflict 1: the case of the Swiss-Prot entry C209C_MOUSE. The protein contains an extracellular C-type lectin domain but was found to lack both a signal peptide and transmembrane helices, whereas all closely related proteins (e.g. C209A_MOUSE, C209D_MOUSE [Swiss-Prot:Q91ZX1, Q91ZW8]) are type II transmembrane proteins. The sequence of this protein was corrected by targeted search of mouse genomic and EST sequences. The alignment shows the N-terminal parts of C209C_MOUSE, C209C_MOUSE_corrected, C209A_MOUSE and C209D_MOUSE. The predicted transmembrane helices of C209C_MOUSE_corrected, C209A_MOUSE and C209D_MOUSE are in red and underlined.
Figure 3. Error detected by MisPred routine for Conflict 1: the case of the Swiss-Prot entry YL15_CAEEL. The hypothetical homeobox protein C02F12.5 [EnsEMBL: C02F12.5] predicted for chromosome X contains an extracellular Kunitz_BPTI domain but was found to lack both a signal peptide and transmembrane helices. This protein, which also contains a nuclear Homeobox domain, arose through in silico fusion of a gene related to the homeobox protein HM07_CAEEL and a gene related to the Kunitz_BPTI-containing protein CBG14258, Q619J1_CAEBR. (A) Alignment of YL15_CAEEL and Q619J1_CAEBR shows close homology only in the C-terminal region, highlighted in yellow. (B) Alignment of YL15_CAEEL_corr1 and HM07_CAEEL. (C) Alignment of YL15_CAEEL_corr2 and Q619J1_CAEBR.
Figure 4. Error detected by MisPred routine for Conflict 4: the case of the Swiss-Prot entry EPHA5_RAT. This protein contains a C-terminally truncated SAM_1 domain that deviates significantly from the normal size of this domain family. It is noteworthy that orthologs from mouse, human and chicken contain an intact SAM_1 domain. The sequence of this protein was corrected by targeted search of the rat genome using the sequences of the full-length orthologs. The alignment shows the C-terminal parts of EPHA5_RAT, EPHA5_RAT_corrected, EPHA5_MOUSE [Swiss-Prot:Q60629], EPHA5_HUMAN [Swiss-Prot:P54756] and EPHA5_CHICK [Swiss-Prot:P54755]. The region of the predicted SAM_1 domain of EPHA5_RAT_corrected that is absent in EPHA5_RAT is underlined and highlighted in yellow.
Figure 5. Error detected by MisPred routine for Conflict 5: the case of the protein Q9NXI4_HUMAN. The cDNA of this hypothetical protein FLJ20227, cloned from colon mucosa, is derived from a chimera of two genes located on chromosome 11 and chromosome 2. The N-terminal part of the protein (underlined and highlighted in yellow) is derived from the gene encoding the PR domain zinc finger protein 10, PRD10_HUMAN (A); the C-terminal part of the protein (underlined and highlighted in blue) is derived from the gene encoding liver fatty acid-binding protein, FABPL_HUMAN (B).
Figure 6. Error detected by MisPred routine for Conflict 2. ENSXETP00000040601 of Xenopus tropicalis corresponds to the frog ortholog of Ephrin receptor A7, but lacks a typical transmembrane helix between its extracellular FN3 and cytoplasmic Pkinase domains. The mispredicted sequence was corrected by identifying the missing transmembrane sequence using frog EST sequences such as EL820950 [GenBank:EL820950]. The alignment shows the regions containing the transmembrane helices of Gallus gallus Ephrin receptor A7 [RefSeq:NP_990414], ENSXETP00000040601 and ENSXETP00000040601_corrected. The predicted transmembrane helices of NP_990414 and ENSXETP00000040601_corrected are in red and underlined; the mispredicted region of ENSXETP00000040601 is in italics.