Abstract
BACKGROUND: Non-sequence gene data (images, literature, etc.) can be found in many different public databases. Access to these data is mostly by text-based methods using gene names; however, gene annotation is neither complete nor fully systematic between organisms, and is generally not stable over time. This creates challenges for text-based access, especially for cross-species searches. We propose a method for non-sequence data retrieval based on sequence similarity, which removes dependence on annotation and text searches. This work was motivated by the need to provide better access to large numbers of in situ images, and by the observation that such image data were usually associated with a specific gene sequence. Sequence similarity searches are found in existing gene-oriented databases, but mostly give only indirect access to non-sequence data via navigational links.
RESULTS: Three applications were built to explore the proposed method, accessing image data, literature and gene names respectively. Searches are initiated with the sequence of the user's gene of interest, which is searched against a database of sequences associated with the target data. The matching (non-sequence) target data are returned directly to the user's browser, organised by sequence similarity. The method worked well for the intended application in image data management, and comparison with text-based searches of the image data set showed the accuracy of the method. Applied to literature searches, it facilitated retrieval of mostly high-relevance references. Applied to gene name data, it provided a useful analysis of name variation of related genes within and between species.
CONCLUSION: This method is a powerful and useful addition to existing methods for searching gene data based on text retrieval or curated gene lists. In particular, the method facilitates cross-species comparisons and enables the handling of novel or otherwise unannotated genes. Applications using the method are quick and easy to build, and the data require little maintenance. This approach largely circumvents the need for annotation, which can be a major obstacle to the development of genome-scale data resources.
Figure 1. Generic application logic used in indirect sequence similarity search for gene data. (1) The user pastes a gene sequence into the browser window and sends it to the search engine; (2) the gene sequence is searched with BLAST against the database of sequences associated with the gene data; (3) IDs of matching sequences are returned to the search engine; (4) the matching sequence IDs are used to query the local managing database for available gene data; (5) a list of matching gene data and descriptive text is returned to the search engine; (6) an HTML-formatted page containing the retrieved gene data and descriptive text is returned to the user's browser.
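The six steps in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a shared k-mer count stands in for the BLAST search, two in-memory dictionaries stand in for the sequence database and the local managing database, and all sequence IDs, sequences, file names and function names are hypothetical.

```python
from html import escape

# Hypothetical stand-in for the database of sequences associated with gene data.
SEQUENCE_DB = {
    "seqA": "ATGGAGATGGTAGATAGCTGCCAC",
    "seqB": "ATGGAGATGGTAGACAGCTGTCAC",
    "seqC": "TTTTCCCCGGGGAAAATTTTCCCC",
}

# Hypothetical stand-in for the local managing database: sequence ID -> gene data.
GENE_DATA = {
    "seqA": [{"description": "in situ image, stage 12", "file": "a1.jpg"}],
    "seqB": [{"description": "in situ image, stage 13", "file": "b1.jpg"}],
}


def kmers(seq, k=8):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity_search(query, db, k=8, min_shared=1):
    """Steps 2-3: return (sequence ID, score) pairs, best match first.

    Assumption: the shared k-mer count is only a toy placeholder for BLAST.
    """
    query_kmers = kmers(query.upper(), k)
    hits = []
    for seq_id, seq in db.items():
        shared = len(query_kmers & kmers(seq, k))
        if shared >= min_shared:
            hits.append((seq_id, shared))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def lookup_gene_data(hits, gene_data):
    """Steps 4-5: fetch the gene data attached to each matching sequence ID."""
    return [(seq_id, score, gene_data[seq_id])
            for seq_id, score in hits if seq_id in gene_data]


def render_html(results):
    """Step 6: format the retrieved data as an HTML list, in similarity order."""
    items = []
    for seq_id, score, records in results:
        for rec in records:
            items.append(
                f"<li>{escape(seq_id)} (score {score}): "
                f"{escape(rec['description'])} [{escape(rec['file'])}]</li>"
            )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


def search(query):
    """Step 1 onwards: take the user's query sequence and return an HTML page body."""
    hits = similarity_search(query, SEQUENCE_DB)
    return render_html(lookup_gene_data(hits, GENE_DATA))
```

Calling `search(SEQUENCE_DB["seqA"])` lists the data for `seqA` ahead of the data for the related `seqB`, and omits the unrelated `seqC`. The point of the design is that the gene data are keyed only by sequence ID, so no gene name or annotation is consulted at any step.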
Figure 2. Example output of quickImage. The query sequence was X. tropicalis myf5, used to retrieve image data for this and related genes. The upper panel shows alignment and similarity between the query sequence and the matching image source sequences. The first three sets of retrieved images are shown; for each set, the accession number of the image source sequence and the best BLAST matches against human, mouse and Xenopus proteins are provided for identification purposes, as well as the originating image collection and species. Images marked A and B show highly similar expression of myf5 in the two frog species at the same developmental stage. The image marked C shows a similar expression pattern for the related gene myod/myf3 at a slightly later stage.
Figure 3. Example output of quickLit. The query sequence was X. tropicalis brachyury, used to retrieve literature references for this and related genes. The retrieved references are shown for the first few matching sequences. The retrieved data show a high degree of apparent relevance, as indicated by the title of each paper, and clear organisation of references by species. Reference summaries and associated sequence data were downloaded from NCBI GenBank and various model organism databases.
Figure 4. Example output of quickGene. The query sequence was X. tropicalis brachyury, used to search gene name data from Entrez Gene. Note the variable nature of the retrieved gene names for this set of related genes.