Author affiliations:
1. Department of Computer Science, College of Science, Purdue University, West Lafayette, IN USA 47907
2. Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN USA 47907
3. Markey Center for Structural Biology, College of Science, Purdue University, West Lafayette, IN USA 47907
*corresponding author
email: dkihara@purdue.edu
web: http://kiharalab.org
High-throughput sequencing has increased the availability of newly sequenced genomes, not all of which can be studied experimentally to determine the function of individual genes. Automated function prediction techniques are widely used to help researchers obtain high-confidence computational function predictions for these genes. These techniques require a well-built underlying functional vocabulary that captures biological knowledge and represents cellular functions as a set of structured terms. Here we work with one such vocabulary, the Gene Ontology (GO), to address the problem of capturing the functional similarity between terms in the vocabulary. This term-level similarity is then used to build a scoring scheme that measures the similarity between a pair of proteins and, ultimately, within a set of proteins. We have developed two similarity scores based on the strength of association between terms, according to knowledge extracted from the PubMed and Gene Ontology Annotation databases. The scores are shown to accurately separate biologically relevant groups of proteins from random ones and to have good discriminative power for detecting correct interacting pairs of proteins.
This page provides links to the supplementary figures, the datasets used in the above analysis, and the results obtained. Please send us an email if you have any questions. A Readme file describing the data formats can be found here.
For each of 101 yeast KEGG pathways, the Biological Process (BP) GO annotations assigned to the proteins in the pathway were counted. This file lists the pathway name, the number of proteins in the pathway, and the number of unique GO BP annotations.
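
The per-pathway counting described above can be sketched in Python as follows. This is only an illustrative sketch, not the script used to produce the file: the function name `summarize_pathways`, the input dictionaries, and the toy pathway names, protein names, and GO identifiers are all assumptions made for the example. It also assumes that a protein with no BP annotation still counts toward the number of proteins in its pathway.

```python
from typing import Dict, Iterable, List, Set, Tuple


def summarize_pathways(
    pathway_proteins: Dict[str, Iterable[str]],
    protein_bp_terms: Dict[str, Set[str]],
) -> List[Tuple[str, int, int]]:
    """Return (pathway name, number of proteins, number of unique GO BP terms)."""
    rows = []
    for name in sorted(pathway_proteins):
        # Count each protein once, even if it is listed twice.
        proteins = set(pathway_proteins[name])
        # Union of the BP terms annotated to any protein in the pathway.
        unique_terms: Set[str] = set()
        for protein in proteins:
            unique_terms |= protein_bp_terms.get(protein, set())
        rows.append((name, len(proteins), len(unique_terms)))
    return rows


# Hypothetical toy input; names and GO identifiers are placeholders.
toy_pathways = {
    "toy_pathway_A": ["P1", "P2", "P3"],
    "toy_pathway_B": ["P2", "P4"],
}
toy_annotations = {
    "P1": {"GO:0000001", "GO:0000002"},
    "P2": {"GO:0000002"},
    "P3": {"GO:0000003"},
    # P4 has no BP annotation in this toy example.
}

for name, n_proteins, n_terms in summarize_pathways(toy_pathways, toy_annotations):
    print(f"{name}\t{n_proteins}\t{n_terms}")
```

Because the terms are collected in a set, an annotation shared by several proteins in the same pathway is counted only once, which matches the "unique GO BP annotations" column.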