Quantification of protein group coherence and pathway assignment using functional association

Authors: Meghana Chitale ¹, Shriphani Palakodety ¹, Daisuke Kihara ¹,²,³,*

Author affiliations:
1. Department of Computer Science, College of Science, Purdue University, West Lafayette, IN USA 47907
2. Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN USA 47907
3. Markey Center for Structural Biology, College of Science, Purdue University, West Lafayette, IN USA 47907
*corresponding author
email:dkihara@purdue.edu
web:http://kiharalab.org

High-throughput sequencing has increased the availability of newly sequenced genomes, not all of which can be studied experimentally to determine the function of individual genes. Automated function prediction techniques are widely used to help researchers obtain high-confidence computational function predictions for these genes. These techniques require a well-built underlying functional vocabulary that captures biological knowledge and represents cellular functions as a set of structured terms. Here we work with one such vocabulary, the Gene Ontology (GO), to address the problem of capturing the functional similarity between terms in the vocabulary. This similarity is then used to build a scoring scheme that measures the similarity between a pair of proteins and, ultimately, within a set of proteins. We have developed two similarity scores based on the strength of association between terms, according to knowledge extracted from the PubMed and Gene Ontology Annotation databases. The scores are shown to accurately separate biologically relevant groups of proteins from random ones and to have good discriminative power for detecting correct interacting pairs of proteins.

This page provides links to the supplementary figures, the datasets used in the above analysis, and the results obtained. Please send us an email if you have any questions. A Readme file describing the data formats can be found here.

Supplementary Figures

Figure S1: Coherence score distributions for Random sets

Figure S2: Coherence score distributions for Pathway sets

Figure S3: Coherence score distributions for Protein complex sets

Figure S4: Coherence score distributions for GOcc sets

Supplementary Data

Supplementary file 1: Analysis of GO Biological Process (BP) annotations of proteins in KEGG yeast pathways

For each of the 101 KEGG yeast pathways, the GO Biological Process (BP) annotations assigned to the proteins in the pathway were counted. This file lists the pathway name, the number of proteins in the pathway, and the number of unique GO BP annotations.
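
The per-pathway counts described above could be computed along the following lines. This is only a minimal sketch and not the script used to produce the supplementary file: the function name `summarize_pathways`, the two input dictionaries, and all pathway, protein, and term identifiers are made-up placeholders, and the actual file format is described in the Readme.

```python
def summarize_pathways(pathway_members, protein_bp_terms):
    """Return (pathway, n_proteins, n_unique_bp_terms) rows, sorted by pathway name.

    pathway_members:  dict mapping pathway name -> list of protein IDs (assumed input)
    protein_bp_terms: dict mapping protein ID -> set of BP term IDs (assumed input)
    """
    rows = []
    for pathway in sorted(pathway_members):
        proteins = set(pathway_members[pathway])
        unique_terms = set()
        for protein in proteins:
            # Proteins without any BP annotation contribute nothing.
            unique_terms.update(protein_bp_terms.get(protein, ()))
        rows.append((pathway, len(proteins), len(unique_terms)))
    return rows


# Toy, invented data for illustration only (not taken from KEGG or GO).
pathway_members = {
    "pathway_A": ["P1", "P2", "P3"],
    "pathway_B": ["P3", "P4"],
}
protein_bp_terms = {
    "P1": {"BP_term_1", "BP_term_2"},
    "P2": {"BP_term_2"},
    "P3": {"BP_term_3"},
    # P4 deliberately has no BP annotation.
}

rows = summarize_pathways(pathway_members, protein_bp_terms)
for name, n_proteins, n_terms in rows:
    print(f"{name}\t{n_proteins}\t{n_terms}")
```

A term shared by several proteins in the same pathway is counted once, which is what "unique" is taken to mean in this sketch.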