Quantification of protein group coherence and pathway assignment using functional association

Authors: Meghana Chitale 1, Shriphani Palakodety 1, Daisuke Kihara 1, 2, 3,*


Author affiliations:
1. Department of Computer Science, College of Science, Purdue University, West Lafayette, IN USA 47907
2. Department of Biological Sciences, College of Science, Purdue University, West Lafayette, IN USA 47907
3. Markey Center for Structural Biology, College of Science, Purdue University, West Lafayette, IN USA 47907
*corresponding author
email:dkihara@purdue.edu
web:http://kiharalab.org

High-throughput sequencing has increased the availability of newly sequenced genomes, not all of which can be studied experimentally to determine the function of individual genes. Automated function prediction techniques are widely used to help researchers obtain high-confidence computational function predictions for these genes. These techniques require a well-built underlying functional vocabulary that captures biological knowledge and represents cellular functions as a set of structured terms. Here we work with one such vocabulary, the Gene Ontology (GO), to address the problem of capturing the functional similarity between terms in the vocabulary. This similarity is then used to build a scoring scheme that measures the similarity between a pair of proteins and, ultimately, within a set of proteins. We have developed two similarity scores based on the strength of association between terms, according to knowledge extracted from the PubMed and Gene Ontology Annotation databases. The scores are shown to accurately separate biologically relevant groups of proteins from random ones and to have good discriminative power for detecting correct interacting pairs of proteins.
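To make the term-to-pair-to-set scoring idea concrete, the following is a minimal Python sketch. It is not the definition of the scores used in the paper: the association measure (a Jaccard-style co-occurrence ratio over annotations), the averaging rules, and all function names and toy identifiers are assumptions chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def association_table(annotations):
    """Build a term-term association function from co-annotation counts.

    `annotations` maps a protein ID to a set of GO term IDs.
    ASSUMPTION: association = co-occurrence / union (Jaccard-style);
    this is an illustrative stand-in, not the paper's score.
    """
    term_count = Counter()
    pair_count = Counter()
    for terms in annotations.values():
        term_count.update(terms)
        pair_count.update(frozenset(p) for p in combinations(sorted(terms), 2))

    def assoc(a, b):
        if a == b:
            return 1.0 if term_count[a] else 0.0
        both = pair_count[frozenset((a, b))]
        either = term_count[a] + term_count[b] - both
        return both / either if either else 0.0

    return assoc


def protein_similarity(p, q, annotations, assoc):
    """Average term-term association over all term pairs of two proteins."""
    tp, tq = annotations.get(p, set()), annotations.get(q, set())
    if not tp or not tq:
        return 0.0
    return sum(assoc(a, b) for a in tp for b in tq) / (len(tp) * len(tq))


def group_coherence(proteins, annotations, assoc):
    """Average pairwise protein similarity within a protein set."""
    pairs = list(combinations(proteins, 2))
    if not pairs:
        return 0.0
    total = sum(protein_similarity(p, q, annotations, assoc) for p, q in pairs)
    return total / len(pairs)
```

With hypothetical annotations such as `{"P1": {"GO:A", "GO:B"}, "P2": {"GO:A", "GO:B"}, "P3": {"GO:C"}}`, a set of proteins sharing strongly associated terms scores higher than a set mixing unrelated terms, which is the property a coherence score needs in order to separate biologically relevant groups from random ones.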

This page provides links to the supplementary figures, the datasets used in the paper's analysis, and the results obtained. Please send us an email if you have any questions. A README file describing the data formats can be found here.


Supplementary Figures

  • Figure S1: Coherence score distributions for Random sets
  • Figure S2: Coherence score distributions for Pathway sets
  • Figure S3: Coherence score distributions for Protein complex sets
  • Figure S4: Coherence score distributions for GOcc sets

Supplementary Data

  • Supplementary file1: Analysis of GO Biological Process (BP) annotations of proteins in KEGG yeast pathways
    For each of the 101 yeast KEGG pathways, the Biological Process (BP) GO annotations assigned to the proteins in the pathway are counted. The file lists the pathway name, the number of proteins in the pathway, and the number of unique GO BP annotations.
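The per-pathway counts described for this file could be produced along the following lines. This is a minimal sketch, assuming pathway membership and BP annotations are already loaded into dictionaries; the function name, argument names, and input layout are hypothetical and do not describe the actual file format or the script used by the authors.

```python
def summarize_pathways(pathway_members, bp_annotations):
    """Return (pathway name, number of proteins, number of unique GO BP terms).

    pathway_members: dict mapping pathway name -> iterable of protein IDs
    bp_annotations:  dict mapping protein ID -> set of GO BP term IDs
    Both layouts are assumptions made for this example.
    """
    rows = []
    for name, proteins in sorted(pathway_members.items()):
        members = set(proteins)  # ignore duplicate protein entries
        unique_terms = set()
        for protein in members:
            # proteins without BP annotations contribute no terms
            unique_terms.update(bp_annotations.get(protein, ()))
        rows.append((name, len(members), len(unique_terms)))
    return rows
```

Each returned tuple corresponds to one row of such a summary: the pathway name, its protein count, and the number of distinct BP terms annotated to any of its proteins.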

Datasets

(Refer to the Materials section of the paper for more details on how the datasets were prepared.)

Protein groups analyzed

  • Yeast KEGG Pathway sets
  • Yeast Protein complex sets
  • Yeast GOcc (Gene Ontology cellular component based) sets
  • Yeast Random protein sets

Protein-Protein Interactions (PPIs)

  • Yeast PPIs from BioGRID
  • Human PPIs from BioGRID

Result sets

(Refer to the Methods section of the paper for more details on the computational techniques used here.)

Coherence computation results for the above datasets using various techniques

  • Yeast KEGG Pathway sets results
    1. CAS_coherence results
    2. PAS_coherence results
    3. funsim_coherence results
    4. Chagoyen_coherence results
    5. Pandey_coherence results
  • Yeast Protein complex sets results
    1. CAS_coherence results
    2. PAS_coherence results
    3. funsim_coherence results
    4. Chagoyen_coherence results
    5. Pandey_coherence results
  • Yeast GOcc sets results
    1. CAS_coherence results
    2. PAS_coherence results
    3. funsim_coherence results
    4. Chagoyen_coherence results
    5. Pandey_coherence results
  • Yeast Random sets results
    1. CAS_coherence results
    2. PAS_coherence results
    3. funsim_coherence results
    4. Chagoyen_coherence results
    5. Pandey_coherence results

Interacting pair similarity results using various techniques

  • Yeast PPIs results
    1. CAS_sim results
    2. PAS_sim results
    3. funsim results
    4. Chagoyen_sim results
    5. Pandey_sim results
  • Human PPIs results
    1. CAS_sim results
    2. PAS_sim results
    3. funsim results
    4. Chagoyen_sim results
    5. Pandey_sim results