B213-207Dom
Domain boundaries prediction using a multi-layered neural network
E. Tapia1, Y.H. Tan1 and D. Kihara2,1
1 – Dept. of Computer Science, Purdue University, 2 – Dept. of Biological Sciences, Purdue University, West Lafayette, IN, USA
dkihara@purdue.edu
Understanding the domain
organization of a protein is crucial for the structural determination of large
proteins using techniques with an inherent size limitation. To predict domain
boundaries in CASP6 targets, we implemented a multi-layered artificial neural
network, which essentially uses only the sequence information of the target. We
also compared the results with outputs from different prediction servers6,7.
The architecture of the neural network approach can be divided into three
levels: (1) a multi-layered neural network1-3 that assigns boundary predictions
using sequence information, (2) a second neural network that refines the output
obtained from the first one, and (3) a statistical approach to combine the
output from different networks and different databases.
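As an illustration of the third level, the sketch below combines per-residue boundary scores from several networks by simple averaging and then picks well-separated peaks. The abstract does not specify the statistical procedure, so the averaging rule, the function names `combine_scores` and `call_boundaries`, and the `threshold` and `min_separation` parameters are all assumptions made for this example.

```python
from statistics import mean


def combine_scores(per_network_scores):
    """Average per-residue boundary scores across networks.

    per_network_scores: list of equal-length score lists, one per network.
    Simple averaging is an assumption; the abstract only says "statistical".
    """
    lengths = {len(scores) for scores in per_network_scores}
    if len(lengths) != 1:
        raise ValueError("all networks must score the same sequence length")
    return [mean(column) for column in zip(*per_network_scores)]


def call_boundaries(scores, threshold=0.5, min_separation=20):
    """Return residue indices predicted as domain boundaries.

    A residue is a candidate if its score reaches the threshold and is a
    local maximum. Candidates are accepted from highest score downwards,
    skipping any that lie within min_separation residues of an accepted one.
    Both parameter defaults are hypothetical.
    """
    last = len(scores) - 1
    candidates = [
        i for i, s in enumerate(scores)
        if s >= threshold
        and (i == 0 or s >= scores[i - 1])
        and (i == last or s >= scores[i + 1])
    ]
    chosen = []
    for i in sorted(candidates, key=lambda idx: -scores[idx]):
        if all(abs(i - j) >= min_separation for j in chosen):
            chosen.append(i)
    return sorted(chosen)
```

In this sketch, each inner list would hold the refined output of one level-2 network for the same target, so networks trained on different databases can be pooled in a single call to `combine_scores`.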
A fully connected neural network
with 11 input groups (optimal window size 11) was designed for the first level
in our architecture. Each input group consists of 24 units for each residue in
the window and two extra units to store values relative to the whole window.
This neural network was trained using several types of information obtained
from the sequence: (1) a multiple sequence alignment obtained using PSI-BLAST9 on the target sequence;
(2) secondary structure prediction obtained using the PSIPRED8 prediction server; (3) the average
Kyte-Doolittle hydrophobicity index of a window; and (4) the domain delineation
index4, which distinguishes regions with a high concentration of N-
and C-termini of aligned homologous sequences in the multiple sequence
alignment. For training and testing purposes we extracted 600 proteins from the
SCOP database5 with a uniform distribution among families and
subfamilies. We randomly distributed these sequences into 10 different
databases and trained several networks for each database. The final level in
our architecture merges all the information from the different networks and
databases to obtain an optimal domain boundary prediction.
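The windowed hydrophobicity input (3) can be sketched as follows. The table holds the standard Kyte-Doolittle hydropathy values; the function name `window_hydrophobicity`, the truncation of the window at the sequence termini, and the skipping of non-standard residues are assumptions, since the abstract does not describe these details.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydrophobicity(sequence, window=11):
    """Return the mean Kyte-Doolittle value of the window centred on each residue.

    The window is truncated at the N- and C-termini (assumption), and
    residues missing from the table (e.g. 'X') are skipped (assumption).
    A window containing no scorable residue yields 0.0.
    """
    half = window // 2
    averages = []
    for i in range(len(sequence)):
        start = max(0, i - half)
        end = min(len(sequence), i + half + 1)
        values = [KYTE_DOOLITTLE[aa] for aa in sequence[start:end]
                  if aa in KYTE_DOOLITTLE]
        averages.append(sum(values) / len(values) if values else 0.0)
    return averages
```

The default `window=11` matches the window size reported for the first-level network; the function returns one value per residue, which could fill one of the extra per-window input units.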
The results of our work show that using a multi-layered network improves
performance compared with a single network. The method is especially effective
when several networks are trained on each of the different databases and their
results are combined. Further improvement could be expected by incorporating
additional information, such as the average number of domains with respect to
sequence length or the average distance of the domain boundaries from the N-
and C-termini.
1. Baldi, P. & Brunak, S. (2001) Bioinformatics: The Machine Learning Approach, 2nd edition. MIT Press.
2. Krogh, A. & Vedelsby, J. (1995) Neural network ensembles, cross validation, and active learning. In Tesauro, G., Touretzky, D. & Leen, T. (eds.) NIPS 7. The MIT Press, pp. 231-238.
3. Baldi, P., Brunak, S., Chauvin, Y. & Nielsen, H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412-424.
4. George, R.A. & Heringa, J. (2002) Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins, 48, 672-681.
5. Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-540.
6. Marsden, R.L., McGuffin, L.J. & Jones, D.T. (2002) Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Science, 11, 2814-2824.
7. Suyama, M. & Ohara, O. (2003) DomCut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics, 19, 673-674.
8. McGuffin, L.J., Bryson, K. & Jones, D.T. (2000) The PSIPRED protein structure prediction server. Bioinformatics, 16, 404-405.
9. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.