What is GenPortrait

Algorithm

GenPortrait is designed to view the "portrait of a genome". Prominent fractal-like patterns are observed in these portraits, and they are specific to each genome. The pattern of a genome is quite different from that of a random sequence, and similar species show similar patterns. The method counts the frequencies of short DNA sequences of length n (oligo-nucleotides) in an input genome and stores them in a 2D matrix. The matrix can then be visualized in gray scale or in a color scale.

For example, when the oligo-nucleotide length is set to 2, the frequencies of AA, AC, AG, AT, ..., and TT are counted by shifting a window of size 2 along the sequence. This results in a total of 16 counts, each of which is stored in a 2D matrix in the following way:

AA AC CA CC
AT AG CT CG
TA TC GA GC
TT TG GT GG
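
The layout above can be read as a recursive quadrant scheme: each base selects one quadrant (A top-left, C top-right, T bottom-left, G bottom-right), with the first base choosing the coarsest quadrant. The following is a minimal sketch of that reading, not the server's actual code. The names `QUADRANT`, `kmer_cell`, and `portrait` are hypothetical, and the sketch assumes that the window advances one base at a time, that longer oligo-nucleotides follow the same recursive pattern, and that windows containing characters other than A, C, G, T are skipped.

```python
# (row bit, column bit) for each base, read off the 4x4 layout above.
QUADRANT = {"A": (0, 0), "C": (0, 1), "T": (1, 0), "G": (1, 1)}


def kmer_cell(kmer):
    """Return the (row, column) of an oligo-nucleotide in the portrait matrix."""
    row = col = 0
    for base in kmer:
        r, c = QUADRANT[base]
        # Earlier bases pick larger quadrants, so they become the higher bits.
        row = (row << 1) | r
        col = (col << 1) | c
    return row, col


def portrait(seq, k):
    """Count every k-length window of seq into a 2^k x 2^k matrix."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are ignored.
        if all(base in QUADRANT for base in kmer):
            row, col = kmer_cell(kmer)
            matrix[row][col] += 1
    return matrix
```

With k = 2, `kmer_cell("CA")` gives row 0, column 2 and `kmer_cell("TG")` gives row 3, column 1, which matches the table above.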

Five color scales are available to visualize the matrix: JET, HSV, COOL, SPRING, and gray scale.

Scheme    Mapping colors
GRAY
JET
HSV
COOL
SPRING
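
As an illustration of the gray scale scheme only, the sketch below maps a count matrix to 0-255 gray levels and writes it as a plain-text PGM image. How the server actually scales counts to colors is not stated here, so the linear scaling against the largest count is an assumption, and `to_gray` and `to_pgm` are hypothetical names.

```python
def to_gray(matrix):
    """Scale counts linearly to integer gray levels in the range 0..255."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        # An all-zero matrix maps to an all-black image.
        return [[0] * len(row) for row in matrix]
    # Assumption: linear scaling, with the largest count mapped to 255.
    return [[255 * count // peak for count in row] for row in matrix]


def to_pgm(matrix):
    """Render the matrix as a plain-text (P2) PGM image string."""
    gray = to_gray(matrix)
    header = f"P2\n{len(gray[0])} {len(gray)}\n255\n"
    body = "\n".join(" ".join(str(v) for v in row) for row in gray)
    return header + body + "\n"
```

The returned string can be written to a `.pgm` file and opened in most image viewers.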

The pictures below are portraits of E. coli (128*128) with length = 7 (2^7 = 128). The portrait is generated in four color scales.
Please go to Examples to see more examples.

HSV    JET    COOL    SPRING

Comparison of two portraits
The comparison is based on the sum of the differences in the frequency of each oligo-nucleotide. First, the counts of oligo-nucleotides in each portrait are normalized by dividing them by the average count in the portrait. Next, the absolute value of the difference between the normalized counts of the same oligo-nucleotide in the two portraits is computed, and all of these values are summed. The distance may differ slightly with different oligo-nucleotide lengths.
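
The two steps described above can be sketched as follows. This is a minimal illustration rather than the server's implementation. The function names `normalize` and `portrait_distance` are hypothetical, and the sketch assumes both portraits are square count matrices built with the same oligo-nucleotide length.

```python
def normalize(matrix):
    """Divide every count by the average count of the portrait."""
    cells = [count for row in matrix for count in row]
    mean = sum(cells) / len(cells)
    if mean == 0:
        raise ValueError("portrait has no counts to normalize")
    return [count / mean for count in cells]


def portrait_distance(a, b):
    """Sum of absolute differences between normalized counts."""
    na, nb = normalize(a), normalize(b)
    if len(na) != len(nb):
        raise ValueError("portraits must use the same oligo-nucleotide length")
    return sum(abs(x - y) for x, y in zip(na, nb))
```

Because each portrait is divided by its own average, two genomes with the same relative oligo-nucleotide usage but different lengths have a distance of 0.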

GenPortrait Database
We have a database of portraits of genomes. The genome sequences are RefSeq sequences specified in the KEGG organism list. They are downloaded from the NCBI FTP site. Only files named NC_*.fna (most of which are complete genomes) are included. RNA sequences and sequences shorter than 30 kb are omitted. Currently there are 618 genomes. You can execute a search against this database with your input sequence.

Multiple sequences in a file
If an input file contains multiple FASTA-format sequences, the frequencies of oligo-nucleotides are counted for each sequence and the server still generates portraits. Note that the sequences are not concatenated. This may be useful for capturing the characteristics of a population of fragment sequences (which may be taken from a metagenomics project).
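
The practical effect of not concatenating is that no counting window spans the boundary between two sequences. The sketch below illustrates this with a small FASTA reader. The names `read_fasta` and `pooled_kmer_counts` are hypothetical, and pooling the per-sequence counts into a single table is an assumption made for the example, not a statement about how the server combines them.

```python
from collections import Counter


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def pooled_kmer_counts(text, k):
    """Count k-length windows within each record, never across records."""
    counts = Counter()
    for _, seq in read_fasta(text):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts
```

For a file holding the two records AC and GT, counting with k = 2 yields AC and GT once each, and CG is never counted because it would only exist if the records were joined.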

What you can do

Tutorial

If you don't have a nucleotide sequence to analyze, you can select a sequence from Downloads and download it to your local disk. Then upload it from Home to execute GenPortrait.
Or you can see examples of the portraits in our database under the Examples menu. You can further query the selected portrait against the database to find genome sequences with a similar portrait.

Links

Genome sequences can be downloaded from the following sources: