Supplementary material for the papers:
Jianjun Hu, Bin Li, and Daisuke Kihara. (2005) Limitations and Potentials of Current Motif Discovery Algorithms, Nucleic Acids Res. 2005; 33(15): 4899–4913
Jianjun Hu, Yifeng David Yang, and Daisuke Kihara. (2006) EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences, BMC Bioinformatics 2006; 7: 342
1. E. coli
Data | Description | Note |
(local cache) obtained from RegulonDB Database | | |
gene information of E. coli | | |
complete E. coli | | |
Separate files for each motif group compiled from RegulonDB | uncompress |
2. ECRDB70 data sets
Data | Description | Note |
70 motif groups screened out of RegulonDB. Some of the records will be skipped when generating input sequence data sets | | |
A | | |
Some statistics of the ECRDB70 motifs | |
3. Input sequence data sets
Please refer to the paper for the procedures used to generate the following input sequence data sets from ECRDB70.
Data | Description | Note |
input sequences extracted from intergenic regions in which the motifs in ECRDB70 are located. | | |
training sequences | | |
training sequences | | |
training sequences | | |
training sequences | | |
training sequences | | |
training sequences | | |
training sequences | | |
training sequences | | |
training sequences with margin sizes of 20, 50, 100, 200, 300, 400, 500, and 800 on both sides of motifs (8 data sets) | Redundant input sequences were removed, and motif groups left with only one input sequence after this processing were removed as well, so just 61 motif groups remain in each data set. | |
sequence files of motif groups | |
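The margin-based data sets in the table above can be illustrated with a minimal sketch. It is not the authors' pipeline (the paper describes the actual procedure). It assumes each motif site is given as a hypothetical `(region, site_start, site_end)` tuple, that margins are clipped at the boundaries of the intergenic region, and that "redundant" means exact duplicate sequences. The function names `extract_with_margin` and `build_dataset` are illustrative only.

```python
from collections import OrderedDict


def extract_with_margin(region, site_start, site_end, margin):
    """Return the motif site plus up to `margin` bases on each side.

    Assumption: the margin is clipped at the ends of the intergenic region.
    """
    left = max(0, site_start - margin)
    right = min(len(region), site_end + margin)
    return region[left:right]


def build_dataset(groups, margin):
    """Build one data set for a given margin size.

    groups: {group_name: [(region, site_start, site_end), ...]} (hypothetical layout)
    Returns {group_name: [unique sequences]} and drops groups that are left
    with a single input sequence after duplicate removal.
    """
    dataset = {}
    for name, sites in groups.items():
        seqs = [extract_with_margin(r, s, e, margin) for r, s, e in sites]
        # Remove exact duplicates while keeping the original order.
        unique = list(OrderedDict.fromkeys(seqs))
        if len(unique) > 1:
            dataset[name] = unique
    return dataset
```

Calling `build_dataset(groups, m)` once for each margin size `m` in 20, 50, 100, 200, 300, 400, 500, 800 would yield eight data sets of the kind listed above, under the stated assumptions.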
4. Background sequences
Two types of background models are generated based on:
1) The whole E. coli
2) All the sequence segments located in the intergenic regions of E. coli
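As a rough illustration of how a background model might be derived from either sequence source, the sketch below counts k-mer frequencies over a list of sequences. This is an assumption about the general form of such models, not the format required by any particular motif discovery tool or the exact procedure used for the benchmark. The name `kmer_frequencies` is hypothetical.

```python
from collections import Counter


def kmer_frequencies(sequences, k):
    """Return relative frequencies of all k-mers over A, C, G, T.

    k-mers containing other symbols (e.g. N) are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

Passing the whole genome as a single sequence would give the first type of model, and passing the list of intergenic segments would give the second.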
5. Parameter settings for benchmark experiments and the minimal-parameter-tuning guideline
According to our minimal-parameter-tuning guide
Supplementary material for paper:
Jianjun Hu, Yifeng D. Yang, and Daisuke Kihara. (2006) EMD: An Ensemble Algorithm for Discovering Regulatory Motifs in DNA Sequences, (submitted to BMC Bioinformatics)
3. Input sequence data sets with different margins generated from ECRDB70
Data | Description | Note |
ECRDB61C-X | training sequences with margin sizes of 20, 50, 100, 200, 300, 400, 500, and 800 on both sides of motifs (8 data sets) | Modified from the ECRDB61B-X data sets: the margin sequences are artificially shuffled while preserving the di-mer nucleotide frequency of intergenic regions of the E. coli genome |
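One possible way to produce margins that follow the di-mer statistics of intergenic regions is to sample them from a first-order Markov chain trained on intergenic sequences. The sketch below shows that idea only. It is a swapped-in technique, not necessarily the shuffling procedure used to build ECRDB61C-X, and it preserves di-mer frequencies in expectation rather than exactly. All function names and the `(seq, site_start, site_end)` layout are hypothetical.

```python
import random
from collections import defaultdict


def train_transitions(sequences):
    """Count base-to-base transitions (di-mers) in the training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: dict(nxt) for a, nxt in counts.items()}


def sample_margin(counts, length, rng):
    """Sample `length` bases from the first-order Markov chain in `counts`."""
    if length <= 0:
        return ""
    if not counts:
        raise ValueError("no transitions to sample from")
    starts = sorted(counts)
    current = rng.choice(starts)
    out = [current]
    while len(out) < length:
        nxt = counts.get(current)
        if not nxt:
            # Base never seen with a successor: restart from a random state.
            current = rng.choice(starts)
        else:
            bases = sorted(nxt)
            weights = [nxt[b] for b in bases]
            current = rng.choices(bases, weights=weights, k=1)[0]
        out.append(current)
    return "".join(out)


def replace_margins(seq, site_start, site_end, counts, rng):
    """Keep the motif site, replace both margins with sampled sequence."""
    left = sample_margin(counts, site_start, rng)
    right = sample_margin(counts, len(seq) - site_end, rng)
    return left + seq[site_start:site_end] + right
```

Passing a seeded `random.Random` instance as `rng` keeps the generated margins reproducible. A strict shuffle that preserves exact di-mer counts would need a different algorithm.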
5. Parameter settings for benchmark experiments and the minimal-parameter-tuning guideline
Contact Information:
Lilly Bld. B235
Department of Biological Sciences
Tel: 765-494-2744
Email: hujianju@purdue.edu