Supplementary material for the papers:
Jianjun Hu, Bin Li, and Daisuke Kihara. (2005) Limitations and Potentials of Current Motif Discovery Algorithms, Nucleic Acids Res. 2005; 33(15): 4899–4913
Jianjun Hu, Yifeng David Yang, and Daisuke Kihara. (2006) EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences, BMC Bioinformatics 2006; 7: 342
1. E. coli
 
| Data | Description | Note |
| | (local cache) obtained from RegulonDB Database | |
| | gene information of E. coli | |
| | complete E. coli | |
| | Separate files for each motif group compiled from RegulonDB | uncompress |
 
2. ECRDB70 data sets
 
| Data | Description | Note |
| | 70 motif groups screened out of RegulonDB. Some of the records will be skipped when generating input sequence data sets | |
| | A | |
| | Some statistics of the ECRDB70 motifs | |
 
3. Input sequence data sets 
Please refer to the paper for the procedure used to generate the following input sequence data sets from ECRDB70.
 
| Data | Description | Note |
| | input sequences extracted from intergenic regions in which the motifs in ECRDB70 are located. | |
| | training sequences | |
| | training sequences | |
| | training sequences | |
| | training sequences | |
| | training sequences | |
| | training sequences | |
| | training sequences | |
| | training sequences | |
| | training sequences with margin size 20, 50, 100, 200, 300, 400, 500, 800 on both sides of motifs (8 data sets) | Redundant input sequences were removed, and motif groups left with just one input sequence after this processing were removed too. So there are just 61 motif groups left in each data set | 
| | sequence files of motif groups | |
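
The margin and redundancy-removal steps described in the table can be sketched roughly as follows. This is only an illustration, not the procedure used in the paper: the function names, the `motif_groups` layout, the 0-based half-open coordinates, and the reading of "redundant" as "exact duplicate string" are all assumptions.

```python
# Margin sizes listed for the data sets above.
MARGINS = (20, 50, 100, 200, 300, 400, 500, 800)


def extract_with_margin(genome, start, end, margin):
    """Return the site genome[start:end] plus `margin` bases on each side.

    Coordinates are 0-based and half-open (an assumption of this sketch);
    the window is clipped at the ends of the genome string.
    """
    left = max(0, start - margin)
    right = min(len(genome), end + margin)
    return genome[left:right]


def build_dataset(genome, motif_groups, margin):
    """Build one data set: group name -> list of non-redundant sequences.

    `motif_groups` (hypothetical layout) maps a group name to a list of
    (start, end) site coordinates. "Redundant" is taken here to mean an
    exact duplicate string, which may be narrower than the paper's criterion.
    """
    dataset = {}
    for name, sites in motif_groups.items():
        seqs = [extract_with_margin(genome, s, e, margin) for s, e in sites]
        unique = list(dict.fromkeys(seqs))  # drop duplicates, keep first-seen order
        if len(unique) > 1:  # groups left with a single sequence are discarded
            dataset[name] = unique
    return dataset


# Toy example (made-up genome and coordinates, margin of 2 for readability).
toy_genome = "ACGTACGTAAGGCCTT"
toy_groups = {"g1": [(4, 8), (4, 8), (10, 12)], "g2": [(2, 4), (2, 4)]}
toy_dataset = build_dataset(toy_genome, toy_groups, margin=2)
```

In the toy run, group `g2` collapses to a single sequence and is dropped, which mirrors how the number of motif groups can shrink after filtering. Looping `build_dataset` over `MARGINS` would give one data set per margin size.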
 
4. Background sequences
Two types of background models are generated based on:
1) The whole E. coli genome
2) All the sequence segments located in the intergenic regions of E. coli
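
This page does not state the form of the background models. As one plausible illustration, the sketch below estimates a fixed-order Markov model (conditional base probabilities given the preceding bases) from a set of sequences. The function name, the `order` parameter, and the toy input are assumptions; the same function would be called once with the whole genome and once with the intergenic segments.

```python
from collections import Counter, defaultdict


def markov_background(sequences, order=1, alphabet="ACGT"):
    """Estimate P(next base | previous `order` bases) from the sequences.

    Returns a dict: context string -> {base: probability}. Positions whose
    context or base contains a symbol outside `alphabet` are skipped.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in alphabet and all(c in alphabet for c in context):
                counts[context][base] += 1

    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        model[context] = {b: following[b] / total for b in alphabet}
    return model


# Toy input standing in for genome or intergenic sequences.
toy_model = markov_background(["ACGT", "ACGA"], order=1)
toy_base_freqs = markov_background(["ACGT", "ACGA"], order=0)
```

With `order=0` the single context is the empty string, so the model reduces to plain base frequencies.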
5. Parameter settings for benchmark experiments and the minimal-parameter-tuning guide
| According to our minimal-parameter-tuning guide |
 
Supplementary material for paper:
Jianjun Hu, Yifeng D. Yang, and Daisuke Kihara. (2006) EMD: An Ensemble Algorithm for Discovering Regulatory Motifs in DNA Sequences (submitted to BMC Bioinformatics)
3. Input sequence data sets with different margins generated from ECRDB70
| Data | Description | Note | 
| ECRDB61C-X | training sequences with margin size of 20, 50, 100, 200, 300, 400, 500, 800 on both sides of motifs (8 data sets) | Modified from the ECRDB61B-X data sets: the margin sequences are artificially shuffled while preserving the dinucleotide (di-mer) frequency of intergenic regions of the E. coli genome | 
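
The exact shuffling procedure is not given here. As a stand-in, the sketch below replaces the margins around a motif with bases sampled from a first-order Markov chain fitted to intergenic sequences, so the new margins match the intergenic dinucleotide frequencies in expectation rather than exactly. All function names and the toy inputs are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_dimer_chain(sequences):
    """Count dinucleotide transitions: first base -> Counter of next bases."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            transitions[a][b] += 1
    return transitions


def sample_sequence(transitions, length, rng):
    """Sample `length` bases from the fitted first-order chain."""
    if length == 0:
        return ""
    if not transitions:
        raise ValueError("no dinucleotides were observed")
    # The first base is drawn in proportion to how often it starts a dimer.
    starts = {a: sum(nxt.values()) for a, nxt in transitions.items()}
    bases = [rng.choices(list(starts), weights=list(starts.values()))[0]]
    while len(bases) < length:
        nxt = transitions.get(bases[-1])
        if not nxt:  # base never seen as the first half of a dimer
            nxt = starts
        bases.append(rng.choices(list(nxt), weights=list(nxt.values()))[0])
    return "".join(bases)


def randomize_margins(seq, motif_start, motif_end, transitions, rng):
    """Keep seq[motif_start:motif_end] and resample both margins."""
    left = sample_sequence(transitions, motif_start, rng)
    right = sample_sequence(transitions, len(seq) - motif_end, rng)
    return left + seq[motif_start:motif_end] + right


# Toy example: a made-up "intergenic" string and a made-up training sequence.
rng = random.Random(0)
chain = fit_dimer_chain(["ACGTACGTTTGACA"])
shuffled = randomize_margins("AAAAATTGACAGGGGG", 5, 11, chain, rng)
```

A shuffle that preserves dinucleotide counts exactly within each margin would need a different algorithm; this version only keeps the motif in place and draws the margins from intergenic statistics.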
5. Parameter settings for benchmark experiments and the minimal-parameter-tuning guideline
Contact Information:
Lilly Bld. B235
Department of Biological Sciences
Tel: 765-494-2744
Email: hujianju@purdue.edu