Supplementary Material for DNA motif search papers

Supplementary material for the papers:

Jianjun Hu, Bin Li, and Daisuke Kihara. (2005) Limitations and Potentials of Current Motif Discovery Algorithms, Nucleic Acids Res. 2005; 33(15): 4899-4913

Jianjun Hu, Yifeng David Yang, and Daisuke Kihara. (2006) EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences, BMC Bioinformatics 2006; 7: 342

1. E. coli genome data sets

Data	Description	Note
RegulonDB	Local cache obtained from the RegulonDB database
ecoli.genes	Gene information for E. coli
ecoli.genome	Complete E. coli genome sequence
ecoli.motifs.zip	Separate files for each motif group compiled from RegulonDB	Uncompress with unzip under Linux

2. ECRDB70 data sets

Data	Description	Note
ECRDB70.txt	70 motif groups selected from RegulonDB. Some of the records are skipped when generating the input sequence data sets
ECRDB70.list	A list of motif groups in ECRDB with their motif widths and other information
ECRDB70.stat	Some statistics of the ECRDB70 motifs

3. Input sequence data sets with different margins generated from ECRDB70

Please refer to the paper for the procedure used to generate the following input sequence data sets from ECRDB70.

Data	Description	Note
ECRDB62A	input sequences extracted from intergenic regions in which the motifs in ECRDB70 are located.
ECRDB70B-20	training sequences with margin size of 20 on both sides of motifs
ECRDB70B-50	training sequences with margin size of 50 on both sides of motifs
ECRDB70B-100	training sequences with margin size of 100 on both sides of motifs
ECRDB70B-200	training sequences with margin size of 200 on both sides of motifs
ECRDB70B-300	training sequences with margin size of 300 on both sides of motifs
ECRDB70B-400	training sequences with margin size of 400 on both sides of motifs
ECRDB70B-500	training sequences with margin size of 500 on both sides of motifs
ECRDB70B-800	training sequences with margin size of 800 on both sides of motifs
ECRDB61B-all	training sequences with margin sizes of 20, 50, 100, 200, 300, 400, 500, and 800 on both sides of motifs (8 data sets)	Redundant input sequences were removed, and motif groups left with only one input sequence after this step were removed as well, so 61 motif groups remain in each data set
resampling	sequence files of motif groups with at least 40 sequences, used for benchmarking how the number of sequences affects prediction performance
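
The margin-based data sets above can be illustrated with a minimal Python sketch. It is not the authors' procedure (the paper describes that); it only shows the general idea of cutting a site out of the genome with a fixed margin on both sides, dropping redundant sequences, and discarding motif groups left with a single sequence. The function names, the 0-based end-exclusive coordinates, and the choice to treat "redundant" as "exact duplicate" are all assumptions made for illustration.

```python
def extract_with_margin(genome, start, end, margin):
    """Return the site genome[start:end] plus up to `margin` bases on each side.

    Coordinates are assumed to be 0-based and end-exclusive. The window is
    clipped at the sequence ends rather than wrapped around.
    """
    left = max(0, start - margin)
    right = min(len(genome), end + margin)
    return genome[left:right]


def build_dataset(genome, groups, margin):
    """Build one margin-specific data set.

    `groups` maps a motif-group name to a list of (start, end) site positions.
    Exact duplicate sequences are dropped (an assumed reading of "redundant"),
    and groups with only one remaining sequence are removed.
    """
    dataset = {}
    for name, sites in groups.items():
        seqs = [extract_with_margin(genome, s, e, margin) for s, e in sites]
        unique = list(dict.fromkeys(seqs))  # drop duplicates, keep order
        if len(unique) > 1:
            dataset[name] = unique
    return dataset
```

Calling build_dataset once per margin size (20, 50, ..., 800) would give one data set per margin, mirroring the eight ECRDB61B data sets listed above.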

4. Background sequences

Two types of background models are generated based on:
1) The whole E. coli genome sequence: Download
2) All the sequence segments located in the intergenic regions of the E. coli genome: Download. This file is generated from the E. coli genome and the E. coli gene information. It includes intergenic segments from both strands of the E. coli genome.
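
As a rough illustration of the second background set, the following Python sketch collects the stretches of a genome not covered by any gene and adds their reverse complements to represent both strands. It is a sketch under stated assumptions, not the script used to produce the downloadable file: the gene list is assumed to be (start, end) pairs in 0-based, end-exclusive coordinates, the genome is treated as linear (the circular E. coli chromosome would also need the wrap-around segment), and the names intergenic_segments and min_len are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]


def intergenic_segments(genome, genes, min_len=1):
    """Return intergenic segments from both strands.

    `genes` is an iterable of (start, end) pairs, 0-based and end-exclusive,
    in any order; overlapping genes are handled. Forward-strand segments come
    first, followed by their reverse complements.
    """
    segments = []
    pos = 0  # end of the furthest-reaching gene seen so far
    for start, end in sorted(genes):
        if start - pos >= min_len:
            segments.append(genome[pos:start])
        pos = max(pos, end)
    if len(genome) - pos >= min_len:
        segments.append(genome[pos:])
    return segments + [reverse_complement(s) for s in segments]
```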

5. Parameter settings for benchmark experiments and the minimal-parameter-tuning guideline

Following our minimal-parameter-tuning guideline, we list all the major running parameters of the five motif discovery programs used in our experiments: AlignACE, BioProspector, MDScan, MEME, and MotifSampler. Most of the parameters are left unset or use the default settings. Check the parameters here.

Supplementary material for the paper:

Jianjun Hu, Yifeng D. Yang, and Daisuke Kihara. (2006) EMD: An Ensemble Algorithm for Discovering Regulatory Motifs in DNA Sequences, (submitted to BMC Bioinformatics)

1. E. coli genome data sets

2. ECRDB70 data sets

3. Input sequence data sets with different margins generated from ECRDB70

Data	Description	Note
ECRDB61C-X	training sequences with margin sizes of 20, 50, 100, 200, 300, 400, 500, and 800 on both sides of motifs (8 data sets)	Modified from the ECRDB61B-X data sets: the margin sequences are artificially shuffled while preserving the dinucleotide (di-mer) frequency of the intergenic regions of the E. coli genome
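
One way to approximate margins that follow the dinucleotide statistics of intergenic regions is to sample them from a first-order Markov chain estimated from intergenic sequences. The Python sketch below shows that idea only; it is an assumed stand-in, not the shuffling procedure actually used to build ECRDB61C-X. The function names are hypothetical, the training sequences are assumed to be non-empty, and each input sequence is assumed to carry a full margin of the given size on both sides of the motif.

```python
import random
from collections import defaultdict


def dimer_transitions(sequences):
    """Count dinucleotides as counts[first_base][second_base]."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts


def sample_margin(counts, length, rng):
    """Sample a sequence of `length` bases from the first-order Markov chain."""
    if length <= 0:
        return ""
    starts = list(counts)
    weights = [sum(counts[a].values()) for a in starts]
    seq = [rng.choices(starts, weights)[0]]
    while len(seq) < length:
        nxt = counts.get(seq[-1])
        if not nxt:
            # Base never seen as the first of a dimer: redraw from the marginal.
            seq.append(rng.choices(starts, weights)[0])
            continue
        bases = list(nxt)
        seq.append(rng.choices(bases, [nxt[b] for b in bases])[0])
    return "".join(seq)


def replace_margins(seq, margin, counts, rng):
    """Keep the central motif region and regenerate both flanking margins."""
    core = seq[margin:len(seq) - margin]
    left = sample_margin(counts, margin, rng)
    right = sample_margin(counts, margin, rng)
    return left + core + right
```

Passing a seeded random.Random instance as rng makes the generated margins reproducible. A strict dinucleotide-preserving shuffle of each individual margin would be a different technique from the sampling shown here.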

4. Background sequences

5. Parameter settings for benchmark experiments and the minimal-parameter-tuning guideline

Contact Information:

Lilly Bld. B235

Department of Biological Sciences

Purdue University

West Lafayette, IN, 47906

Tel: 765-494-2744

Email: hujianju@purdue.edu

dkihara@purdue.edu

yang41@purdue.edu