A Vocabulon Study of E. coli Regulatory Sites with Feedback to Expression Array Analysis

Chiara Sabatti, Lars Rohlin, Kenneth Lange, and James Liao

The identification of binding sites for regulatory proteins in the upstream region of genes is an important step toward understanding transcription regulation. In recent years, novel experimental techniques, such as gene expression arrays, and the availability of entire genome sequences have made more detailed investigations in this domain possible. Traditionally, the reconstruction of the profile of a binding site and the localization of all its occurrences in a sequence are treated as separate problems. The first is tackled using a small group of sequences known or suspected to contain the binding site, with neither its position nor its pattern known. One successful approach to this reconstruction problem is based on a probabilistic model of the sequence, represented as a concatenation of background and motif stochastic words. Maximum likelihood or maximum a posteriori estimates are obtained with EM or Gibbs sampler algorithms [13, 14].
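To make the "concatenation of stochastic words" idea concrete, the following is a minimal sketch and not the implementation used in the cited work. It assumes a deliberately simplified dictionary: single-letter background words and one fixed-width motif word described by a position-specific probability matrix. The names `segmentation_likelihood`, `UNIFORM_BACKGROUND`, `TOY_MOTIF`, and the parameter `p_motif` are illustrative assumptions. A forward recursion sums the probability of every way of cutting the sequence into such words; this is the quantity an EM or Gibbs sampler procedure would work with when estimating the word parameters.

```python
from typing import Dict, List

# Hypothetical toy parameters, chosen only for illustration.
UNIFORM_BACKGROUND: Dict[str, float] = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
TOY_MOTIF: List[Dict[str, float]] = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},  # motif position 1 always emits A
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},  # motif position 2 always emits C
]


def segmentation_likelihood(
    seq: str,
    background: Dict[str, float],
    motif: List[Dict[str, float]],
    p_motif: float,
) -> float:
    """Sum, over all segmentations of seq into words, of the segmentation probability.

    Assumed model: each word is independently a single background letter
    (probability 1 - p_motif) or a motif word of width len(motif)
    (probability p_motif).
    """
    width = len(motif)
    p_background = 1.0 - p_motif
    # forward[i] is the summed probability of all segmentations of seq[:i].
    forward = [0.0] * (len(seq) + 1)
    forward[0] = 1.0
    for i in range(1, len(seq) + 1):
        # Case 1: the last word is a single background letter.
        total = p_background * background[seq[i - 1]] * forward[i - 1]
        # Case 2: the last word is a motif occurrence ending at position i.
        if i >= width:
            word_prob = 1.0
            for j in range(width):
                word_prob *= motif[j][seq[i - width + j]]
            total += p_motif * word_prob * forward[i - width]
        forward[i] = total
    return forward[len(seq)]


if __name__ == "__main__":
    print(segmentation_likelihood("AC", UNIFORM_BACKGROUND, TOY_MOTIF, p_motif=0.1))
```

In this toy setting the sequence "AC" can be read either as two background letters or as a single motif word, and the recursion adds the two contributions. For long sequences a practical version would work in log space or rescale `forward` to avoid numerical underflow.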

The second problem is approached by considering one or multiple sequences of variable length; the pattern characterizing the motif is assumed known. Possible locations are identified on the basis of scoring functions that highlight the similarity of the motif to portions of the sequence. Cutoff values for such similarity scores are hard to determine: ad hoc solutions or estimates from a training set are often adopted [17, 18]. Typically these techniques are used to scan one sequence of interest against a database of known binding sites. While there are historical and practical reasons to consider these two problems as separate, the current post-genomic era, in which we are confronted with an abundance of sequence data, calls for a different approach. Consider the problem, tackled in [18], of identifying all the binding sites of the known regulatory proteins in the genome of E. coli. While formally similar to blasting a small sequence of interest against a database of known regulatory proteins, this genome-wide search differs in substantial ways. On the one hand, as one scans through the genome for binding sites of LexA

2003-09-01