A Hidden Markov Model for Analyzing ChIP-chip Experiments on Genome Tiling Arrays and its Application to p53 Binding Sequences

Wei Li, Clifford A. Meyer, X. Shirley Liu*

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, Boston, MA 02115, USA

Abstract

Motivation: Transcription factors (TFs) regulate gene expression by recognizing and binding to specific regulatory regions on the genome, which in higher eukaryotes can lie far from the regulated genes. Affymetrix recently developed high-density oligonucleotide arrays that tile all the non-repetitive sequences of the human genome at 35-bp resolution. This new array platform allows unbiased mapping of in vivo TF binding sequences (TFBSs) using Chromatin ImmunoPrecipitation followed by microarray experiments (ChIP-chip). The massive data sets generated from these experiments pose great challenges for data analysis.

Results: We developed a fast, scalable and sensitive method to extract TFBSs from ChIP-chip experiments on genome tiling arrays. Our method takes advantage of tiling array data from many experiments to normalize and model the behavior of each individual probe, and identifies TFBSs using a Hidden Markov Model (HMM). When applied to the data of p53 ChIP-chip experiments (Cawley et al., 2004), our method discovered many new high-confidence p53 targets, including all the regions verified by quantitative PCR. Using the de novo motif finding algorithm MDscan (Liu et al., 2002), we also recovered the p53 motif from our HMM-identified p53 target regions. Furthermore, we found substantial p53 motif enrichment in these regions compared with both the genomic background and the TFBSs identified by Cawley et al. (2004). Several of the newly identified p53 TFBSs are in the promoter regions of known genes or are associated with previously characterized p53-responsive genes.

Contact: xsliu@jimmy.harvard.edu

Supplementary materials:

Table 1. List of p53 TFBSs identified by the Hidden Markov Model and by Cawley et al. (2004)

* score = (average HMM enrichment score) * 50 + 500
In UCSC BED format, if the track line useScore attribute is set to 1 for this annotation data set, the score value (between 0 and 1000) determines the level of gray in which the feature is displayed (higher numbers = darker gray). The level of gray is the same for all score values of 1000 or above.
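The score conversion in the footnote above can be sketched in a few lines of Python. This is a minimal illustration, not the HMM.Tiling code: the function names `bed_score` and `bed_line` are hypothetical, and rounding to an integer and clamping to the 0 to 1000 range are assumptions made so that the result is a valid BED score.

```python
def bed_score(avg_enrichment: float) -> int:
    """Map an average HMM enrichment score to a BED display score."""
    # Formula from the table footnote: score = average * 50 + 500.
    raw = avg_enrichment * 50 + 500
    # Assumption: round to an integer and clamp to 0..1000, the range
    # that the BED score field allows.
    return max(0, min(1000, round(raw)))


def bed_line(chrom: str, start: int, end: int, name: str,
             avg_enrichment: float) -> str:
    """Format one tab-separated BED record (chrom, start, end, name, score)."""
    fields = [chrom, str(start), str(end), name,
              str(bed_score(avg_enrichment))]
    return "\t".join(fields)
```

Under these assumptions an average enrichment score of 0 maps to 500, and any average of 10 or more maps to 1000, so all such regions are drawn in the same darkest gray.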

# Interspersed repeats were masked as 'N', tandem repeats were masked as 'n' or 'N'.

Download HMM.Tiling source code