A Hidden Markov Model for Analyzing ChIP-chip Experiments on Genome Tiling Arrays
and its Application to p53 Binding Sequences
Wei Li, Clifford A. Meyer, X. Shirley Liu*
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, Boston, MA 02115, USA
Abstract
Motivation: Transcription factors (TFs) regulate gene expression by recognizing and binding to specific regulatory regions on the genome, which in higher eukaryotes can occur far away from the regulated genes. Affymetrix recently developed high-density oligonucleotide arrays that tile all the non-repetitive sequences of the human genome at 35-bp resolution. This new array platform allows for the unbiased mapping of in vivo TF binding sequences (TFBSs) using Chromatin ImmunoPrecipitation followed by microarray experiments (ChIP-chip). The massive data sets generated from these experiments pose great challenges for data analysis.
Results: We developed a fast, scalable and sensitive method to extract TFBSs from ChIP-chip experiments on genome tiling arrays. Our method takes advantage of tiling array data from many experiments to normalize and model the behavior of each individual probe, and identifies TFBSs using a Hidden Markov Model (HMM). When applied to the data of p53 ChIP-chip experiments (Cawley et al., 2004), our method discovered many new high-confidence p53 targets, including all the regions verified by quantitative PCR. Using the de novo motif finding algorithm MDscan (Liu et al., 2002), we also recovered the p53 motif from our HMM-identified p53 target regions. Furthermore, we found substantial p53 motif enrichment in these regions compared with both the genomic background and the TFBSs identified by Cawley et al. (2004). Several of the newly identified p53 TFBSs are in the promoter regions of known genes or are associated with previously characterized p53-responsive genes.
Contact: xsliu@jimmy.harvard.edu
Supplementary materials:
Table 1. List of p53 TFBSs identified by the Hidden Markov Model and by Cawley et al. (2004)
Download data in UCSC BED format*
Fully repeat-masked sequences:
p53.HMM-High-Confidence#
p53.HMM-Full#
* score = (average HMM enrichment score) * 50 + 500
In UCSC BED format, if the track line useScore attribute is set to 1 for this annotation data set, the score value (between 0 and 1000) will determine the level of gray in which this feature is displayed (higher numbers = darker gray). The level of gray is the same for all score values ≥ 1000.
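As an illustration of the score formula above, the following minimal sketch converts an average HMM enrichment score into a BED score and formats a five-column BED line. The function names, the rounding to an integer, the clamping to the 0 to 1000 range, and the example coordinates are all assumptions made for illustration; they are not taken from the HMM.Tiling source code.

```python
def bed_score(avg_enrichment: float) -> int:
    """Apply score = (average HMM enrichment score) * 50 + 500.

    Assumption: the result is rounded to an integer and clamped to
    0..1000, the range UCSC uses for shading when useScore=1.
    """
    raw = avg_enrichment * 50 + 500
    return max(0, min(1000, round(raw)))


def bed_line(chrom: str, start: int, end: int, name: str,
             avg_enrichment: float) -> str:
    """Format one tab-separated BED line: chrom, start, end, name, score."""
    fields = [chrom, str(start), str(end), name, str(bed_score(avg_enrichment))]
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical region and enrichment value, for illustration only.
    print(bed_line("chr21", 100, 200, "site1", 2.0))
```

Under these assumptions, an enrichment score of 0 maps to the middle of the gray scale (500), and any enrichment score of 10 or more is displayed at the darkest gray.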
# Interspersed repeats were masked as 'N'; tandem repeats were masked as 'n' or 'N'.
Download HMM.Tiling source code