Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions http://genome.cshlp.org/content/23/5/777
Extension http://www.nature.com/nature/journal/v489/n7414/full/nature11232.html
DNase I hypersensitive site • DNase I hypersensitive sites (DHSs) are markers of regulatory DNA • DHS mapping enables discovery of all classes of cis-regulatory elements, including enhancers, promoters, insulators, silencers, and locus control regions.
Motivation • Understand transcriptional regulation • Full account of regulatory elements • Genomic locations • Cell-type specificity • Identity of the factors that bind them • Target genes
Previous work • Target genes of regulatory elements (REs) • Chromatin conformation capture (3C) and its derivatives detect long-range chromatin loops • 3D chromatin information is locus and cell-type specific, and resolution is poor • Heuristics • Assign each element to the nearest gene, with the search bounded by gene boundaries • Mapping methods • Correlations between expression and other genomic features to enable distal linking • This work • Explores linking REs to genes using DNase I and matched gene expression data
Overview • Input: 2.7 million DNase I hypersensitive sites from 72 cell types, plus gene expression data • Chromatin and expression signal correlation corresponds with known long-range interactions • Clustering using a self-organizing map: 1856 clusters • Variation in CpG-island, promoter, and conserved element overlap • Classification using a logistic classifier to predict cell-type lineage with 43 DHS inputs • Relations with transcription factors: motif discovery, JASPAR motif database
Part 1 DHSs cluster cell types by biological similarity
DHSs cluster cell types by biological similarity • 2.7 million DHSs from 125 samples • 112 samples with both DNase-seq and expression data • 72 unique cell types and 15 unique tissue lineages • 1856 unique clusters from a self-organizing map (SOM) trained on the DHS data • 50x50 grid • Similar clusters merged
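Sketch (not the authors' code): a self-organizing map assigns each DHS profile, meaning its signal across cell types, to the grid node whose weight vector is closest. The minimal pure-Python version below assumes Euclidean distance, a linearly decaying learning rate, and a Gaussian neighborhood. The names `train_som` and `best_matching_unit`, the 3x3 grid, and the toy profiles are illustrative assumptions; the 50x50 grid and the cluster-merging step from the slide are not reproduced.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))


def train_som(profiles, rows, cols, epochs=20, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(profiles[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    n_steps = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        order = profiles[:]
        rng.shuffle(order)
        for x in order:
            frac = step / n_steps
            lr = 0.5 * (1.0 - frac)                              # decaying learning rate
            radius = max(rows, cols) / 2 * (1.0 - frac) + 0.5    # shrinking neighborhood
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))            # Gaussian neighborhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Toy DHS profiles: signal in three hypothetical cell types.
profiles = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
som = train_som(profiles, rows=3, cols=3)
assignments = [best_matching_unit(som, p) for p in profiles]
```

Each entry of `assignments` is the grid node (cluster) for one DHS; counting DHSs per node gives the "square size" used in the cluster figure.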
Cluster color: combination of cell types in which the associated DHSs have high signal in the detailed profile. Square size: # of DHSs assigned
Multi-cell-type clusters • Distant lineage relationships • Reuse of regulatory elements • Transformation related to cancer progression • A limit in the resolution of the SOM
Part 2 SOM clusters capture variation in CpG-island, promoter, and conserved element overlap
SOM clusters capture variation in CpG-island, promoter, and conserved element overlap • Annotated each SOM cluster of REs w.r.t. overlap with • Promoters • CpG islands • Evolutionarily conserved elements
Distribution of conservation, promoters, and CpG islands across clusters Top 100 DHSs in that cluster (ranked by nearness to the cluster center)
Distribution of distance to the transcription start site (TSS) of the nearest gene • DNase I signal profiles of five example clusters, showing the distribution of distance to the transcription start site (TSS) of the nearest gene. • Cluster 99 is promoter rich. • Cluster 1259 is preferentially located in an early intron. • Cluster 199 is highly conserved, but not associated with promoters or CpG islands. • Cluster 881 is primarily distal, with no regions within 500 bp of a TSS.
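Sketch of how a per-cluster distance-to-TSS distribution could be computed. It assumes a single chromosome, strand is ignored, and each DHS is represented by its midpoint; the coordinates and the names `distance_to_nearest_tss` and `fraction_proximal` are hypothetical. The 500 bp window is the threshold mentioned for cluster 881.

```python
import bisect

PROMOTER_WINDOW_BP = 500  # window quoted on the slide for cluster 881


def distance_to_nearest_tss(position, sorted_tss):
    """Absolute distance (bp) from a DHS midpoint to the closest TSS."""
    i = bisect.bisect_left(sorted_tss, position)
    neighbors = sorted_tss[max(0, i - 1): i + 1]   # TSS just left and just right
    return min(abs(position - t) for t in neighbors)


def fraction_proximal(dhs_positions, sorted_tss, window=PROMOTER_WINDOW_BP):
    """Fraction of DHSs lying within `window` bp of any TSS."""
    near = sum(1 for p in dhs_positions
               if distance_to_nearest_tss(p, sorted_tss) <= window)
    return near / len(dhs_positions)


tss = [1000, 5000, 20000]                  # hypothetical, sorted TSS coordinates
cluster_dhs = [1200, 12000, 4900, 25000]   # hypothetical DHS midpoints in one cluster
distances = [distance_to_nearest_tss(p, tss) for p in cluster_dhs]
proximal = fraction_proximal(cluster_dhs, tss)
```

A promoter-rich cluster would give a `proximal` value near 1, while a primarily distal cluster would give a value near 0.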
Distribution of the distance from DHSs to TSS varies Top 100 DHSs in that cluster (ranked by nearness to the cluster center)
Part 3 A logistic classifier predicts cell-type lineage with few DHS inputs
A logistic classifier predicts cell-type lineage with few DHS inputs • Some REs are highly specific to certain cell types, so a subset of elements could be used as molecular markers. • Build a multinomial logistic classifier that assigns a probability to each of multiple classes (tissue lineages) • Each cell type is first assigned to one of the 15 primary tissue types based on biological knowledge • Remove all malignant cell types • Restrict the model to the seven tissue types containing at least four samples each, resulting in a training set of 80 samples across 7 classes.
Feature Selection • Assumption: SOM cluster patterns are good candidates for differentiating lineages • Used an initial feature set consisting of 1856 DHSs • One from each cluster: the DHS most similar to the cluster's average profile • Result: the trained classifier assigns the correct tissue lineage with the highest probability (>80% accuracy) in leave-one-out cross-validation. • Only 43 DHSs with high tissue specificity are used as features (a minimal set) to predict tissue identity
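Sketch of the classification setup: multinomial logistic (softmax) regression evaluated by leave-one-out cross-validation. This is a minimal pure-Python illustration, not the authors' implementation. Batch gradient descent, the learning rate, the L2 penalty, and the toy data (three hypothetical lineages, three marker DHSs) are all assumptions, and the reduction from 1856 candidate DHSs to 43 features is not reproduced.

```python
import math


def softmax(scores):
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, xi):
    xb = xi + [1.0]                              # append bias term
    return softmax([sum(w * v for w, v in zip(Wk, xb)) for Wk in W])


def train_softmax(X, y, n_classes, lr=0.5, epochs=300, l2=0.01):
    """Fit one weight row per class (last column is the bias) by gradient descent."""
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            p = predict_proba(W, xi)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == yi else 0.0)
                for j, v in enumerate(xi + [1.0]):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_features + 1):
                W[k][j] -= lr * (grad[k][j] / len(X) + l2 * W[k][j])
    return W


def leave_one_out_accuracy(X, y, n_classes):
    """Hold out each sample once, train on the rest, and score the held-out call."""
    correct = 0
    for i in range(len(X)):
        W = train_softmax(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_classes)
        p = predict_proba(W, X[i])
        correct += int(p.index(max(p)) == y[i])
    return correct / len(X)


# Toy data: rows are samples, columns are DNase I signal at three marker DHSs.
X = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.8, 0.2, 0.1], [1.0, 0.0, 0.2],
     [0.1, 1.0, 0.0], [0.0, 0.9, 0.1], [0.2, 0.8, 0.0], [0.1, 1.0, 0.2],
     [0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [0.2, 0.1, 0.8], [0.0, 0.2, 1.0]]
y = [0] * 4 + [1] * 4 + [2] * 4                  # hypothetical lineage labels
accuracy = leave_one_out_accuracy(X, y, n_classes=3)
probs = predict_proba(train_softmax(X, y, 3), [0.9, 0.1, 0.0])
```

`probs` is the per-lineage probability vector for a new sample; the predicted lineage is the class with the highest probability, which is how the ">80% accuracy" figure above is scored.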
Classification Results • Training data • Presumed origin • Without presumed origin • Sex classifier
Classification Result - Training data • Samples from blood and stem cells were never misclassified.
Classification Result – Unseen data • Classifying the malignant samples as well as the five primary cell types left out of the training model • Figure label: presumed tissue of origin
Classification Result – Unseen data • Glioblastoma, like astrocytes, originates from glial cells. • Cancer progression results in an epithelial-like pattern.
Classification Result – Unseen data • The K562 leukemia cell line is only weakly associated with multiple lineages (Pr ≤ 30%) • Possible explanation: K562 resembles undifferentiated red blood cells, whereas the model was built using white blood cells.
Part 4 DHS clusters are enriched for known and novel transcription factor motifs
DHS clusters and TF motif discovery • Goal: use the clusters to find groups of sites with similar activity profiles, which may indicate commonly bound transcription factors (TFs). • Used de novo motif discovery to identify enriched motifs and then assigned motifs to specific factors based on the JASPAR (Portales-Casamar et al. 2010) motif database. • 1279 (69%) clusters had at least one significant motif • 918 (49%) clusters had a motif that could be assigned a factor from a database • Counted per motif rather than per cluster, 1807 significantly enriched motifs were found (some clusters have multiple motifs), of which 1099 (61%) could be assigned a factor.
Some highly cell-type-specific clusters are enriched for motifs known to be important for those cell types. • Clusters commonly enriched in a specific cell type did not necessarily share similar motifs, indicating that clusters could discern subtle differences in patterns. • Example motifs shown: TCCAC, CANNTG, ATW • Unassigned motifs: poorly characterized or unknown TFs not yet present in JASPAR, or a complex of TFs
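Sketch of scanning sequences for a consensus such as CANNTG or ATW, written in IUPAC codes. This is only a pattern scan, not de novo motif discovery, and it is not the tool used in the paper. It counts overlapping matches on the forward strand only; the example sequences and the names `iupac_to_regex` and `count_hits` are made up for illustration.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Compile a consensus into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")


def count_hits(consensus, sequence):
    """Number of (possibly overlapping) forward-strand matches in one sequence."""
    return sum(1 for _ in iupac_to_regex(consensus).finditer(sequence.upper()))


ebox_hits = count_hits("CANNTG", "GGCAGCTGTTCACGTGAA")   # hypothetical DHS sequence
atw_hits = count_hits("ATW", "ATAATT")
```

Comparing hit counts in a cluster's DHSs against a background set is one simple way to ask whether a known consensus is over-represented; a real analysis would also scan the reverse strand and use position weight matrices.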
Part 5 Motif discovery in similar hematopoietic clusters reveals subtle motif differences
Variations in IRF-like motifs in hematopoietic clusters • Detected IRF1/IRF2/SPI1-like motifs predominantly in clusters specific to hematopoietic cell lineages • Variation in DNase I signal intensity among • LCLs • B cell leukemia (CLL) • T cells (CD4, Jurkat, and Th) • megakaryocytes (CMK) • erythroleukemia (K562)
Possible explanation • Slight variations in the motifs accompany differences in DNase I signal across hematopoietic cell types. • Differences between IRF and SPI1 binding • Different cofactors that modulate an IRF's binding preference • Distinct IRFs in specific hematopoietic lineages. • These motif variations likely represent biological differences in motif preference rather than statistical noise, because in other cases (e.g., CTCF) the authors see less variation among discovered motifs across clusters. • Similar patterns also appear in an independent set of regions from the same clusters.
Part 6 Motif discovery results are consistent with experimental ChIP data
Motif discovery results are consistent with experimental ChIP data • ChIP data from the ENCODE project to validate discovered motifs • Using representative DHSs from each cluster with enriched motifs, we compared overlap with ChIP peaks from 43 experiments. • Incongruence in overlap between motif and ChIP results • ChIP data come from only a subset of cell types included in the motif analysis. • For example, we compared ChIP results for a single IRF from just three cell types, while our motif analysis considered 14 hematopoietic lineages. Without ChIP data for all cell types, we expect to find many instances of a positive motif result without a corresponding ChIP signal. • Additionally, ChIP reports signal at indirectly bound sites where a motif would not.
Probably due to its cross-cell-type consistency • IRFs, SPI1 and RUNX1 are coregulating hematopoietic lineages (Huang et al. 2008). • SP1 is a general, promoter-enriched factor with many interacting partners (Kaczynski et al. 2003). • There is good correspondence (Mann-Whitney P-values between 10^-5 and 10^-133) between motif enrichment and ChIP results.
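Sketch of a Mann-Whitney U test, the test named above. How the two groups were formed in the paper is not stated on the slide; the example assumes ChIP-overlap scores for DHSs from motif-enriched clusters versus other clusters, with made-up numbers. The function uses average ranks for ties and a normal approximation without tie or continuity correction, so it is only illustrative for small samples; a statistics library would be the usual choice in practice.

```python
import math


def mann_whitney_u(a, b):
    """Return (U statistic for group a, two-sided P-value by normal approximation)."""
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1                                # extend over a run of tied values
        average_rank = (i + j) / 2 + 1            # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n_a, n_b = len(a), len(b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_two_sided


# Hypothetical ChIP-overlap scores (made-up values).
with_motif = [5.0, 6.0, 7.0, 8.0]
without_motif = [1.0, 2.0, 3.0, 4.0]
u_stat, p_value = mann_whitney_u(with_motif, without_motif)
```

A small P-value here means the ChIP-overlap scores of the two groups are unlikely to come from the same distribution, which is the sense in which motif enrichment and ChIP results "correspond".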
Part 7 Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor
Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor • Goal: determine whether individual TFs whose motifs are present in several clusters reveal biologically interesting properties of their function. • For each TF, we summarized motif results from all clusters and identified lineage trends. • Figure: the cell-type specificity for selected motifs (columns: motif, biologically relevant tissue)
Global transcription factor trends suggest AP-1 is a chromatin-accessibility factor • To characterize the regulatory elements that bind each factor. • Examined the CpG-content, genomic location, and tissue specificity of clusters where each TF motif was enriched
Part 8 Chromatin and expression signal correlation corresponds with known long-range interactions
Identifying target genes for DHSs • If the pattern of DNase-seq signal across cell types matches the pattern of expression of a gene across cell types, this provides evidence that the gene is a regulatory target of the DHS. • Correlate DHS signal with gene expression data to infer the target genes (both protein-coding and RNA) for each of the ~2.7M DHSs. • Limitations: number of cell types, high-order effects
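Sketch of linking one DHS to one candidate gene: correlate the DHS signal and the gene's expression across the same cell types, then estimate a permutation P-value by shuffling the expression values across cell types. Pearson correlation, the one-sided test, the number of permutations, and the toy values are assumptions, not details taken from the slides; a candidate gene would also have to lie within the distance window the analysis allows.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(dhs_signal, expression, n_perm=2000, seed=0):
    """One-sided P-value: share of shuffles at least as correlated as observed."""
    rng = random.Random(seed)
    observed = pearson(dhs_signal, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                     # break the cell-type pairing
        if pearson(dhs_signal, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)    # +1 avoids a P-value of zero


# Toy values across eight hypothetical cell types.
dhs_signal = [0.1, 0.2, 3.0, 2.8, 0.1, 0.3, 2.9, 0.2]
expression = [1.0, 1.2, 9.0, 8.5, 0.8, 1.1, 9.2, 1.0]
r, p = permutation_p_value(dhs_signal, expression)
```

A DHS-gene pair would be called significant when `p` falls below the chosen cutoff, for example the 0.05 threshold quoted in the findings.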
Findings • About 530k (20%) DHSs correlated significantly with at least one gene within 100 kb (permutation P-value < 0.05) • A significant enrichment over the 5% expected by chance • 71% correlate with a single gene, but some correlate with as many as 44 genes • 31k Ensembl genes (98%) correlated with at least one DHS • Median: 19 • Protein-coding genes tended to have more associations than RNA genes
Correlation between DHS and expression • Tie-plot showing the top 50 connections at the beta-globin locus • Tie-plot for the H19/IGF2 locus • Red marks below indicate DHSs; blue bars above represent genes. • Connecting lines represent significant correlations, with line width proportional to the correlation strength. • Some connections are far away and cross multiple gene boundaries
Web Resource • Query, display and extract data • Create a genome browser • http://dnase.genome.duke.edu
Conclusion • Input: 2.7 million DNase I hypersensitive sites from 72 cell types, plus gene expression data • Chromatin and expression signal correlation corresponds with known long-range interactions • Clustering using a self-organizing map: 1856 clusters • Variation in CpG-island, promoter, and conserved element overlap • Classification using a logistic classifier to predict cell-type lineage with 43 DHS inputs • Relations with transcription factors: motif discovery, JASPAR motif database
Contribution • The authors integrated chromatin accessibility and expression data from many human cell types. • Using ENCODE DNase-seq data, they clustered more than 2 million DHSs from 112 diverse biological samples by tissue specificity into 1856 chromatin profiles, and found each cluster to have a distinct bias relative to • Location • Evolutionary conservation • CpG islands • Promoter proximity
Contribution • Gene expression profiling + regulatory information • Cell-type classification • Assigned 112 samples to tissue groups and developed classifiers that assign tissue type based on DNase I hypersensitivity patterns across the cell-type groups. • Prediction accuracy > 80% in leave-one-out experiments • Similarly applied to the lineage of cancer cell types and to sex-specific DHSs
DNase-seq assays identify > 100,000 active REs but do not reveal the identity of the bound TFs • De novo motif discovery
Chromatin and expression signal correlation corresponds with known long-range interactions • Identifying target genes for DHSs • Cross-cell-type correlation among DHSs to identify blocks of similar regulatory elements and coexpressed genes • Correlating distal DHSs with promoter DHSs.