ChIP-seq和ATAC-seq的流程性分析之后,我们都会得到Enriched Peaks,下一步就是挖掘这些基因组区间的生物学意义,包括富集的生物通路、富集的其他已知的基因组区间注释等。
GREAT(Genomic Regions Enrichment of Annotations Tool)是一个基因组区间注释和富集分析的工具,详细信息可阅读参考文献,作者也开发了在线工具。GREAT是一个用于分析基因组区域与生物学功能注释之间富集关系的工具。它的基本原理是基于这样一个假设:基因组的非编码区域与其附近基因的功能相关,因此可以通过分析这些区域的功能富集来推断潜在的生物学意义。我们要练习的是R版本rGREAT.
为了能够对基因组区间进行功能富集分析,需要将他们和基因之间关联起来。GREAT工具提供了三种定义基因调控区域的方法。
默认的方法是Basal plus extension
,这种方法有效地结合了局部调控(靠近TSS的调控)和远端调控(例如增强子),从而可以更全面地评估基因与基因组区间的关系。(1)每个基因首先被分配一个基础调控区,即转录起始位点(TSS)上游和下游的一个固定距离范围。这一固定的距离通常为上游5 kb到下游1 kb,或根据用户设置的特定范围。(2)如果基因的基础调控区域没有与其他基因的基础调控区域重叠,GREAT会进一步将该基因的调控区延伸到两侧的基因组区域,直到达到一定的距离阈值(例如 1 Mb),或者遇到相邻基因的调控区域。
在前面的课程中,我们分析了一套ChIP-seq数据,包括GSM3112103(PU.1 ChIP-seq)和GSM3112107(Input),并产生了Narrow Peaks文件PU1_peaks.narrowPeak
。如果你自己没有产生,也可以直接将培训的文件链接到自己的目录:
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/macs2 .
suppressPackageStartupMessages(library(ChIPseeker))
peak <- readPeakFile('macs2/PU1_peaks.narrowPeak')
# 按照`macs2/PU1_peaks.xls`的注释,V8是-log10(pvalue)
peak = peak[order(-peak$V8)]
# 选取最显著的1000条作为例子
gr = peak[1:1000]
head(gr)
GRanges object with 6 ranges and 7 metadata columns:
seqnames ranges strand | V4 V5
<Rle> <IRanges> <Rle> | <character> <integer>
[1] chr8 11858591-11859220 * | PU1_peak_115251 4096
[2] chr8 104329491-104330329 * | PU1_peak_120054 4077
[3] chr16 21790409-21791003 * | PU1_peak_43541 4005
[4] chr6 146595779-146596463 * | PU1_peak_106445 3968
[5] chr3 194736448-194737879 * | PU1_peak_82766 3958
[6] chr1 161185161-161186482 * | PU1_peak_8656 3914
V6 V7 V8 V9 V10
<character> <numeric> <numeric> <numeric> <integer>
[1] . 118.523 419.103 409.612 283
[2] . 117.953 416.663 407.774 433
[3] . 109.006 408.306 400.571 296
[4] . 105.868 404.343 396.825 363
[5] . 119.895 403.294 395.828 964
[6] . 110.192 398.752 391.470 907
-------
seqinfo: 87 sequences from an unspecified genome; no seqlengths
测试GO中BP的富集:
res <- great(gr, "GO:BP", "TxDb.Hsapiens.UCSC.hg38.knownGene")
tbl <- getEnrichmentTables(res)
head(tbl)
id description genome_fraction
1 GO:0002376 immune system process 0.17889065
2 GO:0006955 immune response 0.11138071
3 GO:0006950 response to stress 0.28083415
4 GO:0045321 leukocyte activation 0.08249659
5 GO:0001775 cell activation 0.09419635
6 GO:0002682 regulation of immune system process 0.11892630
observed_region_hits fold_enrichment p_value p_adjust
1 294 1.648407 0.000000e+00 0.000000e+00
2 192 1.729004 4.640732e-14 7.654888e-11
3 387 1.382184 1.746381e-13 1.920437e-10
4 151 1.835886 5.142553e-13 4.241321e-10
5 162 1.724987 8.459344e-12 4.739438e-09
6 193 1.627737 8.619772e-12 4.739438e-09
mean_tss_dist observed_gene_hits gene_set_size
1 154832 259 2242
2 127174 169 1517
3 143573 372 3574
4 196384 121 868
5 199195 131 1001
6 167198 164 1357
fold_enrichment_hyper p_value_hyper p_adjust_hyper
1 1.435421 3.313373e-10 5.753062e-08
2 1.384255 5.935014e-06 2.276699e-04
3 1.293312 1.116913e-08 1.116574e-06
4 1.732131 1.214759e-09 1.683719e-07
5 1.626119 1.345341e-08 1.305376e-06
6 1.501686 5.192630e-08 4.282621e-06
也可以使用MSigDB中的pathway:
msigdb:H Hallmark gene sets.
msigdb:C1 Positional gene sets.
msigdb:C2 Curated gene sets.
msigdb:C2:CGP C2 subcategory: chemical and genetic perturbations gene sets.
msigdb:C2:CP C2 subcategory: canonical pathways gene sets.
msigdb:C2:CP:KEGG C2 subcategory: KEGG subset of CP.
msigdb:C2:CP:PID C2 subcategory: PID subset of CP.
msigdb:C2:CP:REACTOME C2 subcategory: REACTOME subset of CP.
msigdb:C2:CP:WIKIPATHWAYS C2 subcategory: WIKIPATHWAYS subset of CP.
msigdb:C3 Regulatory target gene sets.
msigdb:C3:MIR:MIRDB miRDB of microRNA targets gene sets.
msigdb:C3:MIR:MIR_LEGACY MIR_Legacy of MIRDB.
msigdb:C3:TFT:GTRD GTRD transcription factor targets gene sets.
msigdb:C3:TFT:TFT_LEGACY TFT_Legacy.
msigdb:C4 Computational gene sets.
msigdb:C4:CGN C4 subcategory: cancer gene neighborhoods gene sets.
msigdb:C4:CM C4 subcategory: cancer modules gene sets.
msigdb:C5 Ontology gene sets.
msigdb:C5:GO:BP C5 subcategory: BP subset.
msigdb:C5:GO:CC C5 subcategory: CC subset.
msigdb:C5:GO:MF C5 subcategory: MF subset.
msigdb:C5:HPO C5 subcategory: human phenotype ontology gene sets.
msigdb:C6 Oncogenic signature gene sets.
msigdb:C7 Immunologic signature gene sets.
msigdb:C7:IMMUNESIGDB ImmuneSigDB subset of C7.
msigdb:C7:VAX C7 subcategory: vaccine response gene sets.
msigdb:C8 Cell type signature gene sets.
这里我们试一下其中的KEGG pathway:
res <- great(gr, "msigdb:C2:CP:KEGG", "TxDb.Hsapiens.UCSC.hg38.knownGene")
tbl <- getEnrichmentTables(res)
head(tbl)
id
1 KEGG_LEISHMANIA_INFECTION
2 KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION
3 KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_LACTO_AND_NEOLACTO_SERIES
4 KEGG_LYSOSOME
5 KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION
6 KEGG_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY
genome_fraction observed_region_hits fold_enrichment p_value
1 0.005920726 18 3.049316 4.328319e-05
2 0.019841248 39 1.971517 7.229254e-05
3 0.002402115 10 4.175524 1.930790e-04
4 0.007024225 18 2.570271 3.430254e-04
5 0.002070474 8 3.875477 1.315561e-03
6 0.007274593 17 2.343932 1.326864e-03
p_adjust mean_tss_dist observed_gene_hits gene_set_size
1 0.003036287 204028 14 55
2 0.003036287 151945 37 241
3 0.005406212 40847 6 23
4 0.007203534 64506 10 115
5 0.017729255 194328 4 45
6 0.017729255 102012 18 112
fold_enrichment_hyper p_value_hyper p_adjust_hyper
1 3.162863 8.432574e-05 0.004426847
2 1.907654 1.054011e-04 0.004426847
3 3.241443 8.230194e-03 0.114070090
4 1.080481 4.476034e-01 0.662162547
5 1.104492 4.950245e-01 0.662162547
6 1.996960 3.556449e-03 0.074685430
我们还能获得Genomic Regions和基因之间的对应关系:
getRegionGeneAssociations(res)
GRanges object with 992 ranges and 9 metadata columns:
seqnames ranges strand | V4
<Rle> <IRanges> <Rle> | <character>
[1] chr8 11858591-11859220 * | PU1_peak_115251
[2] chr8 104329491-104330329 * | PU1_peak_120054
[3] chr16 21790409-21791003 * | PU1_peak_43541
[4] chr6 146595779-146596463 * | PU1_peak_106445
[5] chr3 194736448-194737879 * | PU1_peak_82766
... ... ... ... . ...
[988] chr12 116857970-116858462 * | PU1_peak_30204
[989] chr2 181850775-181852119 * | PU1_peak_64727
[990] chr4 2770650-2771300 * | PU1_peak_83142
[991] chr4 6938094-6938682 * | PU1_peak_83302
[992] chr5 43338830-43339452 * | PU1_peak_92074
V5 V6 V7 V8 V9 V10
<integer> <character> <numeric> <numeric> <numeric> <integer>
[1] 4096 . 118.523 419.103 409.612 283
[2] 4077 . 117.953 416.663 407.774 433
[3] 4005 . 109.006 408.306 400.571 296
[4] 3968 . 105.868 404.343 396.825 363
[5] 3958 . 119.895 403.294 395.828 964
... ... ... ... ... ... ...
[988] 1913 . 60.1587 195.888 191.332 194
[989] 1913 . 60.1587 195.888 191.332 357
[990] 1913 . 60.1587 195.888 191.332 297
[991] 1913 . 60.1587 195.888 191.332 309
[992] 1913 . 60.1587 195.888 191.332 340
annotated_genes dist_to_TSS
<CharacterList> <IntegerList>
[1] DEFB130B 212527
[2] SLC25A32,DCSTAMP -827596,-8758
[3] CRYM,VWA3A -487326,-301535
[4] ADGB -2504
[5] FAM43A,XXYLT1 50565,533280
... ... ...
[988] SPRING1,HRK -119900,22979
[989] NEUROD1,ITPRID2 -169948,-39611
[990] TNIP2,SH3BP2 -14308,-21771
[991] TBC1D14,TADA2B 28852,-103217
[992] HMGCS1,CCL28 -25318,72939
-------
seqinfo: 87 sequences from an unspecified genome; no seqlengths
plotRegionGeneAssociations(res)
LOLA(https://code.databio.org/LOLA/) 是一种用于评估基因组区域集合之间重叠显著性的方法。其核心原理确实基于Fisher Exact Test,可阅读参考文献获取更详细的信息。
LOLA的基本分析方法可用上图来描述。我们要分析的区间集叫做Query Set
,它全部落在背景Universe Set
中,要测试的是Query Set
和知识库中的Region Set
是否有显著性的重合。为了衡量显著性,需要计算四个值:Both Support (a)、Query ONLY (b)、Database ONLY (c)、Neither (d)。利用这四个值就可以使用Fisher’s Exact Test计算出显著性。
我们将使用前面课程中的ATAC-seq数据作为测试数据。我们尝试了两种方法进行Peak Calling,并比较了Treatment和Control之间的差别,分别得到了三个文件:
ls -la macs2_atac | grep -v summits | grep bed
-rw-r--r-- 1 biocourse01 ods 197711 Sep 11 14:13 C_unique.bed
-rw-r--r-- 1 biocourse01 ods 412527 Sep 11 14:13 TC_common.bed
-rw-r--r-- 1 biocourse01 ods 177027 Sep 11 14:13 T_unique.bed
ls -la macs2_atac_PE | grep -v summits | grep bed
-rw-r--r-- 1 biocourse01 ods 359519 Sep 11 15:07 C_unique.bed
-rw-r--r-- 1 biocourse01 ods 523892 Sep 11 15:07 TC_common.bed
-rw-r--r-- 1 biocourse01 ods 336977 Sep 11 15:07 T_unique.bed
为了练习的方便,我们可以将这些文件链接到自己的目录下:
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/macs2_atac .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/macs2_atac_PE .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/db .
需要准备一个文件夹,比如PE
,其中包含regions
子文件夹,该文件夹中是BED文件,每个文件代表一个Region Set
。同时,还需要一个index.txt
文件,包含每个BED文件的文件名和描述,比如:
head db/Knowledge/PE/index.txt
filename description
C_unique.bed Control unique
T_unique.bed Treatment unique
TC_common.bed Shared
我们来检测一下两种方法做出来是否有显著性重合。
suppressPackageStartupMessages(library(ChIPseeker))
peaks = loadRegionDB('db/Query', useCache = TRUE)
regionDB = loadRegionDB('db/Knowledge', useCache = TRUE)
regionDB[[2]]
Index: <collection>
filename cellType description tissue dataSource antibody
<char> <char> <char> <char> <char> <char>
1: C_unique.bed <NA> Control unique <NA> <NA> <NA>
2: TC_common.bed <NA> Shared <NA> <NA> <NA>
3: T_unique.bed <NA> Treatment unique <NA> <NA> <NA>
treatment collection size
<char> <char> <num>
1: <NA> PE 7317
2: <NA> PE 10510
3: <NA> PE 6840
userUniverse = reduce(unlist(peaks[[4]]))
# 检验C_unique
peaks[[4]][[1]]
GRanges object with 4041 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
C_unique.bed1 chr1 6448054-6449305 *
C_unique.bed2 chr1 6984157-6984946 *
C_unique.bed3 chr1 7671600-7672198 *
C_unique.bed4 chr1 8581664-8582488 *
C_unique.bed5 chr1 8946701-8947290 *
... ... ... ...
C_unique.bed4037 chrX 165158152-165158714 *
C_unique.bed4038 chrX 166528115-166528621 *
C_unique.bed4039 chrX 166618602-166619852 *
C_unique.bed4040 chrY 90804237-90808037 *
C_unique.bed4041 chrY 90809741-90810441 *
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths
runLOLA(peaks[[4]][[1]], userUniverse, regionDB, cores=1)
Key: <dbSet>
userSet dbSet collection pValueLog oddsRatio support rnkPV
<int> <int> <char> <num> <num> <int> <int>
1: 1 1 PE 322.0052 134.16287058 3499 1
2: 1 2 PE 0.0000 0.03525545 204 2
3: 1 3 PE 0.0000 0.00000000 0 2
rnkOR rnkSup maxRnk meanRnk b c d description
<int> <int> <int> <num> <int> <int> <int> <char>
1: 1 1 1 1.00 529 542 11009 Control unique
2: 2 2 2 2.00 6938 3837 4600 Shared
3: 3 3 3 2.67 3676 4041 7862 Treatment unique
cellType tissue antibody treatment dataSource filename qValue
<char> <char> <char> <char> <char> <char> <num>
1: <NA> <NA> <NA> <NA> <NA> C_unique.bed 0
2: <NA> <NA> <NA> <NA> <NA> TC_common.bed 1
3: <NA> <NA> <NA> <NA> <NA> T_unique.bed 1
size
<num>
1: 7317
2: 10510
3: 6840
# 检验TC_common
peaks[[4]][[2]]
GRanges object with 8286 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
TC_common.bed1 chr1 3671665-3672385 *
TC_common.bed2 chr1 4491978-4493829 *
TC_common.bed3 chr1 4660138-4660935 *
TC_common.bed4 chr1 4785362-4785923 *
TC_common.bed5 chr1 4802561-4803297 *
... ... ... ...
TC_common.bed8282 chrY 90803513-90804673 *
TC_common.bed8283 chrY 90805874-90807224 *
TC_common.bed8284 chrY 90807763-90813258 *
TC_common.bed8285 chrY 90824996-90825596 *
TC_common.bed8286 chrY 90828388-90830153 *
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths
runLOLA(peaks[[4]][[2]], userUniverse, regionDB, cores=1)
userSet dbSet collection pValueLog oddsRatio support rnkPV
<int> <int> <char> <num> <num> <int> <int>
1: 1 2 PE 322.0052 261.28768977 6994 1
2: 1 1 PE 0.0000 0.12691658 770 2
3: 1 3 PE 0.0000 0.05526242 364 2
rnkOR rnkSup maxRnk meanRnk b c d description
<int> <int> <int> <num> <int> <int> <int> <char>
1: 1 1 1 1.00 148 1292 7145 Shared
2: 2 2 2 2.00 3258 7516 4035 Control unique
3: 3 3 3 2.67 3312 7922 3981 Treatment unique
cellType tissue antibody treatment dataSource filename qValue
<char> <char> <char> <char> <char> <char> <num>
1: <NA> <NA> <NA> <NA> <NA> TC_common.bed 0
2: <NA> <NA> <NA> <NA> <NA> C_unique.bed 1
3: <NA> <NA> <NA> <NA> <NA> T_unique.bed 1
size
<num>
1: 10510
2: 7317
3: 6840
# 检验T_unique
peaks[[4]][[3]]
GRanges object with 3599 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
T_unique.bed1 chr1 3175366-3176112 *
T_unique.bed2 chr1 3934241-3934936 *
T_unique.bed3 chr1 5250694-5251311 *
T_unique.bed4 chr1 5820048-5820594 *
T_unique.bed5 chr1 7487947-7488620 *
... ... ... ...
T_unique.bed3595 chrX 157380362-157381192 *
T_unique.bed3596 chrX 159516419-159517041 *
T_unique.bed3597 chrY 6769716-6770455 *
T_unique.bed3598 chrY 89126019-89126623 *
T_unique.bed3599 chrY 90738653-90741018 *
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths
runLOLA(peaks[[4]][[3]], userUniverse, regionDB, cores=1)
userSet dbSet collection pValueLog oddsRatio support rnkPV
<int> <int> <char> <num> <num> <int> <int>
1: 1 3 PE 322.0052 612.82476996 3380 1
2: 1 2 PE 0.0000 0.01354958 69 2
3: 1 1 PE 0.0000 0.00000000 0 2
rnkOR rnkSup maxRnk meanRnk b c d description
<int> <int> <int> <num> <int> <int> <int> <char>
1: 1 1 1 1.00 296 219 11684 Treatment unique
2: 2 2 2 2.00 7073 3530 4907 Shared
3: 3 3 3 2.67 4028 3599 7952 Control unique
cellType tissue antibody treatment dataSource filename qValue
<char> <char> <char> <char> <char> <char> <num>
1: <NA> <NA> <NA> <NA> <NA> T_unique.bed 0
2: <NA> <NA> <NA> <NA> <NA> TC_common.bed 1
3: <NA> <NA> <NA> <NA> <NA> C_unique.bed 1
size
<num>
1: 6840
2: 10510
3: 7317