Bioinformatics training course: epigenomics: Enrichment Analysis of Genomic Regions

ChIP-seq和ATAC-seq的流程性分析之后，我们都会得到Enriched Peaks，下一步就是挖掘这些基因组区间的生物学意义，包括富集的生物通路、富集的其他已知的基因组区间注释等。

GREAT

GREAT（Genomic Regions Enrichment of Annotations Tool）是一个基因组区间注释和富集分析的工具，详细信息可阅读参考文献，作者也开发了在线工具。GREAT是一个用于分析基因组区域与生物学功能注释之间富集关系的工具。它的基本原理是基于这样一个假设：基因组的非编码区域与其附近基因的功能相关，因此可以通过分析这些区域的功能富集来推断潜在的生物学意义。我们要练习的是R版本rGREAT.

基本原理

基因组区间与基因的关联

为了能够对基因组区间进行功能富集分析，需要将他们和基因之间关联起来。GREAT工具提供了三种定义基因调控区域的方法。

默认的方法是Basal plus extension，这种方法有效地结合了局部调控（靠近TSS的调控）和远端调控（例如增强子），从而可以更全面地评估基因与基因组区间的关系。（1）每个基因首先被分配一个基础调控区，即转录起始位点（TSS）上游和下游的一个固定距离范围。这一固定的距离通常为上游5 kb到下游1 kb，或根据用户设置的特定范围。（2）如果基因的基础调控区域没有与其他基因的基础调控区域重叠，GREAT会进一步将该基因的调控区延伸到两侧的基因组区域，直到达到一定的距离阈值（例如 1 Mb），或者遇到相邻基因的调控区域。

统计模型一：超几何分布

统计模型二：二项分布

分析流程

测试数据

在前面的课程中，我们分析了一套ChIP-seq数据，包括GSM3112103（PU.1 ChIP-seq）和GSM3112107（Input），并产生了Narrow Peaks文件PU1_peaks.narrowPeak。如果你自己没有产生，也可以直接将培训的文件链接到自己的目录：

ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/macs2 .

suppressPackageStartupMessages(library(ChIPseeker))
peak <- readPeakFile('macs2/PU1_peaks.narrowPeak')
# 按照`macs2/PU1_peaks.xls`的注释，V8是-log10(pvalue)
peak = peak[order(-peak$V8)]
# 选取最显著的1000条作为例子
gr = peak[1:1000]
head(gr)

GRanges object with 6 ranges and 7 metadata columns:
      seqnames              ranges strand |              V4        V5
         <Rle>           <IRanges>  <Rle> |     <character> <integer>
  [1]     chr8   11858591-11859220      * | PU1_peak_115251      4096
  [2]     chr8 104329491-104330329      * | PU1_peak_120054      4077
  [3]    chr16   21790409-21791003      * |  PU1_peak_43541      4005
  [4]     chr6 146595779-146596463      * | PU1_peak_106445      3968
  [5]     chr3 194736448-194737879      * |  PU1_peak_82766      3958
  [6]     chr1 161185161-161186482      * |   PU1_peak_8656      3914
               V6        V7        V8        V9       V10
      <character> <numeric> <numeric> <numeric> <integer>
  [1]           .   118.523   419.103   409.612       283
  [2]           .   117.953   416.663   407.774       433
  [3]           .   109.006   408.306   400.571       296
  [4]           .   105.868   404.343   396.825       363
  [5]           .   119.895   403.294   395.828       964
  [6]           .   110.192   398.752   391.470       907
  -------
  seqinfo: 87 sequences from an unspecified genome; no seqlengths

测试GO中BP的富集：

res <- great(gr, "GO:BP", "TxDb.Hsapiens.UCSC.hg38.knownGene")
tbl <- getEnrichmentTables(res)
head(tbl)

          id                         description genome_fraction
1 GO:0002376               immune system process      0.17889065
2 GO:0006955                     immune response      0.11138071
3 GO:0006950                  response to stress      0.28083415
4 GO:0045321                leukocyte activation      0.08249659
5 GO:0001775                     cell activation      0.09419635
6 GO:0002682 regulation of immune system process      0.11892630
  observed_region_hits fold_enrichment      p_value     p_adjust
1                  294        1.648407 0.000000e+00 0.000000e+00
2                  192        1.729004 4.640732e-14 7.654888e-11
3                  387        1.382184 1.746381e-13 1.920437e-10
4                  151        1.835886 5.142553e-13 4.241321e-10
5                  162        1.724987 8.459344e-12 4.739438e-09
6                  193        1.627737 8.619772e-12 4.739438e-09
  mean_tss_dist observed_gene_hits gene_set_size
1        154832                259          2242
2        127174                169          1517
3        143573                372          3574
4        196384                121           868
5        199195                131          1001
6        167198                164          1357
  fold_enrichment_hyper p_value_hyper p_adjust_hyper
1              1.435421  3.313373e-10   5.753062e-08
2              1.384255  5.935014e-06   2.276699e-04
3              1.293312  1.116913e-08   1.116574e-06
4              1.732131  1.214759e-09   1.683719e-07
5              1.626119  1.345341e-08   1.305376e-06
6              1.501686  5.192630e-08   4.282621e-06

也可以使用MSigDB中的pathway：

msigdb:H Hallmark gene sets.
msigdb:C1 Positional gene sets.
msigdb:C2 Curated gene sets.
msigdb:C2:CGP C2 subcategory: chemical and genetic perturbations gene sets.
msigdb:C2:CP C2 subcategory: canonical pathways gene sets.
msigdb:C2:CP:KEGG C2 subcategory: KEGG subset of CP.
msigdb:C2:CP:PID C2 subcategory: PID subset of CP.
msigdb:C2:CP:REACTOME C2 subcategory: REACTOME subset of CP.
msigdb:C2:CP:WIKIPATHWAYS C2 subcategory: WIKIPATHWAYS subset of CP.
msigdb:C3 Regulatory target gene sets.
msigdb:C3:MIR:MIRDB miRDB of microRNA targets gene sets.
msigdb:C3:MIR:MIR_LEGACY MIR_Legacy of MIRDB.
msigdb:C3:TFT:GTRD GTRD transcription factor targets gene sets.
msigdb:C3:TFT:TFT_LEGACY TFT_Legacy.
msigdb:C4 Computational gene sets.
msigdb:C4:CGN C4 subcategory: cancer gene neighborhoods gene sets.
msigdb:C4:CM C4 subcategory: cancer modules gene sets.
msigdb:C5 Ontology gene sets.
msigdb:C5:GO:BP C5 subcategory: BP subset.
msigdb:C5:GO:CC C5 subcategory: CC subset.
msigdb:C5:GO:MF C5 subcategory: MF subset.
msigdb:C5:HPO C5 subcategory: human phenotype ontology gene sets.
msigdb:C6 Oncogenic signature gene sets.
msigdb:C7 Immunologic signature gene sets.
msigdb:C7:IMMUNESIGDB ImmuneSigDB subset of C7.
msigdb:C7:VAX C7 subcategory: vaccine response gene sets.
msigdb:C8 Cell type signature gene sets.

这里我们试一下其中的KEGG pathway：

res <- great(gr, "msigdb:C2:CP:KEGG", "TxDb.Hsapiens.UCSC.hg38.knownGene")
tbl <- getEnrichmentTables(res)
head(tbl)

                                                             id
1                                     KEGG_LEISHMANIA_INFECTION
2                   KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION
3 KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_LACTO_AND_NEOLACTO_SERIES
4                                                 KEGG_LYSOSOME
5                      KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION
6                KEGG_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY
  genome_fraction observed_region_hits fold_enrichment      p_value
1     0.005920726                   18        3.049316 4.328319e-05
2     0.019841248                   39        1.971517 7.229254e-05
3     0.002402115                   10        4.175524 1.930790e-04
4     0.007024225                   18        2.570271 3.430254e-04
5     0.002070474                    8        3.875477 1.315561e-03
6     0.007274593                   17        2.343932 1.326864e-03
     p_adjust mean_tss_dist observed_gene_hits gene_set_size
1 0.003036287        204028                 14            55
2 0.003036287        151945                 37           241
3 0.005406212         40847                  6            23
4 0.007203534         64506                 10           115
5 0.017729255        194328                  4            45
6 0.017729255        102012                 18           112
  fold_enrichment_hyper p_value_hyper p_adjust_hyper
1              3.162863  8.432574e-05    0.004426847
2              1.907654  1.054011e-04    0.004426847
3              3.241443  8.230194e-03    0.114070090
4              1.080481  4.476034e-01    0.662162547
5              1.104492  4.950245e-01    0.662162547
6              1.996960  3.556449e-03    0.074685430

我们还能获得Genomic Regions和基因之间的对应关系：

getRegionGeneAssociations(res)

GRanges object with 992 ranges and 9 metadata columns:
        seqnames              ranges strand |              V4
           <Rle>           <IRanges>  <Rle> |     <character>
    [1]     chr8   11858591-11859220      * | PU1_peak_115251
    [2]     chr8 104329491-104330329      * | PU1_peak_120054
    [3]    chr16   21790409-21791003      * |  PU1_peak_43541
    [4]     chr6 146595779-146596463      * | PU1_peak_106445
    [5]     chr3 194736448-194737879      * |  PU1_peak_82766
    ...      ...                 ...    ... .             ...
  [988]    chr12 116857970-116858462      * |  PU1_peak_30204
  [989]     chr2 181850775-181852119      * |  PU1_peak_64727
  [990]     chr4     2770650-2771300      * |  PU1_peak_83142
  [991]     chr4     6938094-6938682      * |  PU1_peak_83302
  [992]     chr5   43338830-43339452      * |  PU1_peak_92074
               V5          V6        V7        V8        V9       V10
        <integer> <character> <numeric> <numeric> <numeric> <integer>
    [1]      4096           .   118.523   419.103   409.612       283
    [2]      4077           .   117.953   416.663   407.774       433
    [3]      4005           .   109.006   408.306   400.571       296
    [4]      3968           .   105.868   404.343   396.825       363
    [5]      3958           .   119.895   403.294   395.828       964
    ...       ...         ...       ...       ...       ...       ...
  [988]      1913           .   60.1587   195.888   191.332       194
  [989]      1913           .   60.1587   195.888   191.332       357
  [990]      1913           .   60.1587   195.888   191.332       297
  [991]      1913           .   60.1587   195.888   191.332       309
  [992]      1913           .   60.1587   195.888   191.332       340
         annotated_genes     dist_to_TSS
         <CharacterList>   <IntegerList>
    [1]         DEFB130B          212527
    [2] SLC25A32,DCSTAMP   -827596,-8758
    [3]       CRYM,VWA3A -487326,-301535
    [4]             ADGB           -2504
    [5]    FAM43A,XXYLT1    50565,533280
    ...              ...             ...
  [988]      SPRING1,HRK   -119900,22979
  [989]  NEUROD1,ITPRID2  -169948,-39611
  [990]     TNIP2,SH3BP2   -14308,-21771
  [991]   TBC1D14,TADA2B   28852,-103217
  [992]     HMGCS1,CCL28    -25318,72939
  -------
  seqinfo: 87 sequences from an unspecified genome; no seqlengths

plotRegionGeneAssociations(res)

LOLA

LOLA（https://code.databio.org/LOLA/）是一种用于评估基因组区域集合之间重叠显著性的方法。其核心原理确实基于Fisher Exact Test，可阅读参考文献获取更详细的信息。

基本原理

LOLA的基本分析方法可用上图来描述。我们要分析的区间集叫做Query Set，它全部落在背景Universe Set中，要测试的是Query Set和知识库中的Region Set是否有显著性的重合。为了衡量显著性，需要计算四个值：Both Support (a)、Query ONLY (b)、Database ONLY (c)、Neither (d)。利用这四个值就可以使用Fisher’s Exact Test计算出显著性。

测试数据

我们将使用前面课程中的ATAC-seq数据作为测试数据。我们尝试了两种方法进行Peak Calling，并比较了Treatment和Control之间的差别，分别得到了三个文件：

ls -la macs2_atac | grep -v summits | grep bed

-rw-r--r-- 1 biocourse01 ods       197711 Sep 11 14:13 C_unique.bed
-rw-r--r-- 1 biocourse01 ods       412527 Sep 11 14:13 TC_common.bed
-rw-r--r-- 1 biocourse01 ods       177027 Sep 11 14:13 T_unique.bed

ls -la macs2_atac_PE | grep -v summits | grep bed

-rw-r--r-- 1 biocourse01 ods       359519 Sep 11 15:07 C_unique.bed
-rw-r--r-- 1 biocourse01 ods       523892 Sep 11 15:07 TC_common.bed
-rw-r--r-- 1 biocourse01 ods       336977 Sep 11 15:07 T_unique.bed

为了练习的方便，我们可以将这些文件链接到自己的目录下：

ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/macs2_atac .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/macs2_atac_PE .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/db .

构建RegionDB

需要准备一个文件夹，比如PE，其中包含regions子文件夹，该文件夹中是BED文件，每个文件代表一个Region Set。同时，还需要一个index.txt文件，包含每个BED文件的文件名和描述，比如：

head db/Knowledge/PE/index.txt

filename    description
C_unique.bed    Control unique
T_unique.bed    Treatment unique
TC_common.bed   Shared

分析流程

我们来检测一下两种方法做出来是否有显著性重合。

suppressPackageStartupMessages(library(ChIPseeker))
peaks = loadRegionDB('db/Query', useCache = TRUE)
regionDB = loadRegionDB('db/Knowledge', useCache = TRUE)
regionDB[[2]]

Index: <collection>
        filename cellType      description tissue dataSource antibody
          <char>   <char>           <char> <char>     <char>   <char>
1:  C_unique.bed     <NA>   Control unique   <NA>       <NA>     <NA>
2: TC_common.bed     <NA>           Shared   <NA>       <NA>     <NA>
3:  T_unique.bed     <NA> Treatment unique   <NA>       <NA>     <NA>
   treatment collection  size
      <char>     <char> <num>
1:      <NA>         PE  7317
2:      <NA>         PE 10510
3:      <NA>         PE  6840

userUniverse = reduce(unlist(peaks[[4]]))

# 检验C_unique
peaks[[4]][[1]]

GRanges object with 4041 ranges and 0 metadata columns:
                   seqnames              ranges strand
                      <Rle>           <IRanges>  <Rle>
     C_unique.bed1     chr1     6448054-6449305      *
     C_unique.bed2     chr1     6984157-6984946      *
     C_unique.bed3     chr1     7671600-7672198      *
     C_unique.bed4     chr1     8581664-8582488      *
     C_unique.bed5     chr1     8946701-8947290      *
               ...      ...                 ...    ...
  C_unique.bed4037     chrX 165158152-165158714      *
  C_unique.bed4038     chrX 166528115-166528621      *
  C_unique.bed4039     chrX 166618602-166619852      *
  C_unique.bed4040     chrY   90804237-90808037      *
  C_unique.bed4041     chrY   90809741-90810441      *
  -------
  seqinfo: 21 sequences from an unspecified genome; no seqlengths

runLOLA(peaks[[4]][[1]], userUniverse, regionDB, cores=1)

Key: <dbSet>
   userSet dbSet collection pValueLog    oddsRatio support rnkPV
     <int> <int>     <char>     <num>        <num>   <int> <int>
1:       1     1         PE  322.0052 134.16287058    3499     1
2:       1     2         PE    0.0000   0.03525545     204     2
3:       1     3         PE    0.0000   0.00000000       0     2
   rnkOR rnkSup maxRnk meanRnk     b     c     d      description
   <int>  <int>  <int>   <num> <int> <int> <int>           <char>
1:     1      1      1    1.00   529   542 11009   Control unique
2:     2      2      2    2.00  6938  3837  4600           Shared
3:     3      3      3    2.67  3676  4041  7862 Treatment unique
   cellType tissue antibody treatment dataSource      filename qValue
     <char> <char>   <char>    <char>     <char>        <char>  <num>
1:     <NA>   <NA>     <NA>      <NA>       <NA>  C_unique.bed      0
2:     <NA>   <NA>     <NA>      <NA>       <NA> TC_common.bed      1
3:     <NA>   <NA>     <NA>      <NA>       <NA>  T_unique.bed      1
    size
   <num>
1:  7317
2: 10510
3:  6840

# 检验TC_common
peaks[[4]][[2]]

GRanges object with 8286 ranges and 0 metadata columns:
                    seqnames            ranges strand
                       <Rle>         <IRanges>  <Rle>
     TC_common.bed1     chr1   3671665-3672385      *
     TC_common.bed2     chr1   4491978-4493829      *
     TC_common.bed3     chr1   4660138-4660935      *
     TC_common.bed4     chr1   4785362-4785923      *
     TC_common.bed5     chr1   4802561-4803297      *
                ...      ...               ...    ...
  TC_common.bed8282     chrY 90803513-90804673      *
  TC_common.bed8283     chrY 90805874-90807224      *
  TC_common.bed8284     chrY 90807763-90813258      *
  TC_common.bed8285     chrY 90824996-90825596      *
  TC_common.bed8286     chrY 90828388-90830153      *
  -------
  seqinfo: 21 sequences from an unspecified genome; no seqlengths

runLOLA(peaks[[4]][[2]], userUniverse, regionDB, cores=1)

   userSet dbSet collection pValueLog    oddsRatio support rnkPV
     <int> <int>     <char>     <num>        <num>   <int> <int>
1:       1     2         PE  322.0052 261.28768977    6994     1
2:       1     1         PE    0.0000   0.12691658     770     2
3:       1     3         PE    0.0000   0.05526242     364     2
   rnkOR rnkSup maxRnk meanRnk     b     c     d      description
   <int>  <int>  <int>   <num> <int> <int> <int>           <char>
1:     1      1      1    1.00   148  1292  7145           Shared
2:     2      2      2    2.00  3258  7516  4035   Control unique
3:     3      3      3    2.67  3312  7922  3981 Treatment unique
   cellType tissue antibody treatment dataSource      filename qValue
     <char> <char>   <char>    <char>     <char>        <char>  <num>
1:     <NA>   <NA>     <NA>      <NA>       <NA> TC_common.bed      0
2:     <NA>   <NA>     <NA>      <NA>       <NA>  C_unique.bed      1
3:     <NA>   <NA>     <NA>      <NA>       <NA>  T_unique.bed      1
    size
   <num>
1: 10510
2:  7317
3:  6840

# 检验T_unique
peaks[[4]][[3]]

GRanges object with 3599 ranges and 0 metadata columns:
                   seqnames              ranges strand
                      <Rle>           <IRanges>  <Rle>
     T_unique.bed1     chr1     3175366-3176112      *
     T_unique.bed2     chr1     3934241-3934936      *
     T_unique.bed3     chr1     5250694-5251311      *
     T_unique.bed4     chr1     5820048-5820594      *
     T_unique.bed5     chr1     7487947-7488620      *
               ...      ...                 ...    ...
  T_unique.bed3595     chrX 157380362-157381192      *
  T_unique.bed3596     chrX 159516419-159517041      *
  T_unique.bed3597     chrY     6769716-6770455      *
  T_unique.bed3598     chrY   89126019-89126623      *
  T_unique.bed3599     chrY   90738653-90741018      *
  -------
  seqinfo: 21 sequences from an unspecified genome; no seqlengths

runLOLA(peaks[[4]][[3]], userUniverse, regionDB, cores=1)

   userSet dbSet collection pValueLog    oddsRatio support rnkPV
     <int> <int>     <char>     <num>        <num>   <int> <int>
1:       1     3         PE  322.0052 612.82476996    3380     1
2:       1     2         PE    0.0000   0.01354958      69     2
3:       1     1         PE    0.0000   0.00000000       0     2
   rnkOR rnkSup maxRnk meanRnk     b     c     d      description
   <int>  <int>  <int>   <num> <int> <int> <int>           <char>
1:     1      1      1    1.00   296   219 11684 Treatment unique
2:     2      2      2    2.00  7073  3530  4907           Shared
3:     3      3      3    2.67  4028  3599  7952   Control unique
   cellType tissue antibody treatment dataSource      filename qValue
     <char> <char>   <char>    <char>     <char>        <char>  <num>
1:     <NA>   <NA>     <NA>      <NA>       <NA>  T_unique.bed      0
2:     <NA>   <NA>     <NA>      <NA>       <NA> TC_common.bed      1
3:     <NA>   <NA>     <NA>      <NA>       <NA>  C_unique.bed      1
    size
   <num>
1:  6840
2: 10510
3:  7317