Bulk RNA-seq的功能富集分析

全局设定

我们将使用系统内安装的R:

# bash命令
export PATH="/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/envs/scRNA/bin:$PATH"

加载需要使用的软件包:

读入数据

res = readRDS(file = "example_DESeq2.RDS")
head(res)
log2 fold change (MLE): group stroke vs ctrl 
Wald test p-value: group stroke vs ctrl 
DataFrame with 6 rows and 8 columns
          baseMean log2FoldChange     lfcSE       stat     pvalue
         <numeric>      <numeric> <numeric>  <numeric>  <numeric>
A1BG      138.9053      0.5696487  0.366932  1.5524628 0.12055154
A1BG-AS1   94.0295     -0.0194117  0.329088 -0.0589862 0.95296312
A1CF       12.2568     -0.0143727  0.875907 -0.0164089 0.98690817
A2M       175.9281     -1.5962724  0.612288 -2.6070619 0.00913229
A2M-AS1    70.6930     -0.9688552  0.494190 -1.9604894 0.04993861
A2ML1      20.0849     -0.3049334  0.624204 -0.4885156 0.62518465
              padj         DEG     LogFDR
         <numeric> <character>  <numeric>
A1BG     0.2522373          NC 0.59819076
A1BG-AS1 0.9747968          NC 0.01108592
A1CF     0.9929060          NC 0.00309186
A2M      0.0325865          DN 1.48696254
A2M-AS1  0.1296658          NC 0.88717461
A2ML1    0.7703000          NC 0.11334008

Over Representation Analysis (ORA)

基本原理

超几何分布检验是做通路富集分析最常用的方法:

DE non-DE total
Insiside Gene Set k m-k m
Outside Gene Set n-k N+k-n-m N-m
total n N-n N

其中:

R中可以使用如下函数检验显著性:

phyper(k, m, N-m, n, lower.tail = FALSE)

定义输入

# log2FoldChange是可选的筛选的标准
res = na.exclude(res)
Up = rownames(filter(data.frame(res), DEG == "UP", log2FoldChange >  1))
Dn = rownames(filter(data.frame(res), DEG == "DN", log2FoldChange < -1))

# 背景
Pg = rownames(res)

ID之间的转换:

gene.df <- bitr(Up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)

# background
backgroud <- bitr(rownames(res), fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = 'org.Hs.eg.db')$ENTREZID

检验GO富集

ego <- enrichGO(gene          = gene.df$ENTREZID,
                OrgDb         = org.Hs.eg.db,
                keyType       = "ENTREZID",
                ont           = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)
head(ego[, 1:7])
                   ID
GO:0061644 GO:0061644
GO:0006334 GO:0006334
GO:0034728 GO:0034728
GO:0030099 GO:0030099
GO:0061515 GO:0061515
GO:0010506 GO:0010506
                                                   Description
GO:0061644 protein localization to CENP-A containing chromatin
GO:0006334                                 nucleosome assembly
GO:0034728                             nucleosome organization
GO:0030099                        myeloid cell differentiation
GO:0061515                            myeloid cell development
GO:0010506                             regulation of autophagy
           GeneRatio   BgRatio       pvalue     p.adjust       qvalue
GO:0061644   15/2235  18/18870 6.981139e-12 4.162155e-08 3.859468e-08
GO:0006334   42/2235 118/18870 1.601777e-11 4.774896e-08 4.427648e-08
GO:0034728   44/2235 138/18870 3.343898e-10 6.645439e-07 6.162158e-07
GO:0030099   95/2235 430/18870 9.466825e-10 1.411030e-06 1.308415e-06
GO:0061515   29/2235  85/18870 6.333996e-08 6.752295e-05 6.261243e-05
GO:0010506   77/2235 355/18870 8.253368e-08 6.752295e-05 6.261243e-05

可视化:

dotplot(ego, showCategory=20)+ 
           theme(axis.text.y = element_text(size=15), axis.text.x = element_text(size = 15), legend.text = element_text(size=15), 
                 panel.grid.major = element_blank(),panel.grid.minor = element_blank())

也可以使用barplot来展示:

barplot(ego,showCategory=20)+ 
           theme(axis.text.y = element_text(size=15), axis.text.x = element_text(size = 15),
                 legend.text = element_text(size=15), panel.grid = element_blank())

还可以自定义展示FDR:

ggplot(ego[1:20,], aes(x=-log(p.adjust, base=10), y=reorder(Description,p.adjust,decreasing = T))) + 
 geom_bar(stat = "identity", fill="red") +
 theme_bw() +
 xlab(label='-log10 FDR') +
 ylab(label='') +
 theme(axis.text.y = element_text(size=12,colour="black"), axis.text.x = element_text(size = 15,colour="black"), 
      panel.grid= element_blank())

检验KEGG通路富集

kk <- enrichKEGG(gene         = gene.df$ENTREZID,
                 organism     = 'hsa',
                 pAdjustMethod = "BH",
                 pvalueCutoff = 0.01,
                 qvalueCutoff = 0.05,)
head(kk[,1:7])
                                     category          subcategory
hsa04613                   Organismal Systems        Immune system
hsa05034                       Human Diseases Substance dependence
hsa05322                       Human Diseases       Immune disease
hsa04010 Environmental Information Processing  Signal transduction
hsa05231                       Human Diseases     Cancer: overview
hsa05202                       Human Diseases     Cancer: overview
               ID                             Description GeneRatio
hsa04613 hsa04613 Neutrophil extracellular trap formation   75/1133
hsa05034 hsa05034                              Alcoholism   72/1133
hsa05322 hsa05322            Systemic lupus erythematosus   59/1133
hsa04010 hsa04010                  MAPK signaling pathway   69/1133
hsa05231 hsa05231            Choline metabolism in cancer   29/1133
hsa05202 hsa05202 Transcriptional misregulation in cancer   46/1133
          BgRatio       pvalue
hsa04613 192/8842 1.601158e-20
hsa05034 188/8842 3.605844e-19
hsa05322 139/8842 1.959772e-18
hsa04010 300/8842 5.497397e-07
hsa05231  99/8842 1.043840e-05
hsa05202 193/8842 1.640314e-05

需要注意的是,在分析GO和KEGG富集的时候,需要根据情况设定背景基因集,在enrichGOenrichKEGG中,默认的是数据库中所有的基因(All genes in the TERM2GENE table)。

使用MSigDB数据库

MSigDB是一个Molucular Signatures Database, 也包含了多种数据库通路的基因集,如KEGG,‘GO’,‘BIOCARTA’, REACTOME等。

x <- msigdbr_collections()
print(x, n = 1000)
# A tibble: 23 × 3
   gs_cat gs_subcat         num_genesets
   <chr>  <chr>                    <int>
 1 C1     ""                         299
 2 C2     "CGP"                     3384
 3 C2     "CP"                        29
 4 C2     "CP:BIOCARTA"              292
 5 C2     "CP:KEGG"                  186
 6 C2     "CP:PID"                   196
 7 C2     "CP:REACTOME"             1615
 8 C2     "CP:WIKIPATHWAYS"          664
 9 C3     "MIR:MIRDB"               2377
10 C3     "MIR:MIR_Legacy"           221
11 C3     "TFT:GTRD"                 518
12 C3     "TFT:TFT_Legacy"           610
13 C4     "CGN"                      427
14 C4     "CM"                       431
15 C5     "GO:BP"                   7658
16 C5     "GO:CC"                   1006
17 C5     "GO:MF"                   1738
18 C5     "HPO"                     5071
19 C6     ""                         189
20 C7     "IMMUNESIGDB"             4872
21 C7     "VAX"                      347
22 C8     ""                         700
23 H      ""                          50

还是选择其中的KEGG进行分析:

KEGG_DB  <- msigdbr(species = "human", category = "C2", subcategory = 'KEGG')
head(KEGG_DB)
# A tibble: 6 × 15
  gs_cat gs_subcat gs_name        gene_symbol entrez_gene ensembl_gene
  <chr>  <chr>     <chr>          <chr>             <int> <chr>       
1 C2     CP:KEGG   KEGG_ABC_TRAN… ABCA1                19 ENSG0000016…
2 C2     CP:KEGG   KEGG_ABC_TRAN… ABCA10            10349 ENSG0000015…
3 C2     CP:KEGG   KEGG_ABC_TRAN… ABCA12            26154 ENSG0000014…
4 C2     CP:KEGG   KEGG_ABC_TRAN… ABCA13           154664 ENSG0000017…
5 C2     CP:KEGG   KEGG_ABC_TRAN… ABCA2                20 ENSG0000010…
6 C2     CP:KEGG   KEGG_ABC_TRAN… ABCA3                21 ENSG0000016…
# ℹ 9 more variables: human_gene_symbol <chr>,
#   human_entrez_gene <int>, human_ensembl_gene <chr>, gs_id <chr>,
#   gs_pmid <chr>, gs_geoid <chr>, gs_exact_source <chr>,
#   gs_url <chr>, gs_description <chr>

制作Term to Gene表:

KEGG_DB_t2g <- KEGG_DB %>% dplyr::distinct(gs_name, entrez_gene) %>% as.data.frame()
head(KEGG_DB_t2g)
                gs_name entrez_gene
1 KEGG_ABC_TRANSPORTERS          19
2 KEGG_ABC_TRANSPORTERS       10349
3 KEGG_ABC_TRANSPORTERS       26154
4 KEGG_ABC_TRANSPORTERS      154664
5 KEGG_ABC_TRANSPORTERS          20
6 KEGG_ABC_TRANSPORTERS          21

进行富集分析:

upKEGG2 <- enricher(gene = gene.df$ENTREZID, universe = backgroud, TERM2GENE = KEGG_DB_t2g, pvalueCutoff = 0.05)
upKEGG2 <- setReadable(upKEGG2, OrgDb = 'org.Hs.eg.db', keyType = 'ENTREZID')
head(upKEGG2[,1:7])
                                                                     ID
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS     KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS
KEGG_GLYCEROPHOSPHOLIPID_METABOLISM KEGG_GLYCEROPHOSPHOLIPID_METABOLISM
                                                            Description
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS     KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS
KEGG_GLYCEROPHOSPHOLIPID_METABOLISM KEGG_GLYCEROPHOSPHOLIPID_METABOLISM
                                    GeneRatio  BgRatio       pvalue
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS      59/722 117/4402 5.707778e-18
KEGG_GLYCEROPHOSPHOLIPID_METABOLISM    24/722  70/4402 1.798461e-04
                                        p.adjust       qvalue
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS   1.010277e-15 9.673182e-16
KEGG_GLYCEROPHOSPHOLIPID_METABOLISM 1.591638e-02 1.523959e-02

可视化:

dotplot(upKEGG2,showCategory=20)+ 
           theme(axis.text.y = element_text(size=15), axis.text.x = element_text(size = 15), legend.text = element_text(size=15), 
                 panel.grid.major = element_blank(),panel.grid.minor = element_blank())

GSEA分析

GSEA是广泛使用的通路富集分析方法,当样本的生物学重复数较少时,可以按照差异表达水平对基因进行排序,然后检验给定的基因集是否随机分布在这个排序上。如果不随机,说明显著性富集。

fc = data.frame(gene_symbol = rownames(res), log2FoldChange = res$log2FoldChange)
fc = filter(fc, !is.na(log2FoldChange))
head(fc)
  gene_symbol log2FoldChange
1        A1BG     0.56964866
2    A1BG-AS1    -0.01941167
3        A1CF    -0.01437268
4         A2M    -1.59627245
5     A2M-AS1    -0.96885522
6       A2ML1    -0.30493344
write.table(fc, file='training_stroke_vs_control_DESeq2_log2FC.rnk', sep='\t', col.names=F, row.names=F, quote=F)

在命令行运行GSEA:

/sibcb1/bioinformatics/biocourse/biocourse01/software/GSEA_Linux_4.3.3/gsea-cli.sh GSEAPreranked \
  -rnk training_stroke_vs_control_DESeq2_log2FC.rnk \
  -gmx c2.cp.kegg_medicus.v2023.2.Hs.symbols.gmt \
  -collapse false \
  -scoring_scheme weighted \
  -nperm 1000 \
  -rpt_label kegg \
  -make_sets true \
  -plot_top_x 20 \
  -set_max 500 -set_min 15 \
  -zip_report false \
  -create_svgs true \
  -out gsea

分析结果在gsea文件夹中。