我们将使用系统内安装的R:
# bash命令
export PATH="/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/envs/scRNA/bin:$PATH"
加载需要使用的软件包:
log2 fold change (MLE): group stroke vs ctrl
Wald test p-value: group stroke vs ctrl
DataFrame with 6 rows and 8 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
A1BG 138.9053 0.5696487 0.366932 1.5524628 0.12055154
A1BG-AS1 94.0295 -0.0194117 0.329088 -0.0589862 0.95296312
A1CF 12.2568 -0.0143727 0.875907 -0.0164089 0.98690817
A2M 175.9281 -1.5962724 0.612288 -2.6070619 0.00913229
A2M-AS1 70.6930 -0.9688552 0.494190 -1.9604894 0.04993861
A2ML1 20.0849 -0.3049334 0.624204 -0.4885156 0.62518465
padj DEG LogFDR
<numeric> <character> <numeric>
A1BG 0.2522373 NC 0.59819076
A1BG-AS1 0.9747968 NC 0.01108592
A1CF 0.9929060 NC 0.00309186
A2M 0.0325865 DN 1.48696254
A2M-AS1 0.1296658 NC 0.88717461
A2ML1 0.7703000 NC 0.11334008
超几何分布检验是做通路富集分析最常用的方法:
DE | non-DE | total | |
---|---|---|---|
Insiside Gene Set | k | m-k | m |
Outside Gene Set | n-k | N+k-n-m | N-m |
total | n | N-n | N |
其中:
在R
中可以使用如下函数检验显著性:
phyper(k, m, N-m, n, lower.tail = FALSE)
# log2FoldChange是可选的筛选的标准
res = na.exclude(res)
Up = rownames(filter(data.frame(res), DEG == "UP", log2FoldChange > 1))
Dn = rownames(filter(data.frame(res), DEG == "DN", log2FoldChange < -1))
# 背景
Pg = rownames(res)
ID之间的转换:
ego <- enrichGO(gene = gene.df$ENTREZID,
OrgDb = org.Hs.eg.db,
keyType = "ENTREZID",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego[, 1:7])
ID
GO:0061644 GO:0061644
GO:0006334 GO:0006334
GO:0034728 GO:0034728
GO:0030099 GO:0030099
GO:0061515 GO:0061515
GO:0010506 GO:0010506
Description
GO:0061644 protein localization to CENP-A containing chromatin
GO:0006334 nucleosome assembly
GO:0034728 nucleosome organization
GO:0030099 myeloid cell differentiation
GO:0061515 myeloid cell development
GO:0010506 regulation of autophagy
GeneRatio BgRatio pvalue p.adjust qvalue
GO:0061644 15/2235 18/18870 6.981139e-12 4.162155e-08 3.859468e-08
GO:0006334 42/2235 118/18870 1.601777e-11 4.774896e-08 4.427648e-08
GO:0034728 44/2235 138/18870 3.343898e-10 6.645439e-07 6.162158e-07
GO:0030099 95/2235 430/18870 9.466825e-10 1.411030e-06 1.308415e-06
GO:0061515 29/2235 85/18870 6.333996e-08 6.752295e-05 6.261243e-05
GO:0010506 77/2235 355/18870 8.253368e-08 6.752295e-05 6.261243e-05
可视化:
dotplot(ego, showCategory=20)+
theme(axis.text.y = element_text(size=15), axis.text.x = element_text(size = 15), legend.text = element_text(size=15),
panel.grid.major = element_blank(),panel.grid.minor = element_blank())
也可以使用barplot来展示:
barplot(ego,showCategory=20)+
theme(axis.text.y = element_text(size=15), axis.text.x = element_text(size = 15),
legend.text = element_text(size=15), panel.grid = element_blank())
还可以自定义展示FDR:
ggplot(ego[1:20,], aes(x=-log(p.adjust, base=10), y=reorder(Description,p.adjust,decreasing = T))) +
geom_bar(stat = "identity", fill="red") +
theme_bw() +
xlab(label='-log10 FDR') +
ylab(label='') +
theme(axis.text.y = element_text(size=12,colour="black"), axis.text.x = element_text(size = 15,colour="black"),
panel.grid= element_blank())
kk <- enrichKEGG(gene = gene.df$ENTREZID,
organism = 'hsa',
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,)
head(kk[,1:7])
category subcategory
hsa04613 Organismal Systems Immune system
hsa05034 Human Diseases Substance dependence
hsa05322 Human Diseases Immune disease
hsa04010 Environmental Information Processing Signal transduction
hsa05231 Human Diseases Cancer: overview
hsa05202 Human Diseases Cancer: overview
ID Description GeneRatio
hsa04613 hsa04613 Neutrophil extracellular trap formation 75/1133
hsa05034 hsa05034 Alcoholism 72/1133
hsa05322 hsa05322 Systemic lupus erythematosus 59/1133
hsa04010 hsa04010 MAPK signaling pathway 69/1133
hsa05231 hsa05231 Choline metabolism in cancer 29/1133
hsa05202 hsa05202 Transcriptional misregulation in cancer 46/1133
BgRatio pvalue
hsa04613 192/8842 1.601158e-20
hsa05034 188/8842 3.605844e-19
hsa05322 139/8842 1.959772e-18
hsa04010 300/8842 5.497397e-07
hsa05231 99/8842 1.043840e-05
hsa05202 193/8842 1.640314e-05
需要注意的是,在分析GO和KEGG富集的时候,需要根据情况设定背景基因集,在enrichGO
和enrichKEGG
中,默认的是数据库中所有的基因(All genes in the TERM2GENE table)。
MSigDB是一个Molucular Signatures Database
, 也包含了多种数据库通路的基因集,如KEGG
,‘GO’,‘BIOCARTA’, REACTOME
等。
x <- msigdbr_collections()
print(x, n = 1000)
# A tibble: 23 × 3
gs_cat gs_subcat num_genesets
<chr> <chr> <int>
1 C1 "" 299
2 C2 "CGP" 3384
3 C2 "CP" 29
4 C2 "CP:BIOCARTA" 292
5 C2 "CP:KEGG" 186
6 C2 "CP:PID" 196
7 C2 "CP:REACTOME" 1615
8 C2 "CP:WIKIPATHWAYS" 664
9 C3 "MIR:MIRDB" 2377
10 C3 "MIR:MIR_Legacy" 221
11 C3 "TFT:GTRD" 518
12 C3 "TFT:TFT_Legacy" 610
13 C4 "CGN" 427
14 C4 "CM" 431
15 C5 "GO:BP" 7658
16 C5 "GO:CC" 1006
17 C5 "GO:MF" 1738
18 C5 "HPO" 5071
19 C6 "" 189
20 C7 "IMMUNESIGDB" 4872
21 C7 "VAX" 347
22 C8 "" 700
23 H "" 50
还是选择其中的KEGG
进行分析:
# A tibble: 6 × 15
gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene
<chr> <chr> <chr> <chr> <int> <chr>
1 C2 CP:KEGG KEGG_ABC_TRAN… ABCA1 19 ENSG0000016…
2 C2 CP:KEGG KEGG_ABC_TRAN… ABCA10 10349 ENSG0000015…
3 C2 CP:KEGG KEGG_ABC_TRAN… ABCA12 26154 ENSG0000014…
4 C2 CP:KEGG KEGG_ABC_TRAN… ABCA13 154664 ENSG0000017…
5 C2 CP:KEGG KEGG_ABC_TRAN… ABCA2 20 ENSG0000010…
6 C2 CP:KEGG KEGG_ABC_TRAN… ABCA3 21 ENSG0000016…
# ℹ 9 more variables: human_gene_symbol <chr>,
# human_entrez_gene <int>, human_ensembl_gene <chr>, gs_id <chr>,
# gs_pmid <chr>, gs_geoid <chr>, gs_exact_source <chr>,
# gs_url <chr>, gs_description <chr>
制作Term to Gene
表:
KEGG_DB_t2g <- KEGG_DB %>% dplyr::distinct(gs_name, entrez_gene) %>% as.data.frame()
head(KEGG_DB_t2g)
gs_name entrez_gene
1 KEGG_ABC_TRANSPORTERS 19
2 KEGG_ABC_TRANSPORTERS 10349
3 KEGG_ABC_TRANSPORTERS 26154
4 KEGG_ABC_TRANSPORTERS 154664
5 KEGG_ABC_TRANSPORTERS 20
6 KEGG_ABC_TRANSPORTERS 21
进行富集分析:
upKEGG2 <- enricher(gene = gene.df$ENTREZID, universe = backgroud, TERM2GENE = KEGG_DB_t2g, pvalueCutoff = 0.05)
upKEGG2 <- setReadable(upKEGG2, OrgDb = 'org.Hs.eg.db', keyType = 'ENTREZID')
head(upKEGG2[,1:7])
ID
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS
KEGG_GLYCEROPHOSPHOLIPID_METABOLISM KEGG_GLYCEROPHOSPHOLIPID_METABOLISM
Description
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS
KEGG_GLYCEROPHOSPHOLIPID_METABOLISM KEGG_GLYCEROPHOSPHOLIPID_METABOLISM
GeneRatio BgRatio pvalue
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS 59/722 117/4402 5.707778e-18
KEGG_GLYCEROPHOSPHOLIPID_METABOLISM 24/722 70/4402 1.798461e-04
p.adjust qvalue
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS 1.010277e-15 9.673182e-16
KEGG_GLYCEROPHOSPHOLIPID_METABOLISM 1.591638e-02 1.523959e-02
可视化:
dotplot(upKEGG2,showCategory=20)+
theme(axis.text.y = element_text(size=15), axis.text.x = element_text(size = 15), legend.text = element_text(size=15),
panel.grid.major = element_blank(),panel.grid.minor = element_blank())
GSEA是广泛使用的通路富集分析方法,当样本的生物学重复数较少时,可以按照差异表达水平对基因进行排序,然后检验给定的基因集是否随机分布在这个排序上。如果不随机,说明显著性富集。
fc = data.frame(gene_symbol = rownames(res), log2FoldChange = res$log2FoldChange)
fc = filter(fc, !is.na(log2FoldChange))
head(fc)
gene_symbol log2FoldChange
1 A1BG 0.56964866
2 A1BG-AS1 -0.01941167
3 A1CF -0.01437268
4 A2M -1.59627245
5 A2M-AS1 -0.96885522
6 A2ML1 -0.30493344
write.table(fc, file='training_stroke_vs_control_DESeq2_log2FC.rnk', sep='\t', col.names=F, row.names=F, quote=F)
在命令行运行GSEA:
/sibcb1/bioinformatics/biocourse/biocourse01/software/GSEA_Linux_4.3.3/gsea-cli.sh GSEAPreranked \
-rnk training_stroke_vs_control_DESeq2_log2FC.rnk \
-gmx c2.cp.kegg_medicus.v2023.2.Hs.symbols.gmt \
-collapse false \
-scoring_scheme weighted \
-nperm 1000 \
-rpt_label kegg \
-make_sets true \
-plot_top_x 20 \
-set_max 500 -set_min 15 \
-zip_report false \
-create_svgs true \
-out gsea
分析结果在gsea
文件夹中。