使用Seurat进行差异基因表达分析

全局设定

加载R包:

在这里,我们通过两个示例数据集,展示scRNA-seq差异基因表达分析的基本方法,并通过热图展示结果。

单样本分析

数据加载

col_palatte <- c('#6965CE', '#EE5755', '#75E15A', '#D5CE85', '#9F33D9', '#9BE2CF', '#69A5C8', 
    '#FC9500', '#D846AF', '#CE95C4', '#0018F9')
pbmc = readRDS("output/pbmc.RDS")
DimPlot(pbmc, label = F, pt.size = 0.0001, cols = col_palatte)

One versus all

我们推荐使用秩和检验对单细胞数据进行差异表达分析:test.use = “wilcox”

m1 = FindMarkers(pbmc, ident.1 = "FCR3A+ Monocytes", only.pos = TRUE, test.use = "wilcox")
m1 %>% head() %>% arrange(-pct.1)
                      p_val avg_log2FC pct.1 pct.2     p_val_adj
FCGR3A        3.511192e-184   4.064902 0.975 0.134 4.815249e-180
RP11-290F20.3 8.429376e-200   4.063794 0.840 0.068 1.156005e-195
MS4A7         1.710321e-186   4.387850 0.809 0.073 2.345534e-182
HES4          1.516412e-207   4.993200 0.586 0.020 2.079608e-203
CDKN1C        1.689064e-212   5.428812 0.506 0.010 2.316382e-208
CKB           8.230606e-168   5.881212 0.370 0.005 1.128745e-163

这里我们想比较FCR3A+ Monocytes和其他细胞之间表达差异,也就是One versus all,可以看到FCGR3A是差异表达最显著的基因之一。默认情况下,ident.1提取的是Idents的结果,我们也可以认为指定其他的变量。

table(pbmc[[]] %>% pull(seurat_clusters), pbmc %>% Idents)
   
    Naive T cells CD14+ Monocytes Memory CD4+ T cells   B CD8+ T
  0           684               0                   0   0      0
  1             0             481                   0   0      0
  2             0               0                 476   0      0
  3             0               0                   0 344      0
  4             0               0                   0   0    291
  5             0               0                   0   0      0
  6             0               0                   0   0      0
  7             0               0                   0   0      0
  8             0               0                   0   0      0
   
    FCR3A+ Monocytes  NK  DC Platelet
  0                0   0   0        0
  1                0   0   0        0
  2                0   0   0        0
  3                0   0   0        0
  4                0   0   0        0
  5              162   0   0        0
  6                0 155   0        0
  7                0   0  32        0
  8                0   0   0       13
m2 = FindMarkers(pbmc, ident.1 = 5, group.by = 'seurat_clusters', only.pos = TRUE, test.use = "wilcox")
m2 %>% head() %>% arrange(-pct.1)
                      p_val avg_log2FC pct.1 pct.2     p_val_adj
FCGR3A        3.511192e-184   4.064902 0.975 0.134 4.815249e-180
RP11-290F20.3 8.429376e-200   4.063794 0.840 0.068 1.156005e-195
MS4A7         1.710321e-186   4.387850 0.809 0.073 2.345534e-182
HES4          1.516412e-207   4.993200 0.586 0.020 2.079608e-203
CDKN1C        1.689064e-212   5.428812 0.506 0.010 2.316382e-208
CKB           8.230606e-168   5.881212 0.370 0.005 1.128745e-163

这里我们比较了meta.data的’seurat_clusters’这一列所定义的cluster 5(FCR3A+ Monocytes)和其他细胞之间的差异基因表达。可以看出,我们得到了完全相同的结果。

One versus another

m3 = FindMarkers(pbmc, ident.1 = "Naive T cells", ident.2 = 'Memory CD4+ T cells', only.pos = FALSE, test.use = "wilcox")
m3 %>% filter(rownames(m3) %in% c('CCR7', 'S100A4'))
              p_val avg_log2FC pct.1 pct.2    p_val_adj
S100A4 9.932175e-69  -1.440462 0.680 0.952 1.362099e-64
CCR7   6.200586e-16   1.256239 0.447 0.252 8.503484e-12

可以看出,CCR7是Naive T cells的marker,相对于Memory CD4+ T cells是高表达的。反过来,S100A4是Memory CD4+ T cells的marker,在其中高表达。

多样本分析

这里我们分析了小鼠Nash模型的肝组织单细胞测序数据GSE129516,包含6个样本,3个处理组(nash),三个对照组(chow)。

数据加载

mouse_harmony = readRDS("output/mouse_harmony.RDS")
mouse_harmony = JoinLayers(mouse_harmony)
DimPlot(mouse_harmony, label = T, pt.size = 0.0001, cols = col_palatte)

比较两种条件

文章的Figure 3D中,作者比较了’Endo’细胞在两种状态下基因表达谱的变化,我们看看能否观测到类似的结果。

endoDEG = c('Fabp4', 'Cd36', 'Cxcl9', 'Bmp2', 'Nrp1', 'Lifr')
m4 = FindMarkers(mouse_harmony, ident.1 = "nash", ident.2 = "chow", group.by = 'sample', subset.ident = "Endo")
m4[endoDEG, ]
              p_val avg_log2FC pct.1 pct.2     p_val_adj
Fabp4  0.000000e+00  2.6730162 0.986 0.579  0.000000e+00
Cd36   0.000000e+00  1.1905355 0.927 0.588  0.000000e+00
Cxcl9  3.065900e-96  1.3880939 0.471 0.279  5.932210e-92
Bmp2  1.306220e-175 -0.8483164 0.772 0.893 2.527404e-171
Nrp1   0.000000e+00 -1.0563196 0.844 0.964  0.000000e+00
Lifr  6.993346e-301 -1.1174045 0.671 0.905 1.353143e-296

查看这些基因:

FeaturePlot(mouse_harmony, features = 'Fabp4', split.by = 'sample', cols=c('grey95', 'red3'), label=F, pt.size=0.001) & 
  theme(legend.position = c(0.05,0.8))

Heatmap

我们可以使用DoHeatmap方便的展示热图。需要注意的是,该函数默认展示的是scale.data这个slot,更清晰地展示相对表达变化,也能改成data

多细胞类别

mx = FindMarkers(pbmc, ident.1 = "Memory CD4+ T cells", only.pos = TRUE, test.use = "wilcox")
top20 = rownames(mx[1:20, ])
DoHeatmap(pbmc, features = top20, label = FALSE) + theme(legend.position="bottom")

多实验条件

mx = FindMarkers(mouse_harmony, ident.1 = "nash", ident.2 = "chow", group.by = 'sample', subset.ident = "Endo")
gx = mx %>% filter(avg_log2FC > 0, p_val_adj < 0.05) %>% head(20)
cx = mouse_harmony[[]] %>% filter(cellType == 'Endo', orig.ident %in% c('chow1', 'nash1'))
DoHeatmap(mouse_harmony, features = rownames(gx), cells = rownames(cx), group.by = "sample") + theme(legend.position="bottom")

VlnPlot

首先我们通过提取一部分数据画图,只展示Endo细胞:

Endo = mouse_harmony %>% subset(idents = 'Endo')
VlnPlot(Endo, features = endoDEG, cols = col_palatte, group.by = 'orig.ident', stack = T, flip = TRUE) + 
   theme(legend.position = 'None') + labs(title = NULL)

我们还可以展示所有的细胞类型:

VlnPlot(mouse_harmony, features = endoDEG[1:2], cols = col_palatte, group.by = 'cellType', stack = F, flip = TRUE, 
  split.by = 'sample', ncol = 1, split.plot = TRUE) + theme(legend.position = 'bottom')

sessionInfo

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.7 (Maipo)

Matrix products: default
BLAS/LAPACK: /sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/envs/scRNA/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] patchwork_1.2.0    reticulate_1.35.0  harmony_1.2.0     
 [4] Rcpp_1.0.12        Seurat_5.0.3       SeuratObject_5.0.2
 [7] sp_2.1-3           dplyr_1.1.4        data.table_1.15.2 
[10] ggplot2_3.5.0      knitr_1.46         rmarkdown_2.26    

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3     rstudioapi_0.16.0     
  [3] jsonlite_1.8.8         magrittr_2.0.3        
  [5] spatstat.utils_3.0-5   farver_2.1.1          
  [7] fs_1.6.3               vctrs_0.6.5           
  [9] ROCR_1.0-11            memoise_2.0.1         
 [11] spatstat.explore_3.2-6 usethis_2.2.3         
 [13] htmltools_0.5.8.1      curl_5.1.0            
 [15] sass_0.4.9             sctransform_0.4.1     
 [17] parallelly_1.37.1      KernSmooth_2.23-22    
 [19] bslib_0.7.0            desc_1.4.3            
 [21] htmlwidgets_1.6.4      ica_1.0-3             
 [23] fontawesome_0.5.2      plyr_1.8.9            
 [25] plotly_4.10.4          zoo_1.8-12            
 [27] cachem_1.0.8           igraph_2.0.2          
 [29] mime_0.12              lifecycle_1.0.4       
 [31] pkgconfig_2.0.3        Matrix_1.6-5          
 [33] R6_2.5.1               fastmap_1.1.1         
 [35] fitdistrplus_1.1-11    future_1.33.2         
 [37] shiny_1.8.1.1          digest_0.6.35         
 [39] colorspace_2.1-0       ps_1.7.7              
 [41] tensor_1.5             RSpectra_0.16-1       
 [43] irlba_2.3.5.1          pkgload_1.3.4         
 [45] labeling_0.4.3         progressr_0.14.0      
 [47] fansi_1.0.6            spatstat.sparse_3.0-3 
 [49] httr_1.4.7             polyclip_1.10-6       
 [51] abind_1.4-5            compiler_4.3.3        
 [53] remotes_2.5.0          withr_3.0.0           
 [55] fastDummies_1.7.3      pkgbuild_1.4.4        
 [57] highr_0.10             MASS_7.3-60           
 [59] sessioninfo_1.2.2      tools_4.3.3           
 [61] lmtest_0.9-40          httpuv_1.6.15         
 [63] future.apply_1.11.2    goftest_1.2-3         
 [65] glue_1.7.0             callr_3.7.6           
 [67] nlme_3.1-164           promises_1.3.0        
 [69] grid_4.3.3             Rtsne_0.17            
 [71] cluster_2.1.6          reshape2_1.4.4        
 [73] generics_0.1.3         gtable_0.3.4          
 [75] spatstat.data_3.0-4    tidyr_1.3.1           
 [77] utf8_1.2.4             spatstat.geom_3.2-9   
 [79] RcppAnnoy_0.0.22       ggrepel_0.9.5         
 [81] RANN_2.6.1             pillar_1.9.0          
 [83] stringr_1.5.1          spam_2.10-0           
 [85] RcppHNSW_0.6.0         limma_3.58.1          
 [87] later_1.3.2            splines_4.3.3         
 [89] lattice_0.22-6         survival_3.5-8        
 [91] deldir_2.0-4           tidyselect_1.2.1      
 [93] miniUI_0.1.1.1         pbapply_1.7-2         
 [95] downlit_0.4.4          gridExtra_2.3         
 [97] bookdown_0.40          distill_1.6           
 [99] scattermore_1.2        xfun_0.43             
[101] statmod_1.5.0          devtools_2.4.5        
[103] matrixStats_1.1.0      stringi_1.8.3         
[105] lazyeval_0.2.2         yaml_2.3.8            
[107] evaluate_0.23          codetools_0.2-20      
[109] tibble_3.2.1           cli_3.6.2             
[111] uwot_0.1.16            xtable_1.8-4          
[113] processx_3.8.4         munsell_0.5.1         
[115] jquerylib_0.1.4        globals_0.16.3        
[117] spatstat.random_3.2-3  png_0.1-8             
[119] parallel_4.3.3         ellipsis_0.3.2        
[121] presto_1.0.0           dotCall64_1.1-1       
[123] profvis_0.3.8          urlchecker_1.0.1      
[125] listenv_0.9.1          viridisLite_0.4.2     
[127] scales_1.3.0           ggridges_0.5.6        
[129] leiden_0.4.3.1         purrr_1.0.2           
[131] rlang_1.1.3            cowplot_1.1.3