本课程主要学习如何使用SCENIC从scRNA-seq中构建转录调控网络并衡量转录因子活性的方法。SCENIC有不同语言的实现,我们主要学习pySCENIC的流程及后续处理方法。
我们使用了论文PMID: 33033240中的数据,并从补充材料中下载了成纤维细胞的数据进行了示例。
为了方便,大家可以直接建立数据的软链接:
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/pyscenic/data .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/pyscenic/db .
# bash: 加载python环境
source /sibcb1/bioinformatics/training_shared/software/python3.8_profile
source /sibcb1/bioinformatics/training_shared/software/pyscenic/bin/activate
# python: read csv file
import loompy as lp
import numpy as np
import pandas as pd
= pd.read_csv('data/fibo_count.csv', index_col = 0, header = 0) # cells in row and genes in columns
x 'display.max_columns', 5)
pd.set_option( x.head()
-2HG FAM138A ... AC007325.4 AC007325.2
MIR1302-1 0 0 ... 0 0
case1_AAACCTGAGACAGAGA-1 0 0 ... 0 0
case1_AAACCTGGTAGTACCT-1 0 0 ... 0 0
case1_AAAGCAAAGCCTTGAT-1 0 0 ... 1 0
case1_AAAGTAGCACAACTGT-1 0 0 ... 0 0 case1_AAAGTAGGTCCATCCT
# python: convert csv to loom
= {"Gene": np.array(x.columns.astype(str))}
row_attrs = {"CellID": np.array(x.index.astype(str))}
col_attrs = np.transpose(x.to_numpy())
y "data/fibo_counts.loom", y, row_attrs,col_attrs) lp.create(
pySCENIC分析流程使用了GENIE3/grnboost2算法来构建转录调控网络。我们可以使用利用GENIE3 R包来熟悉这个步骤。
注意:这一步的输入文件推荐使是没有经过标准化的raw counts。依据文档,TPM/FPKM/RPKM也能得到非常类似的结果。
exprMatr <- matrix(sample(1:10, 100, replace=TRUE), nrow=20)
rownames(exprMatr) <- c(paste0("TF", 1:3), paste0("Gene", 4:20))
colnames(exprMatr) <- paste0("Sample", 1:5)
head(exprMatr)
Sample1 Sample2 Sample3 Sample4 Sample5
TF1 1 1 3 8 9
TF2 9 10 5 10 10
TF3 6 7 6 8 1
Gene4 4 8 7 10 10
Gene5 1 3 2 9 8
Gene6 7 4 7 6 2
GENIE3使用了Random Forest(RF)
、Extra-Trees(ET)
等多种分类方法来构建网络,默认方法是RF。另外一个参数是K,设定的是每个节点上放置的转录因子的数目,默认是总数的平方根(sqrt)。构建调控网络:
regulators <- c("TF1", "TF2", "TF3")
weightMat <- GENIE3(exprMatr, regulators=regulators, treeMethod = "RF", K = "sqrt", nTrees = 1000)
t(weightMat)
TF1 TF2 TF3
Gene10 0.26046080 0.6432276 0.09631159
Gene11 0.30323574 0.4970695 0.19969479
Gene12 0.41477688 0.1925124 0.39271068
Gene13 0.64191006 0.1856691 0.17242081
Gene14 0.24151000 0.1830737 0.57541630
Gene15 0.30631861 0.4820262 0.21165518
Gene16 0.28716345 0.4159889 0.29684762
Gene17 0.07878808 0.1973863 0.72382561
Gene18 0.33470983 0.2784497 0.38684050
Gene19 0.61899908 0.2358820 0.14511894
Gene20 0.09446144 0.5445653 0.36097327
Gene4 0.45961365 0.4094769 0.13090947
Gene5 0.62070699 0.1989046 0.18038843
Gene6 0.30738421 0.2406538 0.45196194
Gene7 0.18644652 0.5819651 0.23158834
Gene8 0.32860487 0.2737631 0.39763207
Gene9 0.28090658 0.2250963 0.49399716
TF1 0.00000000 0.2934562 0.70654381
TF2 0.58790338 0.0000000 0.41209662
TF3 0.95691390 0.0430861 0.00000000
提取其中的网络连接:
linkList <- getLinkList(weightMat, reportMax = 5, threshold = 0) %>% arrange(regulatoryGene)
linkList
regulatoryGene targetGene weight
1 TF1 TF3 0.9569139
2 TF1 Gene13 0.6419101
3 TF2 Gene10 0.6432276
4 TF3 Gene17 0.7238256
5 TF3 TF1 0.7065438
建立一个bash脚本,包含如下代码,这一步运行时间较长:
/sibcb1/bioinformatics/training_shared/software/python3.8_profile
source /sibcb1/bioinformatics/training_shared/software/pyscenic/bin/activate
source
--seed 21 --num_workers 12 --output out/fibo_grn.tsv --method grnboost2 data/fibo_count.loom data/allTFs_hg38.txt
pyscenic grn
deactivate
注意:在这里我们使用默认的方法grnboost2
而不是genie3
,前者在准确性方面和genie3
类似,但速度更快。结束后可以得到转录调控网络:
head data/fibo_grn.tsv
TF target importance
TFAP4 HBA1 22.02777746024214
ZNF681 KCNT2 13.590138401609611
RPS10 NACA 12.71733203305088
RPL6 NACA 12.223798753741157
RPS10 RPS24 11.874945237671447
RPS10 EEF1G 11.591269739813406
RPS4X EEF1G 11.271761373354874
RPL35 RPS24 10.933359810430666
CELF4 SKIDA1 10.738858077268448
这一步所采用的方法见参考文献。步骤大致如下:
运行这一步需要的数据库文件可以从cistargetDBs website下载:
建立一个bash脚本,包含如下代码:
source /sibcb1/bioinformatics/training_shared/software/python3.8_profile
source /sibcb1/bioinformatics/training_shared/software/pyscenic/bin/activate
pyscenic ctx --annotations_fname db/motif/motifs-v9-nr.hgnc-m0.001-o0.0.tbl \
--expression_mtx_fname data/fibo_count.loom \
--mode dask_multiprocessing \
--output out/fibo_ctx.gmt \
--num_workers 12 \
--mask_dropouts \
\
data/fibo_grn.tsv
db/hg38_refseq-r80_10kb_up_and_down_tss.mc9nr.genes_vs_motifs.rankings.featherdeactivate
进行Motif分析时需要关注的几个参数如下:
--auc_threshold
: 有多少比例的基因参与AUC的计算,默认0.05。--nes_threshold
: NES阈值,NES超过这个阈值的Motif会保留参与后续分析,默认是3。--rank_threshold
: 有多少基因参与靶基因的定义,默认是5000。mask_dropouts
: 当转录因子或者靶基因的表达量为0,那么定义成dropouts。定义转录调控模块时:
--thresholds
: 第一种方法,对每个转录因子,选取重要性排在前面的75%的靶基因。--top_n_targets
: 第二种方法,对每个转录因子选取前50。--top_n_regulators
: 第三种方法,对每个基因选取5个转录因子建立一个bash脚本,包含如下代码:
source /sibcb1/bioinformatics/training_shared/software/python3.8_profile
source /sibcb1/bioinformatics/training_shared/software/pyscenic/bin/activate
pyscenic aucell --seed 21 --output out/fibo_aucell.csv --num_workers 12 data/fibo_count.loom data/fibo_ctx.gmt
deactivate
我们将使用系统内安装的R:
# bash命令
# export PATH="/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/envs/scRNA/bin:$PATH"
source /sibcb/program/install/r-4.0/profile
R
加载R包:
.libPaths('/sibcb1/bioinformatics/training_shared/software/rlib/r-4.0_lib')
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(pheatmap))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(SCENIC))
suppressPackageStartupMessages(library(AUCell))
suppressPackageStartupMessages(library(GENIE3))
# 细胞类型信息
fibo_obj <- readRDS('data/fibo_sce.rds')
anno_col = fibo_obj[[]] %>% filter(celltype %in% c('iCAF', 'mCAF')) %>% arrange(celltype) %>% select(celltype)
head(anno_col)
celltype
case1_AAACCTGGTAGTACCT-1 iCAF
case1_AAAGCAAAGCCTTGAT-1 iCAF
case1_AAAGTAGGTCCATCCT-1 iCAF
case1_AAATGCCGTTCCACTC-1 iCAF
case1_AACTCCCAGTAGTGCG-1 iCAF
case1_AACTCTTCAGGTTTCA-1 iCAF
# 转录因子活性
auc <- read.table('data/fibo_aucell.csv',header=T,row.names=1,sep=',',stringsAsFactors=F,check.names=F)
auc <- t(auc)
rownames(auc) <- gsub(',','', rownames(auc))
auc <- auc[,rownames(anno_col)]
auc[1:10,1:4]
case1_AAACCTGGTAGTACCT-1 case1_AAAGCAAAGCCTTGAT-1
AHR(+) 0.06809350 0.07413331
ARID3A(+) 0.12661524 0.11878014
ARNT2(+) 0.00000000 0.00000000
ARNTL(+) 0.06979274 0.04075715
ATF1(+) 0.08598658 0.06950509
ATF2(+) 0.03840505 0.03471995
ATF3(+) 0.14634491 0.14841885
ATF4(+) 0.13699868 0.15960309
ATF6B(+) 0.05754098 0.06040984
ATF7(+) 0.06122404 0.02662295
case1_AAAGTAGGTCCATCCT-1 case1_AAATGCCGTTCCACTC-1
AHR(+) 0.07159951 0.07028004
ARID3A(+) 0.05228222 0.09272742
ARNT2(+) 0.00000000 0.00428051
ARNTL(+) 0.06284645 0.04679269
ATF1(+) 0.05903970 0.06791495
ATF2(+) 0.03676230 0.03094945
ATF3(+) 0.15400591 0.15518672
ATF4(+) 0.17246902 0.15633782
ATF6B(+) 0.07336066 0.03784153
ATF7(+) 0.05648087 0.00000000
# 展示示例转录因子
tf <- c('MITF(+)','TWIST2(+)','SOX4(+)','STAT5A(+)','FOXP2(+)','TBX2(+)','NR2F2(+)','EBF1(+)','SP2(+)')
subauc <- auc[tf,]
subauc <- t(scale(t(subauc))) # 需要对每个转录因子进行scale,因为scale函数是对列进行操作。
subauc[subauc > 2] <- 2
subauc[subauc < -2] <- -2
subauc[1:5,1:5]
case1_AAACCTGGTAGTACCT-1 case1_AAAGCAAAGCCTTGAT-1
MITF(+) 0.11425606 0.66205324
TWIST2(+) 0.31120910 0.34109757
SOX4(+) 0.04375064 -0.37166083
STAT5A(+) 0.38263889 0.02956138
FOXP2(+) 1.10291254 1.40900673
case1_AAAGTAGGTCCATCCT-1 case1_AAATGCCGTTCCACTC-1
MITF(+) -0.3632182 0.84958091
TWIST2(+) -0.1603454 0.99027496
SOX4(+) -0.7082675 1.21529322
STAT5A(+) 0.8568390 -0.02374607
FOXP2(+) 1.5342786 1.95459129
case1_AACTCCCAGTAGTGCG-1
MITF(+) -0.1024826
TWIST2(+) -0.6749393
SOX4(+) -0.4833695
STAT5A(+) 0.1545640
FOXP2(+) 0.6348435
ann_colors <- list(celltype = c(iCAF="red", mCAF="blue"))
pheatmap(subauc, cluster_cols=F, annotation_col=anno_col, fontsize=13, show_colnames=F,
color=colorRampPalette(colors=c("blue","yellow"))(10), annotation_colors=ann_colors)
par(mfrow=c(5,2))
cells_assignment <- AUCell_exploreThresholds(auc[tf, ], plotHist=T, assignCells=T)
lapply(cells_assignment[["EBF1(+)"]], head)
$aucThr
$aucThr$selected
minimumDens
0.1034299
$aucThr$thresholds
threshold nCells
Global_k1 0.12913467 1108
L_k2 0.11513115 1552
R_k3 0.05589582 3739
minimumDens 0.10342991 1856
$aucThr$comment
[1] ""
$assignment
[1] "case1_CACCTTGCACGAAACG-1" "case1_CGGTTAAAGATGTTAG-1"
[3] "case1_TACACGAAGGGTCTCC-1" "case1_TAGACCAGTTCAACCA-1"
[5] "case1_TGAGCATAGGTGCTTT-1" "case1_TGCGTGGCAACGATCT-1"
理想情况下,转录因子在一部分细胞中是激活的,在一部分细胞中是抑制的,那么总体上AUC是两个分布的混合,我们会观测到双峰分布。在这些图中,蓝色垂直线标记的双峰分布的转折点(Inflection point),属于比较好的cutoff。在实际数据中,并不是所有的转录因子的AUC值都呈现明显的双峰分布。
红色的和粉红色曲线分别显示的是估计出来的左边的分布和右边的分布,一般假设左边的分布是转录因子低活性的部分,根据AUCell_exploreThresholds
的thrp
参数(默认值是0.01)确定转录因子高活性的cutoff,显示为红色的垂直线。
example = as.matrix(data.frame(TF1 = as.numeric(anno_col$celltype == 'iCAF'),
TF2 = as.numeric(anno_col$celltype == 'iCAF')*0.3,
TF3 = 1,
TF4 = abs(rnorm(nrow(anno_col)))))
rownames(example) = rownames(anno_col)
head(example)
TF1 TF2 TF3 TF4
case1_AAACCTGGTAGTACCT-1 1 0.3 1 0.5107099
case1_AAAGCAAAGCCTTGAT-1 1 0.3 1 1.9146542
case1_AAAGTAGGTCCATCCT-1 1 0.3 1 0.4721866
case1_AAATGCCGTTCCACTC-1 1 0.3 1 0.3832689
case1_AACTCCCAGTAGTGCG-1 1 0.3 1 0.7777998
case1_AACTCTTCAGGTTTCA-1 1 0.3 1 1.7381543
iCAF mCAF
TF1 1.0000000 0.0000000
TF2 1.0000000 0.0000000
TF3 0.3804296 0.5040529
TF4 0.3339728 0.4333758
mCAF
的细胞数更多,所以RSS值更大。RSS不是一个理想的统计量,推荐只在同一个细胞类型内用RSS对转录因子进行排序,并结合转录因子的平均AUC来筛选重要的结果。
rss <- calcRSS(AUC=auc, cellAnnotation=anno_col[,"celltype"])
t1 = data.frame(rss) %>% arrange(-iCAF) %>% head(5)
t2 = data.frame(rss) %>% arrange(-mCAF) %>% head(5)
rbind(t1, t2)
iCAF mCAF
MITF(+) 0.5288740 0.2753520
TWIST2(+) 0.5106512 0.3505843
SOX4(+) 0.4796627 0.3833229
STAT5A(+) 0.4781993 0.3259399
FOXP2(+) 0.4778039 0.3883962
TBX2(+) 0.1494905 0.6691430
HES5(+) 0.2104092 0.6267840
NR2F2(+) 0.2414395 0.6266321
EBF1(+) 0.2798018 0.5937056
SP2(+) 0.2824488 0.5831278
plotRSS_oneSet(rss,'iCAF') + theme(text=element_text(size=15), axis.text=element_text(size=13))
plotRSS_oneSet(rss,'mCAF') + theme(text=element_text(size=15), axis.text=element_text(size=13))
tfname <- gsub('(+)','_tf',rownames(auc), fixed=T)
fibo_obj <- AddMetaData(fibo_obj, t(auc), make.names(tfname))
Idents(fibo_obj) <- 'celltype'
fibo_obj@meta.data[1:5,1:10]
orig.ident nCount_RNA nFeature_RNA
case1_AAACCTGAGACAGAGA-1 case1 5539 1839
case1_AAACCTGGTAGTACCT-1 case1 6779 2205
case1_AAAGCAAAGCCTTGAT-1 case1 5887 2144
case1_AAAGTAGCACAACTGT-1 case1 29771 5408
case1_AAAGTAGGTCCATCCT-1 case1 6861 2360
percent.mt RNA_snn_res.0.05 seurat_clusters
case1_AAACCTGAGACAGAGA-1 0 0 0
case1_AAACCTGGTAGTACCT-1 0 3 3
case1_AAAGCAAAGCCTTGAT-1 0 3 3
case1_AAAGTAGCACAACTGT-1 0 0 0
case1_AAAGTAGGTCCATCCT-1 0 3 3
celltype AHR_tf ARID3A_tf ARNT2_tf
case1_AAACCTGAGACAGAGA-1 mCAF 0.04619708 0.04413372 0
case1_AAACCTGGTAGTACCT-1 iCAF 0.06809350 0.12661524 0
case1_AAAGCAAAGCCTTGAT-1 iCAF 0.07413331 0.11878014 0
case1_AAAGTAGCACAACTGT-1 mCAF 0.05245754 0.09525072 0
case1_AAAGTAGGTCCATCCT-1 iCAF 0.07159951 0.05228222 0
p1 <- DimPlot(fibo_obj, pt.size=1.5, label=T, label.size=5) + theme(text=element_text(size=18), legend.position='none')
p2 <- FeaturePlot(fibo_obj, features='MITF_tf', pt.size=1.5) + theme(text=element_text(size=18))
p3 <- FeaturePlot(fibo_obj, features='TBX2_tf', pt.size=1.5) + theme(text=element_text(size=18))
p1 + p2 + p3
运行本教程的R环境:
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.7 (Maipo)
Matrix products: default
BLAS: /sibcb/program/install/r-4.0/lib64/R/lib/libRblas.so
LAPACK: /sibcb/program/install/r-4.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] ggrepel_0.9.1 BiocParallel_1.24.1 knitr_1.48
[4] rmarkdown_2.27 GENIE3_1.12.0 AUCell_1.12.0
[7] SCENIC_1.3.1 dplyr_1.0.9 tidyr_1.2.0
[10] reshape2_1.4.4 pheatmap_1.0.12 ggplot2_3.3.6
[13] sp_1.5-0 SeuratObject_4.1.0 Seurat_4.1.1
loaded via a namespace (and not attached):
[1] plyr_1.8.7 igraph_1.3.4
[3] lazyeval_0.2.2 GSEABase_1.52.1
[5] splines_4.0.0 listenv_0.8.0
[7] scattermore_0.8 GenomeInfoDb_1.26.7
[9] digest_0.6.29 foreach_1.5.2
[11] htmltools_0.5.3 fansi_1.0.3
[13] magrittr_2.0.3 memoise_2.0.1
[15] tensor_1.5 cluster_2.1.4
[17] mixtools_1.2.0 ROCR_1.0-11
[19] globals_0.16.0 annotate_1.68.0
[21] matrixStats_0.62.0 R.utils_2.12.0
[23] spatstat.sparse_2.1-1 colorspace_2.0-3
[25] blob_1.2.3 xfun_0.46
[27] crayon_1.5.1 RCurl_1.98-1.8
[29] jsonlite_1.8.0 graph_1.68.0
[31] progressr_0.10.1 spatstat.data_2.2-0
[33] iterators_1.0.14 survival_3.4-0
[35] zoo_1.8-10 glue_1.6.2
[37] polyclip_1.10-0 gtable_0.3.0
[39] zlibbioc_1.36.0 XVector_0.30.0
[41] leiden_0.4.2 DelayedArray_0.16.3
[43] kernlab_0.9-31 future.apply_1.9.0
[45] BiocGenerics_0.36.1 abind_1.4-5
[47] scales_1.2.1 DBI_1.1.3
[49] spatstat.random_2.2-0 miniUI_0.1.1.1
[51] Rcpp_1.0.9 viridisLite_0.4.1
[53] xtable_1.8-4 reticulate_1.25
[55] spatstat.core_2.4-4 bit_4.0.4
[57] stats4_4.0.0 htmlwidgets_1.5.4
[59] httr_1.4.4 RColorBrewer_1.1-3
[61] ellipsis_0.3.2 ica_1.0-3
[63] farver_2.1.1 pkgconfig_2.0.3
[65] XML_3.99-0.10 R.methodsS3_1.8.2
[67] sass_0.4.2 uwot_0.1.14
[69] deldir_1.0-6 utf8_1.2.2
[71] labeling_0.4.2 tidyselect_1.1.2
[73] rlang_1.0.4 later_1.3.0
[75] AnnotationDbi_1.52.0 munsell_0.5.0
[77] tools_4.0.0 cachem_1.0.6
[79] cli_3.1.0 generics_0.1.3
[81] RSQLite_2.2.16 ggridges_0.5.3
[83] evaluate_0.16 stringr_1.4.1
[85] fastmap_1.1.0 yaml_2.3.5
[87] goftest_1.2-3 bit64_4.0.5
[89] fitdistrplus_1.1-8 purrr_0.3.4
[91] RANN_2.6.1 pbapply_1.5-0
[93] future_1.27.0 nlme_3.1-159
[95] mime_0.12 R.oo_1.25.0
[97] rstudioapi_0.14 compiler_4.0.0
[99] plotly_4.10.0 png_0.1-7
[101] spatstat.utils_2.3-1 tibble_3.1.8
[103] bslib_0.4.0 stringi_1.7.8
[105] highr_0.11 rgeos_0.5-9
[107] lattice_0.20-45 Matrix_1.4-1
[109] vctrs_0.4.1 pillar_1.8.1
[111] lifecycle_1.0.1 jquerylib_0.1.4
[113] spatstat.geom_2.4-0 lmtest_0.9-40
[115] RcppAnnoy_0.0.19 data.table_1.14.2
[117] cowplot_1.1.1 bitops_1.0-7
[119] irlba_2.3.5 httpuv_1.6.3
[121] patchwork_1.1.2 GenomicRanges_1.42.0
[123] R6_2.5.1 bookdown_0.39
[125] promises_1.2.0.1 KernSmooth_2.23-20
[127] gridExtra_2.3 IRanges_2.24.1
[129] parallelly_1.32.1 codetools_0.2-18
[131] distill_1.6 MASS_7.3-58.1
[133] assertthat_0.2.1 fontawesome_0.5.2
[135] SummarizedExperiment_1.20.0 withr_2.5.0
[137] sctransform_0.3.4 S4Vectors_0.28.1
[139] GenomeInfoDbData_1.2.4 mgcv_1.8-40
[141] parallel_4.0.0 grid_4.0.0
[143] rpart_4.1.16 segmented_1.6-0
[145] MatrixGenerics_1.2.1 downlit_0.4.2
[147] Rtsne_0.16 Biobase_2.50.0
[149] shiny_1.7.2