Bioinformatics training course: bulk RNA-seq: (DNB)动态网络标志物分析

全局设定

我们将使用系统内安装的R：

# bash命令
export PATH="/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/envs/scRNA/bin:$PATH"

加载需要使用的软件包：

## devtools::install_github("gpli/DNB")
library(DNB) ##加载R包DNB

表达数据和样本信息准备

# Load data
gene_expr <- read.csv(system.file("extdata", "gene_expr.csv", package = "DNB"), row.names = 1)
sample_info <- read.csv(system.file("extdata", "sample_info.csv", package = "DNB"), row.names = 1)
head(gene_expr[, 1:5])

           s_1         s_2         s_3        s_4         s_5
g_1  86.090028  0.01700653  21.8332070 109.768187  118.198947
g_2  63.159675  0.78383604 641.5469855   1.183683 1047.381835
g_3  27.923068  0.11859829 431.1271239   7.960984   92.287214
g_4 249.504697 28.04153114   0.7283007 126.189366    0.088603
g_5   0.261581  0.23176937   1.9417726  68.252337   16.135678
g_6   0.000000  0.00000000   0.0000000  21.311293    8.737244

head(sample_info)

    time group
s_1    6     0
s_2    6     1
s_3    6     1
s_4    6     0
s_5    3     1
s_6    3     0

# New DNB object
# all-zero genes at any time point will be removed
dnb <- new_DNB(data = gene_expr, time = as.factor(sample_info$time))
dnb

$data
1818 x 74 matrix of doubles: 

               s_1         s_2         s_3 ...        s_74
g_1     86.0900277   0.0170065  21.8332070 ... 112.5990364
g_2     63.1596748   0.7838360 641.5469855 ...   2.2705360
g_3     27.9230676   0.1185983 431.1271239 ...   3.4109098
...            ...         ...         ... ...         ...
g_1873 487.4775445 143.9325181  17.5177474 ... 390.7652305

$time
 [1] 6 6 6 6 3 3 2 2 3 3 4 4 4 4 2 2 2 2 2 2 4 4 6 6 6 6 6 6 6 6 4 4 5
[34] 5 2 2 6 6 3 3 3 3 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 4 4 3 3 5 5 4 4
[67] 4 4 6 6 6 6 6 6
Levels: 2 3 4 5 6

$group
NULL

鉴定DNB

dnb <- cal_cor(dnb) ##计算不同分组样本内基因的相关性系数

dnb <- cal_cv(dnb) ##计算不同分组样本内基因的变异系数

##变异系数（Coefficient of Variation）:标准差与平均数的比
##当需要比较两组数据离散程度大小的时候，如果两组数据的测量尺度相差太大，
##或者数据量纲的不同，直接使用标准差来进行比较不合适，此时就应当消除测量尺度和量纲的影响

dnb <- search_candidates(dnb, min_size = 20, max_size = 100)

time point 2 
Hierarchical clustering genes ...
Calculating module attributes...
time point 3 
Hierarchical clustering genes ...
Calculating module attributes...
time point 4 
Hierarchical clustering genes ...
Calculating module attributes...
time point 5 
Hierarchical clustering genes ...
Calculating module attributes...
time point 6 
Hierarchical clustering genes ...
Calculating module attributes...

## 对基因表达进行聚类分成不同的基因module, 并计算每个gene module的DNB score
## 找到每个时间点DNB Score最高的基因module
## score <- cv_in * cor_in / cor_out
dnb <- cal_final(dnb)
##Compare and get the time point with highest score, and use that set of
##DNB genes to recalculate DNB attibutes for all time points.

candidates <- get_candidates(dnb)
plot_DNB(candidates)

最终结果：

dnb_genes <- get_DNB_genes(dnb)
final <- get_final(dnb)
plot_DNB(final)

设置对照

dnb <- new_DNB(data = gene_expr,
               time = as.factor(sample_info$time),
               group = as.factor(sample_info$group))
dnb <- cal_cor(dnb)

2 (case)
2 (control)
3 (case)
3 (control)
4 (case)
4 (control)
5 (case)
5 (control)
6 (case)
6 (control)

dnb <- cal_cv(dnb)

2 (case)
2 (control)
3 (case)
3 (control)
4 (case)
4 (control)
5 (case)
5 (control)
6 (case)
6 (control)

dnb

$data
1685 x 74 matrix of doubles: 

               s_1         s_2         s_3 ...        s_74
g_1     86.0900277   0.0170065  21.8332070 ... 112.5990364
g_2     63.1596748   0.7838360 641.5469855 ...   2.2705360
g_3     27.9230676   0.1185983 431.1271239 ...   3.4109098
...            ...         ...         ... ...         ...
g_1873 487.4775445 143.9325181  17.5177474 ... 390.7652305

$time
 [1] 6 6 6 6 3 3 2 2 3 3 4 4 4 4 2 2 2 2 2 2 4 4 6 6 6 6 6 6 6 6 4 4 5
[34] 5 2 2 6 6 3 3 3 3 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 4 4 3 3 5 5 4 4
[67] 4 4 6 6 6 6 6 6
Levels: 2 3 4 5 6

$group
 [1] 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 0 1 1 0 0 1 0 1 0
[34] 1 0 1 0 1 1 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0 1 0 1 0 1 0 0 1 0 1
[67] 1 0 1 0 0 1 1 0
Levels: 0 1

$correlation
list of length 5 
[[1]] 
time point "2"
1685 x 1685 matrix of doubles: 

              g_1        g_2        g_3 ...     g_1873
g_1     1.0000000 -0.3939425 -0.1021136 ... -0.4051167
g_2    -0.3939425  1.0000000 -0.5705402 ...  0.9700166
g_3    -0.1021136 -0.5705402  1.0000000 ... -0.4061931
...           ...        ...        ... ...        ...
g_1873 -0.4051167  0.9700166 -0.4061931 ...  1.0000000

...

$correlation_ctrl
list of length 5 
[[1]] 
time point "2"
1685 x 1685 matrix of doubles: 

              g_1        g_2        g_3 ...     g_1873
g_1     1.0000000 -0.3781295 -0.5869382 ... -0.2721603
g_2    -0.3781295  1.0000000  0.6430009 ... -0.5344707
g_3    -0.5869382  0.6430009  1.0000000 ...  0.4171144
...           ...        ...        ... ...        ...
g_1873 -0.2721603 -0.5344707  0.4171144 ...  1.0000000

...

$CV
list of length 5 
[[1]] 
time point "2"
     g_1      g_2      g_3            g_1873 
1.003978 1.100766 2.021448      ... 1.614518 

...

$CV_ctrl
list of length 5 
[[1]] 
time point "2"
     g_1      g_2      g_3            g_1873 
2.293950 1.146814 1.076170      ... 1.990480 

...

这里的Score会根据ctrl的score进行校准:

##找到每个时间点DNB score最高的基因module
## score <- cv_in * cor_in / cor_out
## score <- score / cv_in_ctrl * cor_out_ctrl / cor_in_ctrl
dnb <- search_candidates(dnb, min_size = 50, max_size = 500)

time point 2 
Hierarchical clustering genes ...
Calculating module attributes...
time point 3 
Hierarchical clustering genes ...
Calculating module attributes...
time point 4 
Hierarchical clustering genes ...
Calculating module attributes...
time point 5 
Hierarchical clustering genes ...
Calculating module attributes...
time point 6 
Hierarchical clustering genes ...
Calculating module attributes...

dnb <- cal_final(dnb)
##Compare and get the time point with highest score, and use that set of
##DNB genes to recalculate DNB attibutes for all time points.
candidates <- get_candidates(dnb)
plot_DNB(candidates)

最终结果的可视化：

dnb_genes <- get_DNB_genes(dnb)
final <- get_final(dnb)
plot_DNB(final)

致谢

本文档由陈洛南组唐诗婕博士构建，如需帮助，请联系bioinformatics[at]sibcb.ac.cn获得唐博士的联系方式。

Session Info

sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)

Matrix products: default
BLAS/LAPACK: /sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/envs/scRNA/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] DNB_0.1.1      knitr_1.46     rmarkdown_2.26

loaded via a namespace (and not attached):
 [1] dendextend_1.17.1 gtable_0.3.4      jsonlite_1.8.8   
 [4] highr_0.10        dplyr_1.1.4       compiler_4.3.3   
 [7] tidyselect_1.2.1  gridExtra_2.3     jquerylib_0.1.4  
[10] scales_1.3.0      yaml_2.3.8        fastmap_1.1.1    
[13] ggplot2_3.5.0     R6_2.5.1          labeling_0.4.3   
[16] generics_0.1.3    viridis_0.6.5     tibble_3.2.1     
[19] bookdown_0.40     distill_1.6       munsell_0.5.1    
[22] bslib_0.7.0       pillar_1.9.0      rlang_1.1.3      
[25] utf8_1.2.4        cachem_1.0.8      ramify_0.3.3     
[28] xfun_0.43         sass_0.4.9        viridisLite_0.4.2
[31] memoise_2.0.1     cli_3.6.2         withr_3.0.0      
[34] magrittr_2.0.3    digest_0.6.35     grid_4.3.3       
[37] rstudioapi_0.16.0 fontawesome_0.5.2 cowplot_1.1.3    
[40] lifecycle_1.0.4   vctrs_0.6.5       downlit_0.4.4    
[43] evaluate_0.23     glue_1.7.0        farver_2.1.1     
[46] fansi_1.0.6       colorspace_2.1-0  tools_4.3.3      
[49] pkgconfig_2.0.3   htmltools_0.5.8.1