Bulk RNA-seq转录调控网络的构建以及转录因子活性分析

背景

我们这里的转录调控网络是指转录因子和靶基因所组成的有向、有权重的网络。有很多方法可以构建此类网络,在这次课程,我们主要学习ARACNE。该方法使用互信息(Mutual information)来衡量转录因子和靶基因之间的的相关性,不仅能捕捉线性相关,也能发现非线性的相关性。

MI

MI

上图中,不同类型的相关性都能被MI识别。

DPI

DPI(Data Processing Inequality)基于信息论中的一个定理,该定理指出:如果变量A、B、C形成一个马尔可夫链A->B->C,那么A和C之间的互信息不会超过A和B之间或B和C之间的互信息。DPI用于识别和消除网络中的间接相互作用。例如,如果基因A调控基因B,基因B调控基因C,DPI可以帮助识别出A和C之间的关系可能是间接的。

在上图中,A和C之间会被判定为间接作用。

全局设定

我们将使用系统内安装的R:

/sibcb/program/install/r-4.0/bin/R

设定R的library path:

.libPaths('/sibcb1/bioinformatics/training_shared/software/rlib/r-4.0_lib/')
library("bcellViper")
library("viper")
library("bcellViper")
library("fgsea")
library("ggplot2")

准备文件:

ln -s /sibcb1/bioinformatics/biocourse/biocourse01/network/raw .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/network_lesson2/out/aracne .

Input

运行ARACNE需要两个文件,表达谱矩阵和转录因子列表。

表达谱矩阵

ex = read.table(file = "raw/GSE6477.txt", row.names = 1, header = TRUE, sep = "\t")
ex[1:10, 1:4]
       GSM148911_ReMM GSM148912_MGUS GSM148914_NewMM GSM148915_NewMM
RFC2         5.102640       4.813578        5.445759        4.977311
HSPA6        6.466330       6.772655        6.467316        7.720870
PAX8         6.249809       6.425345        6.111605        6.553714
GUCA1A       5.871829       5.771262        5.657366        5.667482
THRA         5.435354       5.555236        5.528174        5.142718
PTPN21       5.857098       6.145216        5.890093        5.564663
CCL5         5.286222       5.288690        4.992404        5.128804
CYP2E1       5.427245       5.872459        5.490823        5.463017
EPHB3        7.704572       7.967523        7.560548        7.352723
ESRRA        8.096653       8.393159        7.903102        7.762751

转录因子列表

head raw/Regulator.txt 
AATF
ADNP
ADNP2
AEBP1
AFF3
AHCTF1
AHR
APP
ARID4A
ARNT

运行ARACNE

计算cutotff

java -Xmx16g -jar /sibcb1/bioinformatics/training_shared/software/Aracne.jar \
-e raw/GSE6477.txt \
-o out \
--tfs raw/Regulator.txt \
--pvalue 1E-8 \
--seed 1 \
--calculateThreshold
cat aracne/miThreshold_p1E-8_samples162.txt 
0.1586100591876638

Bootstrap

for i in {1..100}; do 
  java -Xmx16g -jar /sibcb1/bioinformatics/training_shared/software/Aracne.jar \
  -e raw/GSE6477.txt \
  -o aracne \
  --tfs raw/Regulator.txt \
  --pvalue 1E-8 \
  --seed ${i} 
done

Consolidation

java -Xmx16g -jar /sibcb1/bioinformatics/training_shared/software/Aracne.jar \
-o aracne \
--consolidate

运行得到的结果:

/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/bin/tree aracne/
aracne/
├── bootstrapNetwork_1o16trgvqmt17tec7b1ma8njof.txt
├── bootstrapNetwork_1paet1fa307s0v7c9clkigh9eh.txt
├── bootstrapNetwork_1rjmsnbk3aiqq0scre9qro8uki.txt
├── bootstrapNetwork_1ssusd9u3ktlj2ld5ftp402k2k.txt
├── bootstrapNetwork_31hatvjvp5vku1ck9grn8sebs6.txt
├── bootstrapNetwork_32qit5ga1gajn33kjhflhk81a7.txt
├── bootstrapNetwork_333qtbck1qleg4qktj3rqs1mg9.txt
├── bootstrapNetwork_34d2shaua4099mjl7knq33rc6b.txt
├── bootstrapNetwork_4aamtpja7vd3dn1slm9mgo0pdu.txt
├── bootstrapNetwork_4cjutffk09o26oosvotopfqek0.txt
├── bootstrapNetwork_4dt6sldu8j2svqht1phr27i421.txt
├── bootstrapNetwork_4j4qtpg1fqhu2ogb0itbji9jng.txt
├── bootstrapNetwork_5ldat9euf35cmee53vbs1bcs5o.txt
├── bootstrapNetwork_5nmisvd8fdgbfg55m0vu9j4hjp.txt
├── bootstrapNetwork_5su6t3hbmjvcie7jcobertq0p8.txt
├── bootstrapNetwork_5u7et9flmtabbfsjmqvh45lmfa.txt
├── bootstrapNetwork_6v6mt3e8ltiv5k3dg5pv8mt9ng.txt
├── bootstrapNetwork_70fuspaim7tpv5qdq7e1humuti.txt
├── bootstrapNetwork_75eatnibl3208i5rmt5fqhip4v.txt
├── bootstrapNetwork_76nitdelldcv1jqrovpi39ceb0.txt
├── bootstrapNetwork_770qt3cvtnnpqljs30dgc163p2.txt
├── bootstrapNetwork_8n1uthgd3nkdd99idv745ne0g9.txt
├── bootstrapNetwork_8ob6t7en41v86b0ig1r6ef7lub.txt
├── bootstrapNetwork_8qkestd14ba6vcniq2f8n71bcc.txt
├── bootstrapNetwork_8rtms39bcll1oeej443b0ep0ie.txt
├── bootstrapNetwork_a1ratbhn2g1vsusqi6lbd2ue21.txt
├── bootstrapNetwork_a24it1c1arcqlglr4799mao383.txt
├── bootstrapNetwork_a3dqsnabb5nleicre9tfuihou4.txt
├── bootstrapNetwork_a8c6tli4a1rvouk8r0kq7lfith.txt
├── bootstrapNetwork_b9betvin104jj2r2sbf8c6p65o.txt
├── bootstrapNetwork_bakmt5f19afec4i36d3alegrbp.txt
├── bootstrapNetwork_bctusrdb9kq956b38enctmagpr.txt
├── bootstrapNetwork_bi5itvhe8q9e84bhf62tfh00fa.txt
├── bootstrapNetwork_bjeqt5foh4k9162hh8mvo8pllb.txt
├── bootstrapNetwork_cke2tfeb84ssrq9bajhdsq38ti.txt
├── bootstrapNetwork_clnaslclge7rkrubkl5c5hsubj.txt
├── bootstrapNetwork_cruut9gonkmonq0pjdgsncidh2.txt
├── bootstrapNetwork_ct86tff2nu1ngrnpte5304a374.txt
├── bootstrapNetwork_cuheslbco8cm9teq7gp18s5od5.txt
├── bootstrapNetwork_dvgmsvbvn7l641lk0rjjdddblb.txt
├── bootstrapNetwork_e4f2ttjom3pcedv1lib1m0b5so.txt
├── bootstrapNetwork_e5oat3g2ue4b7fk1vkv3uo2r2q.txt
├── bootstrapNetwork_e61it9ccuof601d29lj67vsg8s.txt
├── bootstrapNetwork_f7qthhlrsfio7p3r4jj2d56en.txt
├── bootstrapNetwork_fd8etnh2st6utjia1pp0trtj6h.txt
├── bootstrapNetwork_ffhmtdfct7hpmlbabqd76jl8ki.txt
├── bootstrapNetwork_fgqusjdn5hskfn2als15fretqk.txt
├── bootstrapNetwork_gh2t7fvs6qdh9g4567hb4ursp.txt
├── bootstrapNetwork_h0s2t1h4bhp82ao0orqp8hmqpr.txt
├── bootstrapNetwork_h15at7debr46rcf12serhpeg7t.txt
├── bootstrapNetwork_h2eistboc5f1ke61cu2tqh85lu.txt
├── bootstrapNetwork_h7cutrjhb1jbtqff1kqc347vlb.txt
├── bootstrapNetwork_h8m6t1frjbu6nc6fbmeecbvkrd.txt
├── bootstrapNetwork_hqastc9sh58ab94f7rnjsmh2q.txt
├── bootstrapNetwork_i9letbeeab6qhgd9d18sgt983j.txt
├── bootstrapNetwork_ibumshcoilhlai49n3sup50thl.txt
├── bootstrapNetwork_ih6atlgrhr0qdg4nlr8fbfmcv3.txt
├── bootstrapNetwork_iifitbf5i5bl6htnvsshjng2d5.txt
├── bootstrapNetwork_ijoqshdfqfmjvjio1ugjsv9nj7.txt
├── bootstrapNetwork_jko2src2hfv7q7phr9b210jb3d.txt
├── bootstrapNetwork_jqvmtfg5gle8t5rvq1miir8q8s.txt
├── bootstrapNetwork_js8ut5efovp7m7h043agrj2fut.txt
├── bootstrapNetwork_jti6srapp942f980e5un4aq54v.txt
├── bootstrapNetwork_kvqmsb9moinh2v7q9hd7hjtdr7.txt
├── bootstrapNetwork_l3fqtjj5v4gojpo7s7gjhv1ici.txt
├── bootstrapNetwork_l4p2t9hfverncrf8684hqmr7qk.txt
├── bootstrapNetwork_l52atfdpvo6i5t688aok3ekt0l.txt
├── bootstrapNetwork_l7bisla403hguuv8ibcmc6cimn.txt
├── bootstrapNetwork_mc96ttgftuub3fdg0dumpajvua.txt
├── bootstrapNetwork_meiet3eq6895s14gifil22dlcc.txt
├── bootstrapNetwork_mfrmspd46ik0l2tgsg6rba5aie.txt
├── bootstrapNetwork_nm2itnhq4nbpil2okkcm164d83.txt
├── bootstrapNetwork_nnbqtde452mkbmroul0oadu2m4.txt
├── bootstrapNetwork_npl2sjcedc1j4ogp0nkqilno46.txt
├── bootstrapNetwork_nquas98odmcdtq9pao8srthda8.txt
├── bootstrapNetwork_o062tdercsrj0o879hkdd86svm.txt
├── bootstrapNetwork_o1fasjb5l66dpq17ji8fmfui5o.txt
├── bootstrapNetwork_o6dmthiuk2ak36al89vtuiscd5.txt
├── bootstrapNetwork_o7mut7f8kclis81liajs7qm1j7.txt
├── bootstrapNetwork_o906tddikm0dlpolsc82gifn18.txt
├── bootstrapNetwork_pavesnc5jm91gdvfln2cl3pa9e.txt
├── bootstrapNetwork_pg72trg8iso6irvtkfe16uepmt.txt
├── bootstrapNetwork_phgat1eir631btmtuh23fm6f4v.txt
├── bootstrapNetwork_pipisncsrgds5ffu0im5oe04b0.txt
├── bootstrapNetwork_qp0etlhipl5l21l60ms0ea170l.txt
├── bootstrapNetwork_qr9mtbfspvgjrja6aog6n1osmn.txt
├── bootstrapNetwork_qsiusha72arekl36kp4509ihsp.txt
├── bootstrapNetwork_s2gitpiio588olje2rm1ddpv4c.txt
├── bootstrapNetwork_s3pqtfgsofj7hn8ecta7m5hkie.txt
├── bootstrapNetwork_s432t5d70pu2b91emuu5utb9of.txt
├── bootstrapNetwork_s6casrbh138t4aoep0i87l4veh.txt
├── bootstrapNetwork_tb9utjht6uln8b6mf244l9acm4.txt
├── bootstrapNetwork_tdj6t9e7790i1svmh3oath2246.txt
├── bootstrapNetwork_tesesvch7jbcqummr5c96otna7.txt
├── bootstrapNetwork_ul3atth75o35ngtur8ibskqq7s.txt
├── bootstrapNetwork_umcit3fhe2e0h2kv5a6a5skfdu.txt
├── bootstrapNetwork_uolqspdrecova4bvfbqce4e4s0.txt
├── bootstrapNetwork_upv2sf85en3q362vhdeenc7q21.txt
├── bootstrapNetwork_uutetdhudi80cicde45t0f5k1e.txt
├── bootstrapNetwork_vvsmt7ghcigk7mj77f0b4gd7hl.txt
├── miThreshold_p1E-8_samples162.txt
├── network_1.txt
└── network.txt

1 directory, 103 files

查看networks

head aracne/network.txt 
Regulator   Target  MI  pvalue
FUBP3   TOPORS  0.4333787339157187  0.0
ZNF593  MRPL4   0.3711633496258149  0.0
TAF12   MRC1    0.274655604191936   0.02477282013150539
CBX2    PRB3    0.3652975573975554  0.02477282013150539
CBX2    PRB4    0.34497953379788715 0.0
CBX2    PRG3    0.32253416742820235 0.02477282013150539
CTBP1   PRKAR1A 0.3515585095031235  6.558268594858419E-8
HMGA2   ADORA2B 0.25260373635361727 0.02477282013150539
ZNF415  DRAP1   0.29908699522267373 0.02477282013150539

导入ARACNE构建的network

network <- read.table(file = "aracne/network.txt", row.names = NULL, header = TRUE, sep = "\t")
head(network)
  Regulator Target        MI     pvalue
1     FUBP3 TOPORS 0.4333787 0.00000000
2    ZNF593  MRPL4 0.3711633 0.00000000
3     TAF12   MRC1 0.2746556 0.02477282
4      CBX2   PRB3 0.3652976 0.02477282
5      CBX2   PRB4 0.3449795 0.00000000
6      CBX2   PRG3 0.3225342 0.02477282

VIPER

构建了网络之后,我们可以用VIPER来进行网络分析,检测两种状态之间转录因子活性的变化。

导入ARACNE构建的网络

ExS = as.matrix(read.table(file = "raw/GSE6477.txt", row.names = 1, header = TRUE, sep = "\t"))
regulon = suppressWarnings(aracne2regulon("aracne/network.txt", ExS))
number of iterations= 335 
head(names(regulon))
[1] "AATF"   "ADNP"   "ADNP2"  "AEBP1"  "AFF3"   "AHCTF1"

查看每个转录因子的靶基因:

head(regulon[["CBX2"]]$tfmode)
      PRB3       PRB4       PRG3       PPIE       PPCS       POGZ 
 0.9862316  0.9970921  0.8368823 -0.9810494 -0.9728839 -0.9175349 
head(regulon[["CBX2"]]$likelihood)
[1] 0.3652976 0.3449795 0.3225342 0.3522193 0.3375608 0.3320467

激活的靶基因:

plot(ExS["CBX2", ], ExS["PRB3", ], xlab = "CBX2", ylab = "PRB3", main = "CBX2 vs PRB3", type = "p", col = "red")

抑制的靶基因:

plot(ExS["CBX2", ], ExS["PPIE", ], xlab = "CBX2", ylab = "PPIE", main = "CBX2 vs PPIE", type = "p", col = "blue")

定义要比较的类别

我们分析多发性骨髓瘤中新发的病例:

idx <- sapply(strsplit(colnames(ExS), "_"), function(x) x[2]) == "NewMM"
Ex <- ExS[, idx]
dim(Ex)
[1] 12494    73

基因水平的均一化:

ExC <- Ex - apply(Ex, 1, mean)
ExC[1:10, 1:3]
       GSM148914_NewMM GSM148915_NewMM GSM148917_NewMM
RFC2        0.29945900     -0.16898876    -0.268771656
HSPA6      -0.36654649      0.88700750     0.377417082
PAX8       -0.23744655      0.20466264    -0.003593696
GUCA1A     -0.27842014     -0.26830384     0.306076143
THRA        0.16143512     -0.22402003    -0.200295535
PTPN21     -0.07940156     -0.40483195     0.088706134
CCL5       -0.24402822     -0.10762812    -0.058526014
CYP2E1      0.03985734      0.01205131     0.156893418
EPHB3       0.11090254     -0.09692283     0.194864154
ESRRA      -0.05772633     -0.19807728    -0.209893059

导入用来分类的标志物:

load("raw/GSE121007_DGE_top.RData")
lapply(sigList[[1]], head)
$UP
[1] "TXNIP"      "RELN"       "AC245060.6" "IL2RB"      "IL32"      
[6] "TRPC4"     

$DN
[1] "CCL4"   "NR4A2"  "FOS"    "CCL4L2" "JUND"   "DUSP1" 

$PG
[1] "DPM1"     "SCYL3"    "C1orf112" "FUCA2"    "GCLC"     "NFYA"    

我们可以使用rank-sum test检验这两个标志物在样本中的富集程度

SimpleRankTest <- function(Pvalue, gsList, size = 4) {
    # The Pvalue is pvalue vector with Entrez ID as names
    # gsList is the gene set list

            GeneID <- names(Pvalue)
            M <- length(gsList)
            zscore <- rep(NaN,M)

            N <- length(Pvalue)
            RankP <- rank(Pvalue,na.last = TRUE)

            for(i in 1:M) {

                    Index <- GeneID %in% gsList[[i]]
                    n1 <- sum( Index )

                    if(n1 < size)
                            next
                    n2 <- N - n1
                    R1 <- sum(RankP[Index])
                    K1 <- R1 - n1*(n1+1)/2
                    Umean <- n1*n2/2
                    Uvar  <- sqrt(n1*n2*(n1+n2+1)/12)
                    zscore[i] <- (K1-Umean)/Uvar
            }

            # return values
            names(zscore) <- names(gsList)
            return(zscore)
    }

计算所有的样本:

gsList <- sigList[[1]][c("UP", "DN")]
M <- ncol(ExC)
zPOS <- rep(0, M)
zNEG <- rep(0, M)

for(i in 1:M){
  
  fcVector <- ExC[,i]
  
  tempZ    <- SimpleRankTest(fcVector, gsList)
  zPOS[i]  <- tempZ[1]
  zNEG[i]  <- tempZ[2]
}
zDIF        <- zPOS - zNEG
names(zDIF) <- colnames(Ex)
zDIF        <- sort(zDIF)

检查富集程度:

cbind(zDIF)
                       zDIF
GSM149010_NewMM -10.7867863
GSM148961_NewMM  -7.6253545
GSM149059_NewMM  -6.7517171
GSM149021_NewMM  -6.3821158
GSM148979_NewMM  -6.0103738
GSM149018_NewMM  -5.7146125
GSM148925_NewMM  -4.0913544
GSM148960_NewMM  -4.0857549
GSM149017_NewMM  -4.0225267
GSM148989_NewMM  -3.4952527
GSM149038_NewMM  -3.4242261
GSM148940_NewMM  -3.3006190
GSM149002_NewMM  -3.0070938
GSM148944_NewMM  -2.5112228
GSM148973_NewMM  -2.4443350
GSM148978_NewMM  -2.3576782
GSM148964_NewMM  -2.2148657
GSM148999_NewMM  -2.2019378
GSM148981_NewMM  -2.0402653
GSM148955_NewMM  -2.0009582
GSM149005_NewMM  -1.8851118
GSM148968_NewMM  -1.8690704
GSM148918_NewMM  -1.5680894
GSM149048_NewMM  -1.5245688
GSM149004_NewMM  -1.5161443
GSM148971_NewMM  -1.5103623
GSM149049_NewMM  -1.1769248
GSM148987_NewMM  -1.1277184
GSM148921_NewMM  -1.0566599
GSM148939_NewMM  -0.8764960
GSM149052_NewMM  -0.5200675
GSM149016_NewMM  -0.2851927
GSM149037_NewMM  -0.2440378
GSM148986_NewMM   0.1161754
GSM148914_NewMM   0.1659326
GSM148997_NewMM   0.2203500
GSM149050_NewMM   0.3441627
GSM148949_NewMM   0.4429702
GSM148956_NewMM   0.4638226
GSM148917_NewMM   0.6611341
GSM148991_NewMM   0.9454948
GSM148915_NewMM   1.0190211
GSM149025_NewMM   1.0496826
GSM148951_NewMM   1.1105393
GSM148946_NewMM   1.1835128
GSM148974_NewMM   1.5657098
GSM149040_NewMM   1.5941324
GSM149009_NewMM   2.0572438
GSM149055_NewMM   2.2552075
GSM148998_NewMM   2.3816011
GSM149054_NewMM   2.4192577
GSM148988_NewMM   2.5358233
GSM149019_NewMM   2.6342345
GSM149031_NewMM   2.6785674
GSM149007_NewMM   2.7504196
GSM148958_NewMM   3.0686307
GSM148947_NewMM   3.1064238
GSM148938_NewMM   3.1728299
GSM149044_NewMM   3.7870587
GSM148922_NewMM   3.9556022
GSM148942_NewMM   4.1914853
GSM148950_NewMM   4.4281523
GSM149006_NewMM   4.6457489
GSM148990_NewMM   5.1434197
GSM148923_NewMM   5.3091977
GSM149061_NewMM   5.6542414
GSM149033_NewMM   6.2964675
GSM148995_NewMM   6.9684832
GSM148993_NewMM   7.0429813
GSM149056_NewMM   7.1122615
GSM148967_NewMM   8.2635866
GSM149008_NewMM   9.2495623
GSM148983_NewMM  11.2821098

根据已知 signature z-score 高、低将 NewMM 样本分为 high/low 两组,并各取其表达谱:

cutoff <- median(zDIF)
hName  <- names(zDIF)[zDIF > cutoff]
lName  <- names(zDIF)[zDIF < cutoff]
hM     <- Ex[, hName]
lM     <- Ex[, lName]

转录因子活性的变化

先比较两类之间表达谱的差异:

signature <- rowTtest(hM, lM)
head(data.frame(signature$statistic, signature$p.value))
       signature.statistic signature.p.value
RFC2            0.65903165       0.512037596
HSPA6          -1.65906375       0.101577316
PAX8           -0.52824854       0.598997606
GUCA1A          0.03782808       0.969932434
THRA           -2.87466306       0.005353084
PTPN21         -0.55973305       0.577448303

将p-value转成Z-score:

signzed_zscore <- (qnorm(signature$p.value/2, lower.tail = FALSE) *sign(signature$statistic))[, 1]
head(signzed_zscore)
       RFC2       HSPA6        PAX8      GUCA1A        THRA 
 0.65566826 -1.63725444 -0.52584255  0.03769303 -2.78498110 
     PTPN21 
-0.55711582 

构建背景:

nullmodel <- ttestNull(hM, lM, per = 1000, repos = TRUE, seed = 1, verbose = FALSE)
nullmodel[1:5,1:5]
                 1          2           3          4          5
RFC2    0.70341900 -2.0957027  0.14256862 -0.3056710 -0.6340376
HSPA6   0.96298940 -0.2053427 -0.08795436 -0.4622612  0.6203836
PAX8   -1.07571418  1.2828265 -0.07530774  1.9562127  0.9869166
GUCA1A -0.04172332  1.2565216 -0.91892153  2.0362804 -0.2365597
THRA   -0.80796274 -0.6313287  0.09873444  0.4074121  1.0040930

检验转录因子活性的变化:

mrs <- msviper(signzed_zscore, regulon, nullmodel, verbose = FALSE)
res <- summary(mrs, 100)
head(res)
       Regulon Size  NES  p.value    FDR
SATB2    SATB2   68 3.63 0.000282 0.0724
MYC        MYC  206 3.25 0.001140 0.0724
HMGA1    HMGA1  278 3.23 0.001250 0.0724
BRD8      BRD8  178 3.21 0.001330 0.0724
TAF1B    TAF1B   97 3.19 0.001400 0.0724
CDKN2A  CDKN2A  167 3.14 0.001670 0.0724

画出最显著的前30个转录因子:

plot(mrs, 30, cex = .9)

使用fgsea检查MYC的富集性:

mycTargets = regulon[["MYC"]]$tfmode
gs = list(activated = names(mycTargets)[mycTargets > 0], repressed = names(mycTargets)[mycTargets < 0])
res = suppressWarnings(fgsea(gs, signzed_zscore, minSize  = 15, maxSize = 500))
res
     pathway         pval         padj   log2err         ES       NES
      <char>        <num>        <num>     <num>      <num>     <num>
1: activated 1.000000e-50 2.000000e-50        NA  0.7554677  3.675944
2: repressed 2.124813e-05 2.124813e-05 0.5756103 -0.5559996 -2.105326
    size  leadingEdge
   <int>       <list>
1:   166 PUS7, NO....
2:    40 APOC1, T....

展示富集结果:

plotEnrichment(gs[["activated"]], signzed_zscore) + labs(title="MYC activated targets")