Intruduction

Genome wide association studies (GWAS) have been successfully used to identify disease-associated variants, however the causal genes in many diseases remain elusive, due to effects such as linkage disequilibrium (LD) between associated variants and long-range regulation. Direct experimental validation of the many potential causal genes is expensive and difficult, so an attractive first step is to prioritize genes with respect their biological relevance. Numerous evidences suggest genes function in pathway-level, so different disease causal genes might function in the same causal pathway. Pathways can be annotated as un-structured gene sets (Reactome), structured tree (GO) or even networks. We have implemented the algorithm of GRAIL (Raychaudhuri 2009) in R as a package NetGPA and provide both text-mining and coexpression networks. We have demonstrated coexpression network-based gene prioritization is more sensitive when gene expression signatures are provided as seeds (Aldo M. Roccaro and Ghobrial 2016). NetORA is an extension of NetGPA to perform network-based pathway enrichment analysis.

Standard workflow

To demonstrate the input and output data format in NetORA, we use example data sets in NetGPA.

library("NetGPA")
library("NetORA")
data("Example_NetGPA")

names(Example_NetGPA)

[1] "CD_GWAS"        "WM_Seed"        "WM_Query"      
[4] "ExE_Hyper"      "CancerPathway"  "Cancer_GeneSet"

Input

Similar as other pathway-enrichment analysis, NetORA expect annotated pathways and signatures as input. Since it’s a network-based method, a network is also required.

Pathways

In vignettes of NetGPA, we have used gene prioritization to identify pathways that drive DNA methylation transformation (Zachary D. Smith 2017). Here we took an more intuitive approach by directly performing network-based pathway enrichment analysis.

# Genes near hyper-methylated CpG Islands in Mouse ExE
ExE_Hyper     <- Example_NetGPA$ExE_Hyper

We treat ExE_Hyper as an annotated pathway.

Signatures

Signatures are gene sets we want to test. Here we use pathways that are frequently mutated in cancers as signatures.

# show example query genes
Cancer_GeneSet   <- Example_NetGPA$Cancer_GeneSet
names(Cancer_GeneSet)

 [1] "SIGNALLING_BY_NGF"                               
 [2] "SIGNALING_BY_SCF_KIT"                            
 [3] "SIGNALING_BY_ERBB4"                              
 [4] "SIGNALING_BY_ERBB2"                              
 [5] "SIGNALING_BY_EGFR_IN_CANCER"                     
 [6] "NGF_SIGNALLING_VIA_TRKA_FROM_THE_PLASMA_MEMBRANE"
 [7] "SIGNALING_BY_FGFR_IN_DISEASE"                    
 [8] "DOWNSTREAM_SIGNAL_TRANSDUCTION"                  
 [9] "CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM"             
[10] "SIGNALING_BY_FGFR"

Background

Background represents all possible genes evaluated when signatures are defined. For microarray studies, background is all genes measured in a microarray platform; For RNA-seq studies, background is all genes in genome.

Global networks

NetORA us NetGPA a back-end and thus requires networks in the same format, as described below. A gene network is represented as an integer matrix, in which column names are all genes included and each column contains top nearest neighbors of the gene indicated by column name. Here we will use a global text-mining network as an example.

# build a example global gene-network
data(text_2006_12_NetGPA)
networkMatrix <- text_2006_12_NetGPA

dim(networkMatrix)

[1]  1884 18835

networkMatrix[1:10, c("IL12B", "TET2")]

      IL12B  TET2
 [1,]  7408  2361
 [2,]  7475 18712
 [3,]  7453  2851
 [4,]  7410   685
 [5,]  7403 17880
 [6,]  7444 18583
 [7,]  7428 18151
 [8,]  7327   696
 [9,]  7411  1667
[10,]  7459  1808

colnames(networkMatrix)[7408]

[1] "IL12A"

In the example shown above, a network covers 18835 genes and the nearest neighbor of IL12B is shown as 7408, which is the 7408th element of column names (gene IL12A).

Quick Start

Now we have a pathway ExE_Hyper, gene signatures in Cancer_GeneSet and networks in text_2006_12_NetGPA.

# Network-based pathway enrichment analysis
queryTable <- NetORA_Pre(ExE_Hyper, text_2006_12_NetGPA, progressBar = FALSE)

288 regions loaded successfully.
279 regions could be found in database.
18835 genes found in database.

PG <- colnames(text_2006_12_NetGPA)
mergedT <- NetORA_GS(Cancer_GeneSet, queryTable, PG, FDR = 0.05)

mergedT[order(mergedT$pvalue), ]

                                               sgID nSG  nSR nOverlap
10                                SIGNALING_BY_FGFR 108 1368       21
7                      SIGNALING_BY_FGFR_IN_DISEASE 122 1368       22
3                                SIGNALING_BY_ERBB4  87 1368        7
1                                 SIGNALLING_BY_NGF 212 1368       16
6  NGF_SIGNALLING_VIA_TRKA_FROM_THE_PLASMA_MEMBRANE 134 1368        9
4                                SIGNALING_BY_ERBB2  97 1368        6
8                    DOWNSTREAM_SIGNAL_TRANSDUCTION  91 1368        4
2                              SIGNALING_BY_SCF_KIT  75 1368        2
5                       SIGNALING_BY_EGFR_IN_CANCER 105 1368        3
9               CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM 263 1368        3
       ES       pvalue
10 2.6770 8.493137e-06
7  2.4830 2.030136e-05
3  1.1080 2.968675e-01
1  1.0390 3.718025e-01
6  0.9247 5.120482e-01
4  0.8516 5.632468e-01
8  0.6052 7.994476e-01
2  0.3672 9.168084e-01
5  0.3934 9.520597e-01
9  0.1571 9.999963e-01

You can clearly see that only FGF-related signaling pathways are statistically significant.

Pre-computed networks

Building networks is time-consuming, we thus have pre-computed networks for all canonical pathways defined in MSigDB, using FDR 0.05 as cutoff. They can be loaded as a data set in R data(MSigDB_NetORA_GS). Let’s revisit the example we discussed above. In our previous study (Zachary D. Smith 2017), we have identified a list of potential pathways that regulate DNA methylation of a signature genes ExE_Hyper. In above section, we built a network using ExE_Hyper, and tested enrichment potential pathways. Here, we will do it in a reverse way.

data(MSigDB_NetORA_GS)
PG <- colnames(text_2006_12_NetGPA)
CancerPathway <- paste0("REACTOME_", Example_NetGPA$CancerPathway)
mergedT <- NetORA_MSigDB_CP(ExE_Hyper, MSigDB_NetORA_GS[CancerPathway], PG)

mergedT[order(mergedT$pvalue), ]

                                                     Pathway nSG  nSR
10                                REACTOME_SIGNALING_BY_FGFR 260 3673
5                      REACTOME_SIGNALING_BY_FGFR_IN_DISEASE 260 3547
2  REACTOME_NGF_SIGNALLING_VIA_TRKA_FROM_THE_PLASMA_MEMBRANE 260 3276
8                    REACTOME_DOWNSTREAM_SIGNAL_TRANSDUCTION 260 3431
3                                 REACTOME_SIGNALLING_BY_NGF 260 3378
4                                REACTOME_SIGNALING_BY_ERBB2 260 3218
7                       REACTOME_SIGNALING_BY_EGFR_IN_CANCER 260 3373
6                                REACTOME_SIGNALING_BY_ERBB4 260 2241
1                              REACTOME_SIGNALING_BY_SCF_KIT 260 2449
9               REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM 260 2888
   nOverlap    ES       pvalue          FDR
10       97 1.913 5.401353e-12 5.401353e-11
5        91 1.859 1.753193e-10 1.577874e-09
2        66 1.459 4.120740e-04 3.296592e-03
8        68 1.436 5.319814e-04 3.723870e-03
3        65 1.394 1.577755e-03 9.466530e-03
4        62 1.396 1.981885e-03 9.909426e-03
7        62 1.332 5.987946e-03 2.395178e-02
6        39 1.261 5.299439e-02 1.589832e-01
1        38 1.124 1.901741e-01 3.803482e-01
9        41 1.028 3.817132e-01 3.817132e-01

As expected, FGF-related signaling pathways are more statistically significant than others.

Network databases

NetORA could use all networks that are accepted by NetGPA. In current release of NetGPA, we have provided a text-mining network(text_2006_12_NetGPA), a co-expression network(ce_v12_08_NetGPA) and a integrative network(DEPICT_2015_01_NetGPA).

In the future, we will release more networks, including co-expression networks for Mouse and Rat.

Citation

If you use NetORA in published research, please cite NetORA and also (Raychaudhuri 2009).

Session info

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)

Matrix products: default
BLAS/LAPACK: /sibcb2/bioinformatics/software/Miniconda3/lib/libopenblasp-r0.3.15.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] NetORA_0.99.0  devtools_2.4.2 usethis_2.0.1  ggplot2_3.4.2 
[5] NetGPA_0.99.0  rmarkdown_2.20

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1  xfun_0.37         bslib_0.2.5.1    
 [4] remotes_2.4.0     purrr_1.0.1       testthat_3.0.4   
 [7] colorspace_2.0-3  vctrs_0.6.3       generics_0.1.3   
[10] htmltools_0.5.2   yaml_2.3.5        utf8_1.2.2       
[13] rlang_1.1.1       pkgbuild_1.2.0    jquerylib_0.1.4  
[16] pillar_1.9.0      glue_1.6.2        withr_2.5.0      
[19] sessioninfo_1.1.1 lifecycle_1.0.3   munsell_0.5.0    
[22] gtable_0.3.0      memoise_2.0.1     evaluate_0.20    
[25] labeling_0.4.2    knitr_1.42        callr_3.7.0      
[28] fastmap_1.1.0     ps_1.6.0          curl_4.3.2       
[31] fansi_1.0.3       highr_0.9         scales_1.2.0     
[34] cachem_1.0.6      desc_1.3.0        pkgload_1.2.1    
[37] jsonlite_1.8.7    farver_2.1.1      fs_1.5.0         
[40] distill_1.5       digest_0.6.29     bookdown_0.33    
[43] processx_3.5.2    dplyr_1.1.2       rprojroot_2.0.2  
[46] grid_4.1.0        cli_3.6.1         tools_4.1.0      
[49] magrittr_2.0.3    sass_0.4.0        tibble_3.2.1     
[52] crayon_1.5.1      pkgconfig_2.0.3   downlit_0.4.2    
[55] ellipsis_0.3.2    xml2_1.3.5        prettyunits_1.1.1
[58] rstudioapi_0.15.0 R6_2.5.1          compiler_4.1.0

References

Aldo M. Roccaro, Jiantao Shi, Antonio Sacco, and Irene M. Ghobrial. 2016. “Exome Sequencing Reveals Recurrent Germ Line Variants in Patients with Familial Waldenström Macroglobulinemia.” Blood 127 (21): 2598–2606.

Raychaudhuri, Robert M. AND Rossin, Soumya AND Plenge. 2009. “Identifying Relationships Among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions.” PLOS Genetics 5 (6): 1–15. https://doi.org/10.1371/journal.pgen.1000534.

Zachary D. Smith, Hongcang Gu, Jiantao Shi. 2017. “Epigenetic Restriction of Extraembryonic Lineages Mirrors the Somatic Transition to Cancer.” Nature 549 (00): 543–47.

Network-based pathway over-representation analysis with NetORA

Intruduction

Standard workflow

Input

Pathways

Signatures

Background

Global networks

Quick Start

Pre-computed networks

Network databases

Citation

Session info

References

References