Bisulfite sequencing (BS-seq) remains the gold standard technique to detect DNA methylation profiles at single-nucleotide resolution. The DNA methylation status of CpG sites on the same fragment represents a discrete methylation haplotype (mHap). The mHap-level metrics were demonstrated to be promising cancer biomarkers and explain more gene expression variation than average methylation. However, most existing tools focus on average methylation and neglect mHap patterns. Previously, we have developped mHapTk, a comprehensive toolkit for the analysis of DNA methylation haplotypes. Here, we present mHapSuite, a Java reimplementation of mHapTk for the analysis of mhaps. mHapSuite shows significant improvment in terms of both functionality and efficiency.
It is written in Java, and should run in Windows, Linux and macOS with java 1.8 installed. The source code of mHapSuite is hosted in github and the jar file is ready to use.
Accession Number | Description | mHap | index |
---|---|---|---|
GSE149608 | esophageal squamous cell carcinoma (Pool of 10 samples) |
esophagus_T.mhap.gz | esophagus_T.mhap.gz.tbi |
GSE149608 | adjacent normal esophageal tissue (Pool of 10 samples) |
esophagus_N.mhap.gz | esophagus_N.mhap.gz.tbi |
GSM4505856 | adjacent normal esophageal tissue | SRX8208802.mhap.gz | SRX8208802.mhap.gz.tbi |
File name | Description | Download |
---|---|---|
ESCC_Hyper_Promoter.BED | Promoters (-1 kb to 1 kb) that overlap with hyper-DMRs in ESCC (GSE149608) | Download |
ESCC_Hypo_Promoter.BED | Promoters (-1 kb to 1 kb) that overlap with hypo-DMRs in ESCC (GSE149608) | Download |
esophagus_UMRs.bed | UMRs called from normal esophagus tissues | Download |
esophagus_LMR.bed | LMRs called from normal esophagus tissues | Download |
esophagus_PMDs.bed | PMDs called from normal esophagus tissues | Download |
esophagus_HMRs.bed | HMRs called from normal esophagus tissues | Download |
esophagus_T_MHB.bed | MHBs called form esophagus 10 tumor samples (GSE149608) | Download |
esophagus_N_MHB.bed | MHBs called from esophagus 10 normal samples (GSE149608) | Download |
esophagus_open.bed | Open chromatin regions called from normal esophagus tissues | Download |
Metric | Tumor BigWig file | Normal BigWig file |
---|---|---|
Mean methylation | esophagus_T_MM.bw | esophagus_N_MM.bw |
PDR | esophagus_T_PDR.bw | esophagus_N_PDR.bw |
CHALM | esophagus_T_CHALM.bw | esophagus_N_CHALM.bw |
MHL | esophagus_T_MHL.bw | esophagus_N_MHL.bw |
MBS | esophagus_T_MBS.bw | esophagus_N_MBS.bw |
MCR | esophagus_T_MCR.bw | esophagus_N_MCR.bw |
Entropy | esophagus_T_Entropy.bw | esophagus_N_Entropy.bw |
R2 | esophagus_T_R2.bw | esophagus_N_R2.bw |
In order to run mHapSuite, some annotation files are needed, including CpG position files and chromosome size files. CpG position files can be downloaed from this page and chromosome size files can be downloaded from UCSC genome browser.
Genome | CpG sites | chromosome size |
---|---|---|
hg19 | hg19_CpG.gz hg19_CpG.gz.tbi |
hg19.chrom.sizes |
hg38 | hg38_CpG.gz hg38_CpG.gz.tbi |
hg38.chrom.sizes |
mm9 | mm9_CpG.gz mm9_CpG.gz.tbi |
mm9.chrom.sizes |
mm10 | mm10_CpG.gz mm10_CpG.gz.tbi |
mm10.chrom.sizes |
CpG annotation file for other genomes can be created by a convenient tool, as described in this page.
mHapSuite is a Java toolkit for the analysis of DNA methylation haplotypes. It has 12 sub-commands: convert, merge, tanghulu, mHapView, MHBDiscovery, stat, genomeWide, scatterPlot, boxPlot, heatMapPlot, profilePlot and enrichmentPlot.
mHapSuite: a Java toolkit for analysis of DNA methylation haplotypes
Usage: java -jar mHapSuite <command> [options]
Commands:
convert convert BAM files to mHap files
merge merge multiple mHap files
tanghulu plot the DNA methylation status for mHaps in a region
mHapView show linkage disequilibrium (LD) scores of CpGs in a region
MHBDiscovery identification of methylation haplotype blocks within a region or genome-wide
stat calculate DNA methylation metrics for mHaps that cover predefined regions
genomeWide calculate DNA methylation metrics for mHaps that cover each CpG site across the genome
scatterPlot show a scatter plot for two DNA methylation metrics
boxPlot show the distribution of metric values from multiple samples as a boxplot
heatMapPlot show signals around the center of given regions as a heatmap
profilePlot show the average profiles of signals in the predefined intervals and their flanking regions
enrichmentPlot show the perventage of genomic features that overlap with predefined open chromatin regions
This function converts SAM/BAM format file to mHap format file. It takes an indexed Bisulfite-seq BAM and CpGs position files as inputs to extract DNA methylation haplotypes.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar convert --inputFile <in.bam> \
--cpgPath <CpG.gz> [--region chr:start-end] [--bedPath <in.bed>] \
--outputFile <out> [--nonDirectional] [--mode mode] [--pat pat]
Options:
-inputFile,--inputFile <args> input file, SAM/BAM format, should be sorted by samtools
-cpgPath,--cpgPath <args> genomic CpG file, gz format and indexed
-region,--region <args> one region, in the format of chr:start-end
-bedPath,--bedPath <args> a bed file, one query region per line
-outPutFile,--outPutFile <args> output filename, default: out.mhap.gz
-nonDirectional,--nonDirectional do not group results by the direction of reads.
-mode,--mode <args> sequencing mode, TAPS | BS (default)
-pat,--pat whether inputPath is pat file
Example of usage:
java -jar mHapSuite-2.0-jar-with-dependencies.jar convert -inputFile SRX8472828.bam \
-cpgPath hg19_CpG.gz -outPutFile test.mhap.gz
This function merges multiple sorted mHap files, produce a single sorted mHap file.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar merge --inputFile <1.mhap.gz 2.mhap.gz ...> \
--cpgPath <CpG.gz> --outputFile <out>
Options:
-inputFile,--inputFile <args> multiple mHap files, gz format, should be indexed
-cpgPath,--cpgPath <args> genomic CpG file, gz format and indexed
-outPutFile,--outPutFile <args> output filename, default: out.mhap.gz
Example of usage:
java -jar mHapSuite-2.0-jar-with-dependencies.jar merge -inputFile SRX8208800.mhap.gz \
-cpgPath hg19_CpG.gz -outPutFile out.mhap.gz SRX8208801.mhap.gz
This function is to visualize the methylation status of the reads that cover a given region. Tanghulu plot was originally introduced in CGmapTools.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar tanghulu --mhapPath <mhap.gz> \
--cpgPath <CpG.gz> --region chr:start-end --tag <out> [--outFormat outFormat] [--strand strand] \
[--maxReads maxReads] [--maxLength maxLength] [--merge] [--simulation] [--cutReads]Options:
-mhapPath,--mhapPath <arg> input file, mhap.gz format, generated by mHapTools and indexed
-cpgPath,--cpgPath <arg> genomic CpG file, gz format and indexed
-region,--region <arg> one region, in the format of chr:start-end
-tag,--tag <arg> prefix of the output file(s)
-outFormat,--outFormat <arg> output format,pdf or png [pdf]
-strand,--strand <arg> plus,minus,both [both]
-maxReads,--maxReads <arg> the max number of reads to plot [50]
-maxLength,--maxLength <arg> the max length of region to plot [2000]
-merge,--merge indicates whether identical mHaps should be merged
-simulation,--simulation indicates whether mHaps should be simulated
-cutReads,--cutReads indicates whether only keep CpGs in the defined region
Example of usage:
java -jar mHapSuite-2.0-jar-with-dependencies.jar tanghulu --mhapPath data/SRX8208802.mhap.gz \
--cpgPath idx/hg19_CpG.gz --region chr1:14435-14653 -tag SRX8208802
When setting the --merge
option, identical mHaps will be merged. This can be helpful when there are too many reads in a region.
java -jar mHapSuite-2.0-jar-with-dependencies.jar tanghulu --mhapPath data/SRX8208802.mhap.gz \
--cpgPath idx/hg19_CpG.gz --region chr1:14435-14653 -tag SRX8208802 --merge
Sometimes, we may want to visualize the mHap patterns of different samples that were sequenced with different depth. With the --simulation
option, a given number (20 by default) of mHaps were simulated to maximize the likelihood given the mHaps observed in each sample.
java -jar mHapSuite-2.0-jar-with-dependencies.jar tanghulu --mhapPath data/SRX8208802.mhap.gz \
--cpgPath idx/hg19_CpG.gz --region chr1:14435-14653 -tag SRX8208802 --simulation
The co-methylation level of pairwise CpGs are measure by linkage disequilibrium r-squared, caculated from individual reads rather than mean methylation. This tool calculates the R2 scores between pairs of CpG sites in a given region. Negative values indicate negative correlation. It can also plot the CpG linkage, which is similar to the Haploview plot.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar mHapView --mhapPath <mhap.gz> \
--cpgPath <CpG.gz> [--region chr:start-end] [--bedPath <in.bed>] --tag <tag> [--outFormat outFormat] [--strand strand]
Options:
-mhapPath,--mhapPath <arg> input file, mhap.gz format, generated by mHapTools and indexed
-cpgPath,--cpgPath <arg> genomic CpG file, gz format and indexed
-region,--region <arg> one region, in the format of chr:start-end
-bedPath,--bedPath <arg> a bed file
-tag,--tag <arg> prefix of the output file(s)
-outFormat,--outFormat <arg> output format,pdf or png [pdf]
-strand,--strand <arg> plus,minus,both [both]
java -jar mHapSuite-2.0-jar-with-dependencies.jar mHapView --mhapPath data/esophagus_T.mhap.gz \
--cpgPath idx/hg19_CpG.gz --region chr1:2121159-2121449 -tag esophagus_T
DNA methylation of adjacent CpG sites can be co-methylated and forms methylation haplotype blocks (MHBs). It was original introduced in Guo et al., 2017.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar MHBDiscovery --mhapPath <mhap.gz> \
--cpgPath <CpG.gz> [--region chr:start-end] [--bedPath <in.bed>] -outputDir outputDir --tag <tag> \
[--window window] [--r2 r2] [--pvalue pvalue]Options:
-mhapPath,--mhapPath <arg> input file, mhap.gz format, generated by mHapTools and indexed
-cpgPath,--cpgPath <arg> genomic CpG file, gz format and indexed
-region,--region <arg> one region, in the format of chr:start-end
-bedPath,--bedPath <arg> a bed file, one query region per line
-outputDir,--outputDir <arg> output directory, created in advance
-tag,--tag <arg> prefix of the output file(s) default: MHBDiscovery_out
-window,--window <arg> Size of core window [5]
-r2,--r2 <arg> R square cutoff [0.5]
-pvalue,--pvalue <arg> P value cutoff [0.05]
java -jar mHapSuite-2.0-jar-with-dependencies.jar MHBDiscovery -mhapPath data/esophagus_T.mhap.gz \
-cpgPath idx/hg19_CpG.gz -window 5 -r2 0.5 -pvalue 0.05 -outputDir out -tag esophagus_T_MHB
java -jar mHapSuite-2.0-jar-with-dependencies.jar MHBDiscovery -mhapPath data/esophagus_N.mhap.gz \
-cpgPath idx/hg19_CpG.gz -window 5 -r2 0.5 -pvalue 0.05 -outputDir out -tag esophagus_N_MHB
Calculate the methylation statistics in a single region or multiple regions. When calculating a single interval, the –region parameter should be defined. If there are multiple intervals, a BED file needs to be specified.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar stat --metrics <metrics> \
--mhapPath <mhap.gz> --cpgPath <CpG.gz> [--region chr:start-end] [--bedPath <in.bed>] \
--outputFile <out.tsv> [--minK minK] [--maxK maxK] [--K K] [--strand strand] [--cutReads] [--r2Cov r2Cov]
Options:
-metrics,--metrics <arg> mHap-level metrics, including MM, PDR, CHALM, MHL, MCR, MBS, Entropy and R2 [None]
-mhapPath,--mhapPath <arg> input file, mhap.gz format, generated by mHapTools and indexed
-cpgPath,--cpgPath <arg> genomic CpG file, gz format and indexed
-region,--region <arg> one region, in the format of chr:start-end
-bedPath,--bedPath <arg> input BED file
-outputFile,--outputFile <arg> output filename, default: stat.out.tsv
-minK,--minK <arg> minimum k-mer length for MHL [1]
-maxK,--maxK <arg> maximum k-mer length for MHL [10]
-K,--K <arg> k-mer length for entropy, PDR, and CHALM, can be 3, 4, or 5 [4]
-cutReads,--cutReads indicates whether only keep CpGs in the defined region
-strand,--strand <arg> plus,minus,both [both]
-r2Cov,--r2Cov <arg> minimal number of reads that cover two CpGs for R2 calculation [20]
Example of usage:
java -jar mHapSuite-2.0-jar-with-dependencies.jar stat -metrics MM PDR CHALM MHL MCR MBS Entropy R2 \
-mhapPath esophagus_T.mhap.gz -cpgPath hg19_CpG.gz -bedPath CRC_MHB_non_NC.bed -outputFile stat.out.tsv \
-minK 1 -maxK 10 -K 4 -strand both -r2Cov 10
Calcuate mHap-level metrics for all reads that cover each individual CpG sites in a genome-wide manner and save the results as bedGraph files.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar genomeWide --metrics <metrics> \
--mhapPath <mhap.gz> --cpgPath <CpG.gz> [--region chr:start-end] [--bedPath <in.bed>] \
--outputDir <outputDir> --tag <tag> [--minK minK] [--maxK maxK] [--K K] [--strand strand] \
[--cpgCov cpgCov] [--r2Cov r2Cov] [--k4Plus k4Plus]Options:
-metrics,--metrics <arg> mHap-level metrics,including Cov, MM, PDR, CHALM, MHL, MCR, MBS, Entropy, and R2
-mhapPath,--mhapPath <arg> input file, mhap.gz format, generated by mHapTools and indexed
-cpgPath,--cpgPath <arg> genomic CpG file, gz format and indexed
-region,--region <arg> one region, in the format of chr:start-end
-bedPath,--bedPath <arg> a bed file, one query region per line
-outputDir,--outputDir <arg> output directory, created in advance
-tag,--tag <arg> prefix of the output file(s), default: genomeWide_out
-minK,--minK <arg> minimum k-mer length for MHL [1]
-maxK,--maxK <arg> maximum k-mer length for MHL [10]
-K,--K <arg> k-mer length for entropy, PDR, and CHALM, can be 3, 4, or 5 [4]
-strand,--strand <arg> plus,minus,both [both]
-cpgCov,--cpgCov <arg> minimal number of CpG coverage for MM calculation [5]
-r2Cov,--r2Cov <arg> minimal number of reads that cover two CpGs for R2 calculation [20]
-k4Plus,--k4Plus <arg> minimal number of reads that cover 4 or more CpGs for PDR, CHALM, MHL, MCR, MBS and Entropy [5]
This function generates a scatter plot for two variables. It can be two DNA methylation metrics from the same sample or one metric from two samples.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar scatterPlot --bedPath <in.bed> \
--bigwig1 <first.bw> --bigwig2 <second.bw> --tag <tag> --outFormat <outFormat>
Options:
-bedPath,--bedPath <arg> a bed file, one query region per line
-bigwig1,--bigwig1 <arg> the first input bigwig file of a metrics rom a sample
-bigwig2,--bigwig2 <arg> the second input bigwig file of another metrics, from the same sample of bigwig1, or from another sample
-tag,--tag <arg> prefix of the output file(s)
-outFormat,--outFormat <arg> output format,pdf or png [pdf]
java -jar mHapSuite-2.0-jar-with-dependencies.jar scatterPlot -bedPath idx/hg19_cpgisland.bed \
-bigwig1 bw/T/esophagus_T_MM.bw -bigwig2 bw/T/esophagus_T_PDR.bw -tag MM_vs_PDR -outFormat png
A box plot which show the distribution of metric values from multiple samples.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar boxPlot --bedPath <in.bed> \
--bigwigs <first.bw second.bw> --tag <tag> --outFormat <outFormat>
Options:
-bedPath,--bedPath <arg> a bed file, one query region per line
-bigwigs,--bigwigs <arg> multi bigwig files, split by blank character
-tag,--tag <arg> prefix of the output file(s)
-outFormat,--outFormat <arg> output format,pdf or png [pdf]
java -jar mHapSuite-2.0-jar-with-dependencies.jar boxPlot -bedPath BED/ESCC_Hyper_Promoter.BED \
--bigwigs bw/T/esophagus_T_MM.bw bw/N/esophagus_N_MM.bw bw/T/esophagus_T_Entropy.bw bw/N/esophagus_N_Entropy.bw --tag MM_PDR --outFormat pdf
This function plots signals around the center of given regions. The lower panel shows the signals in each interval as a heatmap, and the average signals across all regions in are shown in the upper panel. This function was inspired by plotHeatmap in deeTools.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar heatMapPlot --bedPaths <first.bed second.bed> \
--bigwig <in.bw> [--upLength <upLength>] [--downLength <downLength>] [--window <window>] --tag <tag> --outFormat <outFormat>
Options:
-bedPaths,--bedPaths <arg> multi bed files path, one query region per line in a bed file
-bigwig,--bigwig <arg> the input bigwig file of a metrics from a sample
-upLength,--upLength <arg> the length of upstream region from the center point [5000]
-downLength,--downLength <arg> the length of downstream region from the center point [5000]
-window,--window <arg> the length of the window [100]
-tag,--tag <arg> prefix of the output file(s)
-outFormat,--outFormat <arg> output format,pdf or png [pdf]
java -jar mHapSuite-2.0-jar-with-dependencies.jar heatMapPlot -bedPaths BED/ESCC_Hyper_Promoter.BED BED/ESCC_Hypo_Promoter.BED \
-upLength 5000 -downLength 5000 -window 200 --sortRegions descend -bigwig bw/N/esophagus_N_MM.bw \
-tag heatMapPlot_N -outFormat png
We can also generate this type of plot using deepTools. However, you might get slightly different results since deepTools plot signals near minimal and maximal cutoffs as missing values.
/sibcb/program/install/deeptools-3.2.1/bin/computeMatrix reference-point --referencePoint center \
-b 5000 -a 5000 -R BED/ESCC_Hyper_Promoter.BED BED/ESCC_Hypo_Promoter.BED -S bw/N/esophagus_N_MM.bw -o out/heatMapPlot_N.gz --binSize 200
/sibcb/program/install/deeptools-3.2.1/bin/plotHeatmap -m out/heatMapPlot_N.gz --heatmapHeight 15 \
--dpi 300 --missingDataColor 1 --zMin 0 --zMax 1 --refPointLabel 'MHB center' -out heatMapPlot_N_deepTools.png
This function plots the average profiles of signals in the predefined intervals as well as their flanking regions. Multiple BED files can be specified using the ‘bedPaths’ parameter. It was inspired by plotProfile in in deeTools.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar profilePlot --bedPaths <first.bed second.bed> \
--bigwig <in.bw> [--upLength <upLength>] [--downLength <downLength>] [--windowNum <windowNum>] --tag <tag> \
--outFormat <outFormat>
Options:
-bedPaths,--bedPaths <arg> multi bed files path, one query region per line in a bed file
-bigwig,--bigwig <arg> the input bigwig file of a metrics from a sample
-upLength,--upLength <arg> the length of upstream region from the center point [2000]
-downLength,--downLength <arg> the length of downstream region from the center point [2000]
-windowNum,--windowNum <arg> the window number of the core region [10]
-tag,--tag <arg> prefix of the output file(s)
-outFormat,--outFormat <arg> output format,pdf or png [pdf]
java -jar mHapSuite-2.0-jar-with-dependencies.jar profilePlot -bedPaths BED/ESCC_Hyper_Promoter.BED \
-upLength 20000 -downLength 20000 -windowNum 10 -bigwig bw/T/esophagus_T_MM.bw \
BED/ESCC_Hypo_Promoter.BED -tag profilePlot_T
We can also generate similar heatmap using deepTools.
/sibcb/program/install/deeptools-3.2.1/bin/computeMatrix scale-regions \
-R BED/ESCC_Hyper_Promoter.BED BED/ESCC_Hypo_Promoter.BED -S bw/T/esophagus_T_MM.bw \
--startLabel START --endLabel END --beforeRegionStartLength 2000 --afterRegionStartLength 2000 \
--binSize 100 -o out/profilePlot_T.gz
/sibcb/program/install/deeptools-3.2.1/bin/plotProfile -m out/profilePlot_T.gz \
--dpi 300 -out profilePlot_scaled_deepTools.png
This function shows the perventage of genomic features that overlap with predefined open chromatin regions.
usage: java -jar mHapSuite-2.0-jar-with-dependencies.jar enrichmentPlot --bedPaths <first.bed second.bed> \
--openChromatin <openChromatin.bed> --bigwig <in.bw> [--groupNum <groupNum>] [--groupCutoff <groupCutoff>] \
--tag <tag> --outFormat <outFormat>
Options:
-bedPaths,--bedPaths <arg> multi bed files path, one query region per line in a bed file
-openChromatin,--openChromatin <arg> a open chromatin regions file, in BED format
-bigwig,--bigwig <arg> the input bigwig file of a metrics from a sample
-groupNum,--groupNum <arg> the group number of the metrics value [10]
-groupCutoff,--groupCutoff <arg> the minimum valid value number in the group [100]
-tag,--tag <arg> prefix of the output file(s)
-outFormat,--outFormat <arg> output format,pdf or png [pdf]
java -jar mHapSuite-2.0-jar-with-dependencies.jar enrichmentPlot \
--bedPaths out/esophagus_N_MHB.txt BED/esophagus_UMRs.bed \
--openChromatin BED/esophagus_open.bed \
BED/esophagus_LMRs.bed BED/esophagus_PMDs.bed BED/esophagus_HMRs.bed --bigwig bw/N/esophagus_N_MM.bw --tag esophagus_tissue --outFormat pdf
If you use mHapSuite, please cite the following reference:
Yi Ding#, Kangwen Cai#, Leiqin Liu, Zhiqiang Zhang, Xiaoqi Zheng* and Jiantao Shi* mHapTk: A comprehensive toolkit for the analysis of DNA methylation haplotypes. Bioinformatics 2022, 38(22):5141-5143 [Pubmed]