mHapTools wad designed to be easy to use and it has rich build-in help documents. There are 4 main functions: convert, merge, beta and summary.
mHapTools 0.10 (Tools for analysing methylated reads)
Usage: mhaptools <command> [options]
Commands:
convert SAM/BAM --> mhap conversion
merge merge multiple sorted mhap.gz files into a single mhap.gz file
beta count methylated reads on CpG positions
summary get local or global Summarized Information
We developed a new file format, mHap, for efficient storage and manipulation. This is the main function of mHapTools, which takes an indexed BAM file as input and extract DNA methylation haplotypes.
Usage: mhaptools convert -i <in.bam>|<in.sam> -c <CpG.gz> [-r chr:start-end | -b bed_file.bed ] [-n] [-o name.mhap.gz]
Options:
-i --input input file, SAM/BAM format, sorted by samtools
-c --cpg str genomic CpG file, gz format and Indexed
-r --region str region
-b --bed only include CpG overlapping this BED FILE, which must be sorted [null]
-n --non-directional Non-directional libraries. Default: OFF.
-o --output str output file name [out.mhap.gz]
Here are some examples. Convert the entire SAM/BAM file to mHap format:
mhaptools convert -i in.bam -c CpG.gz
Convert the SAM/BAM file to mHap format within a region:
mhaptools convert -i in.bam -c CpG.gz -r chr1:2000-200000
Convert the SAM/BAM file to mHap format within several regions:
mhaptools convert -i in.bam -c CpG.gz -b bed_file.bed
The mHap format is a simple yet powerful way to represent DNA methylation haplotypes. It is a genomic position-based format with six columns, including chr
, start
, end
, DNA methylation haplotypes (methylation state)
, reads count
and strand
. DNA methylation haplotypes were represented as strings with only 0 and 1, and 1 indicates methylated CpG site and 0 indicates un-methylated CpG sites.
<chr> <start> <end> <methylation_state> <reads count> <strand>
10493 10579 11110111 1 -
chr1 10525 10525 1 5 +
chr1 10525 10542 11 5 +
chr1 10525 10563 111 2 + chr1
The mHap file is genomic position-based, and could be indexed by Tabix for random access.
cat out.mhap |sort -k1,1 -k2,2n | bgzip > out.mhap.gz
tabix -b 2 -e 3 -p bed out.mhap.gz
Traditionally, DNA methylation is measured at each CpG site level. From DNA methylation haplotypes, we could easily get CpG site summary.
Usage: mhaptools beta -i <in.mhap.gz> -c <CpG.gz> [-b bed_file.bed] [-s] [-o name.txt]
Options:
-i --input str input file, mhap.gz format
-c --cpg str genomic CpG file, gz format and Indexed
-b --bed str bed file, contains query regions
-s --stranded flag group results by the direction of mhap reads
-o --output str output file name [beta.txt]
Here are some examples. Get mean methylation summary for all CpG sites:
mhaptools beta -i in.mhap.gz -c CpG.gz
Get mean methylation summary for watson and crick seperately across all CpG sites:
mhaptools beta -i in.mhap.gz -c CpG.gz -s
Get mean methylation summary for CpG sites that overlap with regions in a BED file:
mhaptools beta -i in.mhap.gz -c CpG.gz -b bed_file.bed
The mHapTools beta output format is similar as that generated by Bismark or BSMAP methylation extractor.
<chr> <start> <end> <methylation-level> <methylated Cs> <unmethylated Cs> <strand>
10493 10494 100 1 0 +
chr1 10497 10498 96 148 6 +
chr1 10525 10526 94 286 17 -
chr1 10542 10543 94 283 18 +
chr1 10563 10564 93 277 19 - chr1
This function merges multiple sorted and gziped mHap files, then outputs a single sorted and gziped output file that contains all the input records.
Usage: mhaptools merge -i <in1.mhap.gz in2.mhap.gz in3.mhap.gz> -c <CpG.gz> [-o name.mhap.gz]
Options:
-i --input str List of input mhap.gz files
-c --cpg str genomic CpG file, gz format and Indexed
-o --output str output file name [out.mhap.gz]
Example of usage:
mhaptools merge -i in1.mhap.gz in2.mhap.gz in3.mhap.gz -c CpG.gz
The output file out.mhap.gz can be further indexed:
tabix -b 2 -e 3 -p bed out.mhap.gz
Usage: mhaptools summary -i <in.mhap.gz> -c <CpG.gz> [-r chr:start-end | -b bed_file.bed ] | [-g] [-s] [-o name.txt]
Options:
-i str input file, mhap.gz format
-c str CpG file, gz format
-r str region, e.g. chr1:2000-200000
-b str bed file, one query region per line
-g flag get genome-wide result
-s flag group results by the direction of mhap reads
-o str output filename [genome_wide.txt | summary.txt]
This function summerizes discordant methylation in pre-defined regions or genome wide. The output file is a genomic position-based format with 9 columns, including chr
, start
, end
, strand
,nReads
, mBase
,tBase
, K4plus
and nDR
.
column Description nReads the number of mapped reads mBase the methylated CpGs within mapped reads tBase total number of CpGs within mapped reads K4plus the number of mapped reads with at least 4 CpGs nDR the number of reads with discordant methylation status amomg all reads defined by K4plus
<chr> <start> <end> <strand> <nReads> <mBase> <tBase> <K4plus> <nDR>
762415 763445 * 38 0 106 0 0
chr1 805197 805628 * 65 4 186 19 0
chr1 839693 840619 * 197 6 945 113 0
chr1 844298 845883 * 214 173 535 39 16
chr1 858969 861632 * 541 59 1932 255 12
chr1 869331 871872 * 194 392 453 20 7
chr1 875729 878363 * 580 89 2048 223 11 chr1
Get summary within a region:
mhaptools summary -i in.mhap.gz -c CpG.gz -r chr1:2000-200000
Get summary for several regions defined in a BED file:
mhaptools summary -i in.mhap.gz -c CpG.gz -b bed_file.bed