Tools for DNA methylation haplotype analysis

mHapTools wad designed to be easy to use and it has rich build-in help documents. There are 4 main functions: convert, merge, beta and summary.

mHapTools 0.10 (Tools for analysing methylated reads)
Usage:    mhaptools <command> [options]

Commands:
    convert     SAM/BAM --> mhap conversion
    merge       merge multiple sorted mhap.gz files into a single mhap.gz file
    beta        count methylated reads on CpG positions
    summary     get local or global Summarized Information

BAM to mHap

We developed a new file format, mHap, for efficient storage and manipulation. This is the main function of mHapTools, which takes an indexed BAM file as input and extract DNA methylation haplotypes.

Usage: mhaptools convert -i <in.bam>|<in.sam> -c <CpG.gz> [-r chr:start-end | -b bed_file.bed ] [-n] [-o name.mhap.gz]
Options:
  -i --input  input file, SAM/BAM format, sorted by samtools
  -c --cpg    str  genomic CpG file, gz format and Indexed
  -r --region str  region
  -b --bed    only include CpG overlapping this BED FILE, which must be sorted [null]
  -n --non-directional Non-directional libraries. Default: OFF.
  -o --output str  output file name [out.mhap.gz]

Here are some examples. Convert the entire SAM/BAM file to mHap format:

mhaptools convert -i in.bam -c CpG.gz

Convert the SAM/BAM file to mHap format within a region:

mhaptools convert -i in.bam -c CpG.gz -r chr1:2000-200000

Convert the SAM/BAM file to mHap format within several regions:

mhaptools convert -i in.bam -c CpG.gz -b bed_file.bed

The mHap format is a simple yet powerful way to represent DNA methylation haplotypes. It is a genomic position-based format with six columns, including chr, start, end, DNA methylation haplotypes (methylation state), reads count and strand. DNA methylation haplotypes were represented as strings with only 0 and 1, and 1 indicates methylated CpG site and 0 indicates un-methylated CpG sites.

<chr> <start> <end> <methylation_state> <reads count> <strand>
chr1    10493   10579   11110111        1       -
chr1    10525   10525   1       5       +
chr1    10525   10542   11      5       +
chr1    10525   10563   111     2       +

The mHap file is genomic position-based, and could be indexed by Tabix for random access.

cat out.mhap |sort -k1,1 -k2,2n | bgzip > out.mhap.gz
tabix -b 2 -e 3 -p bed out.mhap.gz

mHap to CpG site summary

Traditionally, DNA methylation is measured at each CpG site level. From DNA methylation haplotypes, we could easily get CpG site summary.

Usage: mhaptools beta -i <in.mhap.gz> -c <CpG.gz> [-b bed_file.bed] [-s] [-o name.txt]
Options:
  -i --input   str  input file, mhap.gz format
  -c --cpg     str  genomic CpG file, gz format and Indexed
  -b --bed     str  bed file, contains query regions
  -s --stranded  flag group results by the direction of mhap reads
  -o --output  str  output file name [beta.txt]

Here are some examples. Get mean methylation summary for all CpG sites:

mhaptools beta -i in.mhap.gz -c CpG.gz

Get mean methylation summary for watson and crick seperately across all CpG sites:

mhaptools beta -i in.mhap.gz -c CpG.gz -s

Get mean methylation summary for CpG sites that overlap with regions in a BED file:

  mhaptools beta -i in.mhap.gz -c CpG.gz -b bed_file.bed

The mHapTools beta output format is similar as that generated by Bismark or BSMAP methylation extractor.

<chr> <start> <end> <methylation-level> <methylated Cs> <unmethylated Cs> <strand>
chr1    10493   10494   100     1       0       +
chr1    10497   10498   96      148     6       +
chr1    10525   10526   94      286     17      -
chr1    10542   10543   94      283     18      +
chr1    10563   10564   93      277     19      -

Merge multiple mHap files

This function merges multiple sorted and gziped mHap files, then outputs a single sorted and gziped output file that contains all the input records.

Usage: mhaptools merge -i <in1.mhap.gz in2.mhap.gz in3.mhap.gz> -c <CpG.gz> [-o name.mhap.gz]
Options:
  -i --input  str  List of input mhap.gz files
  -c --cpg    str  genomic CpG file, gz format and Indexed
  -o --output str  output file name [out.mhap.gz]

Example of usage:

mhaptools merge -i in1.mhap.gz in2.mhap.gz in3.mhap.gz -c CpG.gz

The output file out.mhap.gz can be further indexed:

tabix -b 2 -e 3 -p bed out.mhap.gz

Summary of discordant methylation

Usage: mhaptools summary -i <in.mhap.gz> -c <CpG.gz> [-r chr:start-end | -b bed_file.bed ] | [-g] [-s] [-o name.txt]
Options:
  -i  str  input file, mhap.gz format
  -c  str  CpG file, gz format
  -r  str  region, e.g. chr1:2000-200000
  -b  str  bed file, one query region per line
  -g  flag get genome-wide result
  -s  flag group results by the direction of mhap reads
  -o  str  output filename [genome_wide.txt | summary.txt]

This function summerizes discordant methylation in pre-defined regions or genome wide. The output file is a genomic position-based format with 9 columns, including chr, start, end, strand,nReads, mBase,tBase, K4plus and nDR.

column Description
nReads the number of mapped reads
mBase the methylated CpGs within mapped reads
tBase total number of CpGs within mapped reads
K4plus the number of mapped reads with at least 4 CpGs
nDR the number of reads with discordant methylation status amomg all reads defined by K4plus
<chr> <start> <end> <strand> <nReads> <mBase> <tBase> <K4plus> <nDR>
chr1    762415  763445  *       38      0       106     0       0
chr1    805197  805628  *       65      4       186     19      0
chr1    839693  840619  *       197     6       945     113     0
chr1    844298  845883  *       214     173     535     39      16
chr1    858969  861632  *       541     59      1932    255     12
chr1    869331  871872  *       194     392     453     20      7
chr1    875729  878363  *       580     89      2048    223     11

Get summary within a region:

mhaptools summary -i in.mhap.gz -c CpG.gz -r chr1:2000-200000

Get summary for several regions defined in a BED file:

mhaptools summary -i in.mhap.gz -c CpG.gz -b bed_file.bed