Bioinformatics training course: epigenomics: ATAC-seq数据的Peak Calling

本节课我们主要学习如何使用MACS2来对ATAC-seq数据进行Peak Calling。

流程简介

上图描述了分析ATAC-seq的典型流程，本课程有少许改动，如使用fastp去接头，使用ChIPseeker进行peak注释等。

上机数据

我们将使用GSE85624这套数据，包括DUX positive (Treated)、DUX negative (Control) 两种状态，每个状态2个实验重复，总共4个样本。详情请见参考文献。

ls -ls FASTQ_atac | grep gz

1760468 -rwxr-xr-x 1 biocourse01 ods 1802718532 Oct 30  2023 Control_Rep1_R1.fq.gz
1803632 -rwxr-xr-x 1 biocourse01 ods 1846918423 Oct 30  2023 Control_Rep1_R2.fq.gz
2055646 -rwxr-xr-x 1 biocourse01 ods 2104981278 Oct 30  2023 Control_Rep2_R1.fq.gz
1962384 -rwxr-xr-x 1 biocourse01 ods 2009481184 Oct 30  2023 Control_Rep2_R2.fq.gz
2434786 -rwxr-xr-x 1 biocourse01 ods 2493220133 Oct 30  2023 Treated_Rep1_R1.fq.gz
2416517 -rwxr-xr-x 1 biocourse01 ods 2474512971 Oct 30  2023 Treated_Rep1_R2.fq.gz
2239970 -rwxr-xr-x 1 biocourse01 ods 2293728293 Oct 30  2023 Treated_Rep2_R1.fq.gz
2285733 -rwxr-xr-x 1 biocourse01 ods 2340590000 Oct 30  2023 Treated_Rep2_R2.fq.gz

建立所需文件的链接：

ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/FASTQ_atac .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/GRCm38 .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/epigenomics/output_atac .

数据预处理

示例代码

# 加载必要的环境
source /sibcb1/bioinformatics/training_shared/software/python3.6_profile

sample='Control_Rep1 Control_Rep2 Treated_Rep1 Treated_Rep2'
for sam in $sample; do
  # Alignment
  bowtie2 -p 8 -x GRCm38/Bowtie2_Idx/mm10 \
  -1 trimmed_atac/$sam\_R1.fq.gz -2 trimmed_atac/$sam\_R2.fq.gz \
  -t -N 1 -L 25 -X 2000 --no-mixed --no-discordant \
  -S output_atac/$sam\_bowtie.sam &> output_atac/$sam\_bowtie.log
  samtools view -@ 8 -bS output_atac/$sam\_bowtie.sam > output_atac/$sam\_bowtie.bam
  samtools sort -@ 8 output_atac/$sam\_bowtie.bam -o output_atac/$sam\_sorted.bam
  samtools index output_atac/$sam\_sorted.bam
    
  # Remove reads on chrM
  samtools view -b -q 2 output_atac/$sam\_sorted.bam \
  chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 \
  chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 \
  chrX chrY | samtools sort -@ 8 -o output_atac/$sam\_filtered.bam -
  
  # Remove duplicates
  java -jar /sibcb/program/install/picard/picard.jar MarkDuplicates \
      I=output_atac/$sam\_filtered.bam \
      O=output_atac/$sam\_dup.bam \
      REMOVE_DUPLICATES=true \
      M=output_atac/$sam\_dup_metrics.txt
  samtools index output_atac/$sam\_dup.bam
    
    # Check insert size distribution
  java -jar /sibcb/program/install/picard/picard.jar CollectInsertSizeMetrics \
      I=output_atac/$sam\_dup.bam \
      O=output_atac/$sam\_dup_inter_size.txt \
      H=output_atac/$sam\_dup_inter_size_histogram.pdf \
      M=0.5
  
  /sibcb/program/install/deeptools-3.2.1/bin/bamCoverage -p 8 --extendReads \
       --normalizeUsing RPKM -b output_atac/$sam\_dup.bam -o output_atac/$sam\_dup.bw
    
  #signal around TSS
  computeMatrix scale-regions -p 8 --binSize 50 -b 2000 -a 2000 -m 6000 -S output_atac/$sam\_dup.bw \
        -R mm10/gencode.vM25.annotation.gtf \
        -o output_atac/$sam\_dup.mat.gz
  plotProfile -m output_atac/$sam\_dup.mat.gz --samplesLabel $sam -o output_atac/$sam\_dup.mat.pdf
done

我们以一个样本（Control_Rep1）为例，展示一下关键的结果和分析报告。

比对

zcat FASTQ_atac/Control_Rep1_R1.fq.gz | wc -l

比对之前有22378376对Reads（44756752条）。

cat output_atac/Control_Rep1_bowtie.log

Time loading reference: 00:00:00
Time loading forward index: 00:00:01
Time loading mirror index: 00:00:00
Multiseed full-index search: 00:46:00
22062490 reads; of these:
  22062490 (100.00%) were paired; of these:
    11232250 (50.91%) aligned concordantly 0 times
    8528547 (38.66%) aligned concordantly exactly 1 time
    2301693 (10.43%) aligned concordantly >1 times
49.09% overall alignment rate
Time searching: 00:46:01
Overall time: 00:46:01

该ATAC-seq数据有相当部分的短片段，当seed长度设为25时，比对率大致为50%。比对后有\((8528547 + 2301693)*2 = 21660480\) Reads。

Read filtering

samtools view -q 2 -f 2 -c output_atac/Control_Rep1_filtered.bam

去掉chrM以后，得到18979414条Reads。

$samtools view -q 2 -f 2 -c output_atac/Control_Rep1_dup.bam

去掉duplicates之后，得到15779714 条Reads。

Insert size分布

可以看到，200之前的峰是未受核小体保护的开放区域，200左右的峰是单个核小体，400左右是跨过两个核小体的峰。

Signal around TSS

可以清楚地看到，TSS周围信号的富集。

Peak Calling

使用切点信息

macs2 callpeak -g mm -B -t output_atac/Control_Rep1_dup.bam output_atac/Control_Rep2_dup.bam \
    --nomodel --shift -100 --extsize 200 --nolambda \
    -n C --outdir macs2_atac -f BAM --keep-dup 1

macs2 callpeak -g mm -B -t output_atac/Treated_Rep1_dup.bam output_atac/Treated_Rep2_dup.bam \
    --nomodel --shift -100 --extsize 200 --nolambda \
    -n T --outdir macs2_atac -f BAM --keep-dup 1

macs2 bdgdiff --t1 macs2_atac/T_treat_pileup.bdg --t2 macs2_atac/C_treat_pileup.bdg \
    --c1 macs2_atac/T_control_lambda.bdg --c2 macs2_atac/C_control_lambda.bdg \
    --depth1 16874307 --depth2 17994984 --outdir macs2_atac \
    -l 500 -g 250 \
    -o T_unique.bed C_unique.bed TC_common.bed

每条Read的变换：向3’端移动100 bp，然后将总长度延伸至200 bp。注意：比对到正链的Reads是向前延伸，比对到负链的Reads是向后延伸。macs2 bdgdiff有两个参数需要设定，分别是--depth1和--depth2，代表两个条件的测序深度，可以从peaks.xls中得到。比较两个条件后，产生三个文件：

ls -la macs2_atac | grep -v summits | grep bed

-rw-r--r-- 1 biocourse01 ods       197711 Sep 11 14:13 C_unique.bed
-rw-r--r-- 1 biocourse01 ods       412527 Sep 11 14:13 TC_common.bed
-rw-r--r-- 1 biocourse01 ods       177027 Sep 11 14:13 T_unique.bed

使用热图查看信号分布：

computeMatrix reference-point -p 24 \
    -S output_atac/Control_Rep1_dup.bw output_atac/Control_Rep2_dup.bw output_atac/Treated_Rep1_dup.bw output_atac/Treated_Rep1_dup.bw \
    -R macs2_atac/T_unique.bed macs2_atac/C_unique.bed macs2_atac/TC_common.bed \
    --referencePoint center -a 5000 -b 5000 \
    -o macs2_atac/matrix.mat.gz

plotHeatmap -m macs2_atac/matrix.mat.gz \
    --colorMap Reds \
    --whatToShow "heatmap and colorbar" \
    --missingDataColor 'white' \
    --refPointLabel 'Peak' \
    --xAxisLabel 'Peak distance (bp)' \
    --regionsLabel "Treated gained" "Treated lost" "Common peaks" \
    -out macs2_atac/DIFF_Peak_heatmap.pdf

使用长度信息

macs2 callpeak -g mm -B -t output_atac/Control_Rep1_dup.bam output_atac/Control_Rep2_dup.bam \
    --nomodel --nolambda \
    -n C --outdir macs2_atac_PE -f BAMPE --keep-dup 1

macs2 callpeak -g mm -B -t output_atac/Treated_Rep1_dup.bam output_atac/Treated_Rep2_dup.bam \
    --nomodel --nolambda \
    -n T --outdir macs2_atac_PE -f BAMPE --keep-dup 1

macs2 bdgdiff --t1 macs2_atac_PE/T_treat_pileup.bdg --t2 macs2_atac_PE/C_treat_pileup.bdg \
    --c1 macs2_atac_PE/T_control_lambda.bdg --c2 macs2_atac_PE/C_control_lambda.bdg \
    --depth1 17505799 --depth2 18760557 --outdir macs2_atac_PE \
    -l 500 -g 250 \
    -o T_unique.bed C_unique.bed TC_common.bed

注意：macs2 callpeak这个命令中，我们使用了BAMPE参数，将每一个Paired Read作为一个片段。类似的方法也可以用在ChIP-seq数据分析中，ATAC-seq数据更多地希望看到切点的富集情况。