library("rmarkdown")
library("knitr")
library("formatR")
knitr::opts_chunk$set(echo = TRUE,
include = TRUE,
eval = FALSE,
comment = "")
# knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 60), tidy = TRUE)
For convenience of demonstration, we have downloaded a public whole-exome sequencing dataset, HG003. This dataset comes with benchmark SNP calls that we can compare our results against. You can use the 'rsync' command to sync these files into your own home directory.
rsync -avzh --progress /sibcb2/bioinformatics/Projects/Course/BWA/bam .
rsync -avzh --progress /sibcb2/bioinformatics/Projects/Course/BWA/benchmark .
First, we convert the BAM file back to FASTQ files.
java -Xmx16g -jar /sibcb2/bioinformatics/software/GATK/picard-2.26.10/picard.jar SamToFastq \
--INPUT bam/HG003.novaseq.wes_idt.100x.dedup.bam \
--FASTQ FASTQ/HG003_R1.fastq.gz \
--SECOND_END_FASTQ FASTQ/HG003_R2.fastq.gz \
--UNPAIRED_FASTQ FASTQ/HG003.fastq.gz
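To sanity-check the conversion, you can confirm that R1 and R2 contain the same number of reads (a FASTQ record is always 4 lines). A minimal sketch on a toy pair of files (the toy file names are made up for illustration; substitute the real FASTQ paths):

```shell
# Build a tiny paired FASTQ example (2 reads per file) to illustrate the check
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGC\n+\nIIII\n' | gzip > toy_R1.fastq.gz
printf '@r1\nTGCA\n+\nIIII\n@r2\nGCAA\n+\nIIII\n' | gzip > toy_R2.fastq.gz

# Each FASTQ record is 4 lines, so read count = line count / 4
n1=$(zcat toy_R1.fastq.gz | wc -l)
n2=$(zcat toy_R2.fastq.gz | wc -l)
echo "R1 reads: $((n1 / 4)), R2 reads: $((n2 / 4))"
# -> R1 reads: 2, R2 reads: 2
```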
If the command runs correctly, we obtain two FASTQ files, one for Read 1 and one for Read 2.
-rw-r--r-- 1 biocourse01 ods 1.2G Apr 8 11:15 HG003_R1.fastq.gz
-rw-r--r-- 1 biocourse01 ods 1.3G Apr 8 11:15 HG003_R2.fastq.gz
Check the sequence quality:
fastqc -o output FASTQ/HG003_R1.fastq.gz FASTQ/HG003_R2.fastq.gz
Build the BWA index from the reference FASTA file (you are not advised to run this step here):
bwa index -p GRCh38 GRCh38.fasta
samtools dict GRCh38.fasta > GRCh38.dict
samtools faidx GRCh38.fasta
bwa mem -t 16 \
-R '@RG\tID:HG003\tPL:ILLUMINA\tPU:NONE\tLB:HG003\tSM:HG003' \
/sibcb2/bioinformatics/Projects/Course/BWA/reference/GRCh38 \
FASTQ/HG003_R1.fastq.gz FASTQ/HG003_R2.fastq.gz \
2>output/bwa_HG003.log | samtools view -bS - -o output/HG003_bwa.bam
Sort by coordinate.
samtools sort -@ 4 -o output/HG003_sort.bam output/HG003_bwa.bam
Mark duplicate reads.
java -Xmx16g -jar /sibcb2/bioinformatics/software/GATK/picard-2.26.10/picard.jar MarkDuplicates \
--INPUT output/HG003_sort.bam \
--OUTPUT output/HG003_dup.bam \
--METRICS_FILE output/duplicate_metrics_HG003.txt \
--VALIDATION_STRINGENCY LENIENT
samtools index output/HG003_dup.bam
Inspect the alignment statistics.
samtools flagstat output/HG003_dup.bam
73763186 + 0 in total (QC-passed reads + QC-failed reads)
73543530 + 0 primary
0 + 0 secondary
219656 + 0 supplementary
11459791 + 0 duplicates
11459791 + 0 primary duplicates
73581224 + 0 mapped (99.75% : N/A)
73361568 + 0 primary mapped (99.75% : N/A)
73543530 + 0 paired in sequencing
36771765 + 0 read1
36771765 + 0 read2
72579058 + 0 properly paired (98.69% : N/A)
73282852 + 0 with itself and mate mapped
78716 + 0 singletons (0.11% : N/A)
385876 + 0 with mate mapped to a different chr
310526 + 0 with mate mapped to a different chr (mapQ>=5)
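From the flagstat counts above we can derive the duplication rate as primary duplicates divided by primary reads. A quick awk check, with the numbers copied from the output above:

```shell
# Duplication rate = primary duplicates / primary reads
awk 'BEGIN { printf "duplication rate: %.4f\n", 11459791 / 73543530 }'
# -> duplication rate: 0.1558
```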
Compare with the originally downloaded BAM:
samtools flagstat bam/HG003.novaseq.wes_idt.100x.dedup.bam
73763195 + 0 in total (QC-passed reads + QC-failed reads)
73543530 + 0 primary
0 + 0 secondary
219665 + 0 supplementary
11460110 + 0 duplicates
11460110 + 0 primary duplicates
73581237 + 0 mapped (99.75% : N/A)
73361572 + 0 primary mapped (99.75% : N/A)
73543530 + 0 paired in sequencing
36771765 + 0 read1
36771765 + 0 read2
72578704 + 0 properly paired (98.69% : N/A)
73282860 + 0 with itself and mate mapped
78712 + 0 singletons (0.11% : N/A)
385908 + 0 with mate mapped to a different chr
310575 + 0 with mate mapped to a different chr (mapQ>=5)
samtools view output/HG003_dup.bam | grep A00744:48:HV33LDSXX:3:2632:21531:360
Show the Read 1 sequence:
zcat FASTQ/HG003_R1.fastq.gz | grep -A3 A00744:48:HV33LDSXX:3:2632:21531:3602
The last 94 bases of this read align to chr1,943362,+,57S94M,60,0, and the last 65 bases align to chr1,943266,-,86S65M,60,0; both alignments have high mapping quality. We can use BLAT to examine how this read aligns.
As this shows, when a read aligns to the negative strand, the BAM file stores the reverse complement of its sequence. The BLAT result shows that although it is the first 65 bases of the original read that align to the negative strand (chr1,943266,-,86S65M,60,0), the BAM record presents them as the last 65 bases.
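The reverse-complement relationship can be reproduced directly in the shell with rev and tr; a minimal sketch on a made-up sequence:

```shell
# Reverse-complement a DNA string: reverse it, then swap complementary bases
seq="GATTACA"
printf '%s\n' "$seq" | rev | tr 'ACGT' 'TGCA'
# -> TGTAATC
```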
Note that raising the mapping-quality threshold cannot remove supplementary alignments.
samtools view -q 30 output/HG003_dup.bam | grep -e 'SA:Z:' | wc -l
The command above returns 231873.
The SA tag marks supplementary alignments, while the XA tag marks alternative alignments.
samtools view -q 5 -f 3 output/HG003_dup.bam | grep -e 'XA:Z:' | wc -l
samtools view output/HG003_dup.bam | grep A00744:48:HV33LDSXX:3:1210:31747:11851
The first command finds 5073821 such reads; one example, extracted by the second command, is shown below.
We used BLAT to confirm the position of its Read 2 and found that it can indeed align to multiple locations, with roughly similar numbers of mismatches. The position that was kept is at an appropriate distance from Read 1 (read mapped in proper pair), but because of the multiple alignments and the mismatches, its mapping quality is low.
Note that raising the mapping-quality threshold does not filter out all multi-mapped reads.
samtools view -q 30 -f 3 output/HG003_dup.bam | grep -e 'XA:Z:' | wc -l
The command above returns 4189900.
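The -f 3 filter used above keeps reads whose FLAG has both bit 0x1 (read paired) and bit 0x2 (read mapped in proper pair) set, while -q filters on the MAPQ column. Decoding a FLAG value can be sketched with shell bit arithmetic (FLAG 99 is a typical properly paired Read 1 on the forward strand):

```shell
# Decode a SAM FLAG value bit by bit (99 = 0x1 + 0x2 + 0x20 + 0x40)
flag=99
[ $((flag & 0x1))  -ne 0 ] && echo "paired"
[ $((flag & 0x2))  -ne 0 ] && echo "properly paired"
[ $((flag & 0x10)) -ne 0 ] && echo "read on reverse strand"
[ $((flag & 0x20)) -ne 0 ] && echo "mate on reverse strand"
[ $((flag & 0x40)) -ne 0 ] && echo "first in pair"
```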
samtools view -q 5 output/HG003_dup.bam | awk '$7 != "=" {print $0}' | wc -l
In total, 367085 reads have Read 1 and Read 2 aligned to different chromosomes; one example is shown below.
In this example, Read 1 aligns to chr1,16168,+,14S137M,22,0, while Read 2 aligns to chr7,132979905,+,117M34S,60,2. Note that Read 2 also has a supplementary alignment: chr2,113597218,-,51M100S,0,0.
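CIGAR strings like 14S137M above encode the alignment structure (S = soft-clipped, M = aligned). A small awk sketch that sums the M operations to obtain the number of aligned bases:

```shell
# Sum the M (aligned) operations in a CIGAR string
echo "14S137M" | awk '{
  n = ""
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    if (c ~ /[0-9]/) { n = n c }           # accumulate digits of the current run
    else { if (c == "M") m += n; n = "" }  # on an operator, add M run lengths
  }
  print m " aligned bases"
}'
# -> 137 aligned bases
```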
samtools view -hb -f 12 output/HG003_dup.bam -o output/HG003_unmapped.bam
samtools view output/HG003_unmapped.bam | wc -l
samtools flagstat output/HG003_unmapped.bam
103246 + 0 in total (QC-passed reads + QC-failed reads)
103246 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
103246 + 0 paired in sequencing
51623 + 0 read1
51623 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
In total, 103246 reads failed to align to the genome; one such pair is shown below:
> A00744:48:HV33LDSXX:3:1101:11713:6370
ATTTTCACTCTGCTCTTTCAATTTCATCAAGCGGATCTTTAGTTCTTCTCTTTCTGCAATAAGGGTGGTGTCTGCCTATCTGAGGTTATTGATATTTCTCCTGGCAATCTTGATTCCAACTTGTTTTATCCACCCCAGCATTTCAAATAAT
> A00744:48:HV33LDSXX:3:1101:11713:6370
AGACCCTGATCCTAGAAGAGATTGAAGGCATGAAGAGAAGGGGACCACAGAGGATGAGATGGTTGGATGGTATCACTGACTCAATGGACATAAGTTTGAGTAAATTCAGGGAGTTGCTAATGGACAGGGAAGCCAGGCTTGCTTCAGTCCA
Aligning these with BLAT against the human genome yields no hits; BLAST against multiple species shows that these unmapped reads align with high quality to the Bos taurus genome:
First, generate interval list files with headers:
/sibcb2/bioinformatics/software/GATK/gatk-4.2.4.1/gatk BedToIntervalList \
--SEQUENCE_DICTIONARY output/HG003_dup.bam \
--INPUT benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
--OUTPUT output/HG003_benchmark.interval_list
/sibcb2/bioinformatics/software/GATK/gatk-4.2.4.1/gatk BedToIntervalList \
--SEQUENCE_DICTIONARY output/HG003_dup.bam \
--INPUT bam/idt_capture_novogene.grch38.bed \
--OUTPUT output/HG003_WES.interval_list
/sibcb2/bioinformatics/software/GATK/gatk-4.2.4.1/gatk --java-options "-Xmx32G" CollectHsMetrics \
--BAIT_INTERVALS output/HG003_WES.interval_list \
--TARGET_INTERVALS output/HG003_WES.interval_list \
--INPUT output/HG003_dup.bam \
--OUTPUT output/HG003_HsMetrics.tsv
Inspect the resulting HG003_HsMetrics.tsv file:
tx = readLines("output/HG003_HsMetrics.tsv")
flag = grep("BAIT_SET", tx)
name = strsplit(tx[[flag]], '\t')[[1]]
value = strsplit(tx[[flag+1]], '\t')[[1]]
name = name[1:length(value)]
out = data.frame(name, value)
out[1:14, ]
1 BAIT_SET HG003_WES
2 BAIT_TERRITORY 38860924
3 BAIT_DESIGN_EFFICIENCY 1
4 ON_BAIT_BASES 6043526301
5 NEAR_BAIT_BASES 3915226859
6 OFF_BAIT_BASES 1020455025
7 PCT_SELECTED_BASES 0.907056 (capture efficiency)
8 PCT_OFF_BAIT 0.092944
9 ON_BAIT_VS_SELECTED 0.606856
10 MEAN_BAIT_COVERAGE 155.516794 (mean bait coverage)
11 PCT_USABLE_BASES_ON_BAIT 0.544213
12 PCT_USABLE_BASES_ON_TARGET 0.344235
13 FOLD_ENRICHMENT 43.909361
14 HS_LIBRARY_SIZE 93292336
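Several of these metrics can be recomputed from one another; for example, PCT_SELECTED_BASES is (ON_BAIT_BASES + NEAR_BAIT_BASES) divided by all aligned bases. A quick awk check with the numbers from the table above:

```shell
# PCT_SELECTED_BASES = (on-bait + near-bait) / (on-bait + near-bait + off-bait)
awk 'BEGIN {
  on = 6043526301; near = 3915226859; off = 1020455025
  printf "PCT_SELECTED_BASES: %.6f\n", (on + near) / (on + near + off)
}'
# -> PCT_SELECTED_BASES: 0.907056
```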
Compute the coverage of the targeted regions (this takes a while to run). Note: by default, duplicate-marked and non-primary alignments are not counted.
/sibcb2/bioinformatics/software/GATK/gatk-4.2.4.1/gatk --java-options "-Xmx16G" DepthOfCoverage \
--input output/HG003_dup.bam \
--intervals output/HG003_WES.interval_list \
--reference /sibcb2/bioinformatics/Projects/Course/BWA/reference/GRCh38.fasta \
--summary-coverage-threshold 5 --summary-coverage-threshold 10 \
--omit-depth-output-at-each-base \
--output output/HG003_cov
The analysis above produces the following files.
-rw-r--r-- 1 biocourse01 ods 7880 Apr 10 10:49 HG003_cov.sample_cumulative_coverage_counts
-rw-r--r-- 1 biocourse01 ods 6416 Apr 10 10:49 HG003_cov.sample_cumulative_coverage_proportions
-rw-r--r-- 1 biocourse01 ods 7636 Apr 10 10:49 HG003_cov.sample_interval_statistics
-rw-r--r-- 1 biocourse01 ods 30641425 Apr 10 10:49 HG003_cov.sample_interval_summary
-rw-r--r-- 1 biocourse01 ods 11351 Apr 10 10:49 HG003_cov.sample_statistics
-rw-r--r-- 1 biocourse01 ods 196 Apr 10 10:49 HG003_cov.sample_summary
Inspect the per-interval coverage file HG003_cov.sample_interval_summary, which can be used to screen for intervals with low coverage; it has the following columns:
Target
total_coverage
average_coverage
HG003_total_cvg
HG003_mean_cvg
HG003_granular_Q1
HG003_granular_median
HG003_granular_Q3
HG003_%_above_5
HG003_%_above_10
From this file, we can compute the mean coverage of the capture intervals as sum(total_coverage) / target size.
# R script
library("GenomicRanges")  # provides GRanges() and width()
summaryFile = "output/HG003_cov.sample_interval_summary"
tx = read.table(file = summaryFile, row.names = NULL, header = TRUE, sep = ",")
gr = GRanges(as.character(tx$Target))
gr$total_coverage = as.numeric(tx$total_coverage)
depth = sum(gr$total_coverage)/sum(as.numeric(width(gr)))
The computed result is depth = 130.4008, which is consistent with the value reported in the HG003_cov.sample_summary file.
sample_id,total,mean,granular_third_quartile,granular_median,granular_first_quartile,%_bases_above_5,%_bases_above_10
HG003,5067495806,130.65,161,115,80,99.9,99.8
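The same average can be checked with simple awk arithmetic, dividing the total coverage by the target size (38860924 bp, the BAIT_TERRITORY reported earlier):

```shell
# Mean coverage = total coverage over the target / target size in bp
awk 'BEGIN { printf "mean coverage: %.4f\n", 5067495806 / 38860924 }'
# -> mean coverage: 130.4008
```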
cumulativeFile = "output/HG003_cov.sample_cumulative_coverage_counts"
tx = read.table(file = cumulativeFile, row.names = 1, header = TRUE, sep = ",")
cov = as.numeric(sapply(strsplit(colnames(tx), "_"), function(z) z[2]))
nov = cov/130.4008
Frac = as.numeric(tx[1,])/as.numeric(tx[1,1])
out = data.frame(cov, nov, Frac)
plot(nov, Frac, type = 'l', lwd = 2, col = 'red', main = 'Full range',
xlab = "Normalized coverage", ylab = "Fraction of bases")
cumulativeFile = "output/HG003_cov.sample_cumulative_coverage_counts"
tx = read.table(file = cumulativeFile, row.names = 1, header = TRUE, sep = ",")
cov = as.numeric(sapply(strsplit(colnames(tx), "_"), function(z) z[2]))
nov = cov/130.4008
Frac = as.numeric(tx[1,])/as.numeric(tx[1,1])
out = data.frame(cov, nov, Frac)
plot(nov[1:50], Frac[1:50], type = 'l', lwd = 2, col = 'red', main = 'Top range',
xlab = "Normalized coverage", ylab = "Fraction of bases")