Bulk RNA-seq的比对需要参考基因组、FASTQ文件,我们可以通过建立软链接的方式准备这些文件:
# 链接参考基因组
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/idx idx
# 链接FASTQ文件
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/FASTQ FASTQ
# 链接trimmed FASTQ文件
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/trimmed trimmed
wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa
在处理自己数据的时候,也需要去除其他已知接头序列。
/sibcb1/bioinformatics/training_shared/software/fastp \
-i FASTQ/ctrl1_1.fastq.gz -I FASTQ/ctrl1_2.fastq.gz \
-o trimmed/ctrl1_1.fastq.gz -O trimmed/ctrl1_2.fastq.gz \
-l 25 \
-w 12 \
--adapter_fasta FASTQ/illumina_multiplex.fa \
--json trimmed/ctrl1.json \
--html trimmed/ctrl1.html \
2>trimmed/ctrl1.fastp.log
其中’-l 25’设置的是最短的长度,‘-w 12’设置的是线程数,’–adapter_fasta’设置的是接头序列文件。
samples=("ctrl1" "ctrl2" "ctrl3" "stroke1" "stroke2" "stroke3")
for i in ${samples[@]}; do
echo $i
/sibcb1/bioinformatics/training_shared/software/fastp \
-i FASTQ/$i\_1.fastq.gz -I FASTQ/$i\_2.fastq.gz \
-o trimmed/$i\_1.fastq.gz -O trimmed/$i\_2.fastq.gz \
-l 25 \
-w 12 \
--adapter_fasta FASTQ/illumina_multiplex.fa \
--json trimmed/$i.json \
--html trimmed/$i.html \
2>trimmed/$i.fastp.log
done
处理完成后,我们会得到如下文件:
trimmed
├── ctrl1_1.fastq.gz
├── ctrl1_2.fastq.gz
├── ctrl1.fastp.log
├── ctrl1.html
├── ctrl1.json
├── ctrl2_1.fastq.gz
├── ctrl2_2.fastq.gz
├── ctrl2.fastp.log
├── ctrl2.html
├── ctrl2.json
├── ctrl3_1.fastq.gz
├── ctrl3_2.fastq.gz
├── ctrl3.fastp.log
├── ctrl3.html
├── ctrl3.json
├── stroke1_1.fastq.gz
├── stroke1_2.fastq.gz
├── stroke1.fastp.log
├── stroke1.html
├── stroke1.json
├── stroke2_1.fastq.gz
├── stroke2_2.fastq.gz
├── stroke2.fastp.log
├── stroke2.html
├── stroke2.json
├── stroke3_1.fastq.gz
├── stroke3_2.fastq.gz
├── stroke3.fastp.log
├── stroke3.html
└── stroke3.json
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh38.primary_assembly.genome.fa.gz
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.primary_assembly.annotation.gtf.gz
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz
/sibcb/program/install/star-2.7.11b/STAR \
--runThreadN 24 \
--runMode genomeGenerate \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbGTFfile gencode.v34.primary_assembly.annotation.gtf \
--genomeDir star
上述命令运行大致30分钟,生成的index文件保存在star
目录下,有如下文件:
star
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab
/sibcb/program/install/star-2.7.11b/STAR \
--runThreadN 12 \
--genomeDir idx/star \
--quantMode GeneCounts \
--readFilesIn trimmed/ctrl1_1.fastq.gz trimmed/ctrl1_2.fastq.gz \
--twopassMode Basic \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix alignments/ctrl1
产生的输出文件:
├── ctrl1Aligned.sortedByCoord.out.bam
├── ctrl1Log.final.out
├── ctrl1Log.out
├── ctrl1Log.progress.out
├── ctrl1ReadsPerGene.out.tab
├── ctrl1SJ.out.tab
├── ctrl1_STARgenome
├── ctrl1_STARpass1
├── ctrl1_STARtm
产生的质控文件ctrl1Log.final.out
:
Started job on | Nov 05 12:19:03
Started mapping on | Nov 05 12:29:52
Finished on | Nov 05 12:49:52
Mapping speed, Million of reads per hour | 159.57
Number of input reads | 53190364
Average input read length | 299
UNIQUE READS:
Uniquely mapped reads number | 49132175
Uniquely mapped reads % | 92.37%
Average mapped length | 296.78
Number of splices: Total | 29376935
Number of splices: Annotated (sjdb) | 29376360
Number of splices: GT/AG | 29158865
Number of splices: GC/AG | 166274
Number of splices: AT/AC | 22033
Number of splices: Non-canonical | 29763
Mismatch rate per base, % | 0.60%
Deletion rate per base | 0.01%
Deletion average length | 1.92
Insertion rate per base | 0.02%
Insertion average length | 1.63
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 3018586
% of reads mapped to multiple loci | 5.68%
Number of reads mapped to too many loci | 21277
% of reads mapped to too many loci | 0.04%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 981456
% of reads unmapped: too short | 1.85%
Number of reads unmapped: other | 36870
% of reads unmapped: other | 0.07%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
产生的基因水平的计数文件ctrl1ReadsPerGene.out.tab
:
N_unmapped 1039603 1039603 1039603
N_multimapping 3018586 3018586 3018586
N_noFeature 15823849 45453461 17862679
N_ambiguous 4329897 232217 2593903
ENSG00000223972.5 61 8 54
ENSG00000227232.5 94 6 89
ENSG00000278267.1 0 0 0
ENSG00000243485.5 0 0 0
ENSG00000284332.1 0 0 0
ENSG00000237613.2 0 0 0
这四列的注释: - column 1: gene ID - column 2: counts for unstranded RNA-seq - column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes) - column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
一般情况下,我们的数据都是非链特异的,使用前两列即可。
samples=("ctrl1" "ctrl2" "ctrl3" "stroke1" "stroke2" "stroke3")
for i in ${samples[@]}; do
echo $i
/sibcb/program/install/star-2.7.11b/STAR \
--runThreadN 12 \
--genomeDir idx/star \
--quantMode GeneCounts \
--readFilesIn trimmed/$i\_1.fastq.gz trimmed/$i\_2.fastq.gz \
--twopassMode Basic \
--readFilesCommand zcat \
--outFilterType BySJout \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix alignments/$i
done
/sibcb/program/bin/featureCounts \
-T 12 \
-a idx/gencode.v34.primary_assembly.annotation.gtf \
-o featureCounts/featureCounts_table.txt \
-p -B -C -t exon -g gene_id --extraAttributes gene_name \
\
alignments/ctrl1Aligned.sortedByCoord.out.bam \
alignments/ctrl2Aligned.sortedByCoord.out.bam \
alignments/ctrl3Aligned.sortedByCoord.out.bam \
alignments/stroke1Aligned.sortedByCoord.out.bam \
alignments/stroke2Aligned.sortedByCoord.out.bam alignments/stroke3Aligned.sortedByCoord.out.bam
‘-p’: If specified, fragments (or templates) will be counted instead of reads. This option is only applicable for paired-end reads.
-B
: Only count read pairs that have both ends aligned.
‘-C’: Do not count read pairs that have their two ends mapping to different chromosomes or mapping to same chromosome but on different strands.
-t
: This option is for counting against your particular feature of interest. Most often we are interested in “Is my read in the exon coordinates range of a gene?” hence the default.
-g
: Specify attribute type in GTF annotation to use as the feature ID.
if you specify ‘-t transcript’ you can only group by ‘-g transcript_id’ or ‘-g gene_id’ (default) but you cannot group by ‘-g exon_id’. On the other hand if you specify ‘-t exon’ (default) you can group by ‘-g exon_id’ or ‘-g transcript_id’ or ‘-g gene_id’. And with three_prime_utr they only have transcript_id and gene_id attributes (make sense, because utr belongs to the transcript).
如果只计算exon的reads,那么设定为’-t exon’;如果要包括整个基因区间的reads,那么设定为’-t gene’。
运行后显示如下信息:
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v1.6.4
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 6 BAM files ||
|| P ctrl1Aligned.sortedByCoord.out.bam ||
|| P ctrl2Aligned.sortedByCoord.out.bam ||
|| P ctrl3Aligned.sortedByCoord.out.bam ||
|| P stroke1Aligned.sortedByCoord.out.bam ||
|| P stroke2Aligned.sortedByCoord.out.bam ||
|| P stroke3Aligned.sortedByCoord.out.bam ||
|| ||
|| Output file : featureCounts_table.txt ||
|| Summary : featureCounts_table.txt.summary ||
|| Annotation : gencode.v34.primary_assembly.annotation.gtf ... ||
|| Dir for temp files : featureCounts ||
|| ||
|| Threads : 12 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : not counted ||
|| Both ends mapped : required ||
|| ||
\\============================================================================//
运行时间大概20分钟,获得基因水平的计数矩阵:
# head featureCounts/featureCounts_table.txt | grep -v \#
.5 chr1 11869 14409 + 2541 DDX11L1 63 5 15 16 17 42
ENSG00000223972.5 chr1 14404 29570 - 15167 WASH7P 306 352 268 69 181 89
ENSG00000227232.1 chr1 17369 17436 - 68 MIR6859-1 0 0 0 0 0 0
ENSG00000278267.5 chr1 29554 31109 + 1556 MIR1302-2HG 0 0 0 0 0 0
ENSG00000243485.1 chr1 30366 30503 + 138 MIR1302-2 0 0 0 0 0 0
ENSG00000284332.2 chr1 34554 36081 - 1528 FAM138A 0 0 0 0 0 0
ENSG00000237613.3 chr1 52473 53312 + 840 OR4G4P 0 0 0 0 0 0
ENSG00000268020.2 chr1 57598 64116 + 6519 OR4G11P 0 0 0 0 0 0 ENSG00000240361
下载转录本FASTA
文件:
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz
简化转录本的名字:
# gencode.v34.transcripts.fa.gz中原来的名字
>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
# gencode.v34.transcripts_lite.fa中简化后的名字
>ENST00000456328.2
# 获得转录本ID的注释,保存在tx2g_symbol.txt中
ENST00000456328.2 DDX11L1 ENSG00000223972.5
ENST00000450305.2 DDX11L1 ENSG00000223972.5
ENST00000488147.1 WASH7P ENSG00000227232.5
建立kallisto
使用的Index:
/sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/kallisto index \
-i transcriptome.idx gencode.v34.transcripts_lite.fa
使用kallisto
定量:
/sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/kallisto quant \
-t 12 \
-b 100 \
-i idx/kallisto/transcriptome.idx \
-o kallisto/ctrl1 \
trimmed/ctrl1_1.fastq.gz trimmed/ctrl1_2.fastq.gz
产生的结果文件:
# head kallisto/ctrl1/abundance.tsv
target_id length eff_length est_counts tpm.2 1657 1404.57 167.725 1.75935
ENST00000456328.2 632 379.827 0 0
ENST00000450305.1 1351 1098.57 48.9905 0.657027
ENST00000488147.1 68 27.431 0 0
ENST00000619216.1 712 459.699 0 0
ENST00000473358.1 535 283.053 0 0
ENST00000469289.1 138 65.0234 0 0
ENST00000607096.1 1187 934.569 0 0
ENST00000417324.1 590 337.928 99.0074 4.31661 ENST00000461467
批量运行所有的样本:
samples=("ctrl2" "ctrl3" "stroke1" "stroke2" "stroke3")
for id in ${samples[@]}; do
echo $id
trim1=trimmed/${id}_1.fastq.gz
trim2=trimmed/${id}_2.fastq.gz
/sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/kallisto quant \
-t 6 \
-b 100
-i idx/kallisto/transcriptome.idx \
-o kallisto/$id $trim1 $trim2
done
先建立index
:
/sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/salmon-latest_linux_x86_64/bin/salmon index \
-t idx/kallisto/gencode.v34.transcripts.fa \
-i idx/salmon/transcriptome.idx \
-k 31 --gencode
使用salmon
定量:
/sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/salmon-latest_linux_x86_64/bin/salmon quant \
--libType IU \
--threads 12 \
--numBootstraps 100 \
--index idx/salmon/transcriptome.idx \
--gcBias \
--output salmon/ctrl1 \
--mates1 trimmed/ctrl1_1.fastq.gz --mates2 trimmed/ctrl1_2.fastq.gz
其中libType
的设置,参考其文档: