Bioinformatics training course: bulk RNA-seq: RNA-seq alignment

开始前的准备工作

Bulk RNA-seq的比对需要参考基因组、FASTQ文件，我们可以通过建立软链接的方式准备这些文件：


# 链接参考基因组
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/idx idx

# 链接FASTQ文件
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/FASTQ FASTQ

# 链接trimmed FASTQ文件
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/trimmed trimmed

使用fastp去除接头

下载测序接头序列

wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa

在处理自己数据的时候，也需要去除其他已知接头序列。

一个例子

/sibcb1/bioinformatics/training_shared/software/fastp \
    -i FASTQ/ctrl1_1.fastq.gz -I FASTQ/ctrl1_2.fastq.gz \
    -o trimmed/ctrl1_1.fastq.gz -O trimmed/ctrl1_2.fastq.gz \
    -l 25 \
  -w 12 \
    --adapter_fasta FASTQ/illumina_multiplex.fa \
    --json trimmed/ctrl1.json \
    --html trimmed/ctrl1.html \
    2>trimmed/ctrl1.fastp.log

其中’-l 25’设置的是最短的长度，‘-w 12’设置的是线程数，’–adapter_fasta’设置的是接头序列文件。

批量处理

samples=("ctrl1" "ctrl2" "ctrl3" "stroke1" "stroke2" "stroke3")
for i in ${samples[@]}; do
  echo $i
  /sibcb1/bioinformatics/training_shared/software/fastp \
    -i FASTQ/$i\_1.fastq.gz -I FASTQ/$i\_2.fastq.gz \
    -o trimmed/$i\_1.fastq.gz -O trimmed/$i\_2.fastq.gz \
    -l 25 \
    -w 12 \
    --adapter_fasta FASTQ/illumina_multiplex.fa \
    --json trimmed/$i.json \
    --html trimmed/$i.html \
    2>trimmed/$i.fastp.log
done

处理完成后，我们会得到如下文件：

trimmed
├── ctrl1_1.fastq.gz
├── ctrl1_2.fastq.gz
├── ctrl1.fastp.log
├── ctrl1.html
├── ctrl1.json
├── ctrl2_1.fastq.gz
├── ctrl2_2.fastq.gz
├── ctrl2.fastp.log
├── ctrl2.html
├── ctrl2.json
├── ctrl3_1.fastq.gz
├── ctrl3_2.fastq.gz
├── ctrl3.fastp.log
├── ctrl3.html
├── ctrl3.json
├── stroke1_1.fastq.gz
├── stroke1_2.fastq.gz
├── stroke1.fastp.log
├── stroke1.html
├── stroke1.json
├── stroke2_1.fastq.gz
├── stroke2_2.fastq.gz
├── stroke2.fastp.log
├── stroke2.html
├── stroke2.json
├── stroke3_1.fastq.gz
├── stroke3_2.fastq.gz
├── stroke3.fastp.log
├── stroke3.html
└── stroke3.json

为STAR建立index

下载参考基因组和GTF注释文件

wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh38.primary_assembly.genome.fa.gz
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.primary_assembly.annotation.gtf.gz
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz

建立STAR index

/sibcb/program/install/star-2.7.11b/STAR \
  --runThreadN 24 \
  --runMode genomeGenerate \
  --genomeFastaFiles GRCh38.primary_assembly.genome.fa \
  --sjdbGTFfile gencode.v34.primary_assembly.annotation.gtf \
  --genomeDir star

上述命令运行大致30分钟，生成的index文件保存在star目录下，有如下文件：

  star
  ├── chrLength.txt
  ├── chrNameLength.txt
  ├── chrName.txt
  ├── chrStart.txt
  ├── exonGeTrInfo.tab
  ├── exonInfo.tab
  ├── geneInfo.tab
  ├── Genome
  ├── genomeParameters.txt
  ├── SA
  ├── SAindex
  ├── sjdbInfo.txt
  ├── sjdbList.fromGTF.out.tab
  ├── sjdbList.out.tab
  └── transcriptInfo.tab

使用STAR比对

一个例子

/sibcb/program/install/star-2.7.11b/STAR \
  --runThreadN  12 \
  --genomeDir idx/star \
  --quantMode GeneCounts \
  --readFilesIn trimmed/ctrl1_1.fastq.gz trimmed/ctrl1_2.fastq.gz \
  --twopassMode Basic \
  --readFilesCommand zcat \
  --outFilterType BySJout \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix alignments/ctrl1

产生的输出文件：

├── ctrl1Aligned.sortedByCoord.out.bam
├── ctrl1Log.final.out
├── ctrl1Log.out
├── ctrl1Log.progress.out
├── ctrl1ReadsPerGene.out.tab
├── ctrl1SJ.out.tab
├── ctrl1_STARgenome
├── ctrl1_STARpass1
├── ctrl1_STARtm

产生的质控文件ctrl1Log.final.out:

                                 Started job on | Nov 05 12:19:03
                             Started mapping on | Nov 05 12:29:52
                                    Finished on | Nov 05 12:49:52
       Mapping speed, Million of reads per hour | 159.57

                          Number of input reads | 53190364
                      Average input read length | 299
                                    UNIQUE READS:
                   Uniquely mapped reads number | 49132175
                        Uniquely mapped reads % | 92.37%
                          Average mapped length | 296.78
                       Number of splices: Total | 29376935
            Number of splices: Annotated (sjdb) | 29376360
                       Number of splices: GT/AG | 29158865
                       Number of splices: GC/AG | 166274
                       Number of splices: AT/AC | 22033
               Number of splices: Non-canonical | 29763
                      Mismatch rate per base, % | 0.60%
                         Deletion rate per base | 0.01%
                        Deletion average length | 1.92
                        Insertion rate per base | 0.02%
                       Insertion average length | 1.63
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 3018586
             % of reads mapped to multiple loci | 5.68%
        Number of reads mapped to too many loci | 21277
             % of reads mapped to too many loci | 0.04%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches | 0
       % of reads unmapped: too many mismatches | 0.00%
            Number of reads unmapped: too short | 981456
                 % of reads unmapped: too short | 1.85%
                Number of reads unmapped: other | 36870
                     % of reads unmapped: other | 0.07%
                                  CHIMERIC READS:
                       Number of chimeric reads | 0
                            % of chimeric reads | 0.00%

产生的基因水平的计数文件ctrl1ReadsPerGene.out.tab:

N_unmapped  1039603 1039603 1039603
N_multimapping  3018586 3018586 3018586
N_noFeature 15823849  45453461  17862679
N_ambiguous 4329897 232217  2593903
ENSG00000223972.5 61  8 54
ENSG00000227232.5 94  6 89
ENSG00000278267.1 0 0 0
ENSG00000243485.5 0 0 0
ENSG00000284332.1 0 0 0
ENSG00000237613.2 0 0 0

这四列的注释： - column 1: gene ID - column 2: counts for unstranded RNA-seq - column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes) - column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

一般情况下，我们的数据都是非链特异的，使用前两列即可。

批量处理

samples=("ctrl1" "ctrl2" "ctrl3" "stroke1" "stroke2" "stroke3")
for i in ${samples[@]}; do  
  echo $i
  /sibcb/program/install/star-2.7.11b/STAR \
    --runThreadN  12 \
    --genomeDir idx/star \
    --quantMode GeneCounts \
    --readFilesIn trimmed/$i\_1.fastq.gz trimmed/$i\_2.fastq.gz \
    --twopassMode Basic \
    --readFilesCommand zcat \
    --outFilterType BySJout \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix alignments/$i
done

Counting Reads

featureCounts

/sibcb/program/bin/featureCounts \
  -T 12 \
  -a idx/gencode.v34.primary_assembly.annotation.gtf \
  -o featureCounts/featureCounts_table.txt \
  -p -B -C -t exon -g gene_id --extraAttributes gene_name \
  alignments/ctrl1Aligned.sortedByCoord.out.bam \
  alignments/ctrl2Aligned.sortedByCoord.out.bam \
  alignments/ctrl3Aligned.sortedByCoord.out.bam \
  alignments/stroke1Aligned.sortedByCoord.out.bam \
  alignments/stroke2Aligned.sortedByCoord.out.bam \
  alignments/stroke3Aligned.sortedByCoord.out.bam

‘-p’: If specified, fragments (or templates) will be counted instead of reads. This option is only applicable for paired-end reads. -B: Only count read pairs that have both ends aligned. ‘-C’: Do not count read pairs that have their two ends mapping to different chromosomes or mapping to same chromosome but on different strands. -t: This option is for counting against your particular feature of interest. Most often we are interested in “Is my read in the exon coordinates range of a gene?” hence the default. -g: Specify attribute type in GTF annotation to use as the feature ID.

if you specify ‘-t transcript’ you can only group by ‘-g transcript_id’ or ‘-g gene_id’ (default) but you cannot group by ‘-g exon_id’. On the other hand if you specify ‘-t exon’ (default) you can group by ‘-g exon_id’ or ‘-g transcript_id’ or ‘-g gene_id’. And with three_prime_utr they only have transcript_id and gene_id attributes (make sense, because utr belongs to the transcript).

如果只计算exon的reads，那么设定为’-t exon’；如果要包括整个基因区间的reads，那么设定为’-t gene’。

运行后显示如下信息：

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
          v1.6.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 6 BAM files                                      ||
||                           P ctrl1Aligned.sortedByCoord.out.bam             ||
||                           P ctrl2Aligned.sortedByCoord.out.bam             ||
||                           P ctrl3Aligned.sortedByCoord.out.bam             ||
||                           P stroke1Aligned.sortedByCoord.out.bam           ||
||                           P stroke2Aligned.sortedByCoord.out.bam           ||
||                           P stroke3Aligned.sortedByCoord.out.bam           ||
||                                                                            ||
||             Output file : featureCounts_table.txt                          ||
||                 Summary : featureCounts_table.txt.summary                  ||
||              Annotation : gencode.v34.primary_assembly.annotation.gtf  ... ||
||      Dir for temp files : featureCounts                                    ||
||                                                                            ||
||                 Threads : 12                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : required                                         ||
||                                                                            ||
\\============================================================================//

运行时间大概20分钟，获得基因水平的计数矩阵：

# head featureCounts/featureCounts_table.txt | grep -v \#
ENSG00000223972.5       chr1    11869   14409   +       2541    DDX11L1 63      5       15      16      17      42
ENSG00000227232.5       chr1    14404   29570   -       15167   WASH7P  306     352     268     69      181     89
ENSG00000278267.1       chr1    17369   17436   -       68      MIR6859-1       0       0       0       0       0       0
ENSG00000243485.5       chr1    29554   31109   +       1556    MIR1302-2HG     0       0       0       0       0       0
ENSG00000284332.1       chr1    30366   30503   +       138     MIR1302-2       0       0       0       0       0       0
ENSG00000237613.2       chr1    34554   36081   -       1528    FAM138A 0       0       0       0       0       0
ENSG00000268020.3       chr1    52473   53312   +       840     OR4G4P  0       0       0       0       0       0
ENSG00000240361.2       chr1    57598   64116   +       6519    OR4G11P 0       0       0       0       0       0

kallisto

下载转录本FASTA文件：

wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz

简化转录本的名字：

# gencode.v34.transcripts.fa.gz中原来的名字
>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|

# gencode.v34.transcripts_lite.fa中简化后的名字
>ENST00000456328.2

# 获得转录本ID的注释，保存在tx2g_symbol.txt中
ENST00000456328.2       DDX11L1 ENSG00000223972.5
ENST00000450305.2       DDX11L1 ENSG00000223972.5
ENST00000488147.1       WASH7P  ENSG00000227232.5

建立kallisto使用的Index：

/sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/kallisto index \
  -i transcriptome.idx gencode.v34.transcripts_lite.fa

使用kallisto定量：

/sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/kallisto quant \
  -t 12 \
  -b 100 \
  -i idx/kallisto/transcriptome.idx \
  -o kallisto/ctrl1 \
  trimmed/ctrl1_1.fastq.gz trimmed/ctrl1_2.fastq.gz

产生的结果文件：

# head kallisto/ctrl1/abundance.tsv

target_id       length  eff_length      est_counts      tpm
ENST00000456328.2       1657    1404.57 167.725 1.75935
ENST00000450305.2       632     379.827 0       0
ENST00000488147.1       1351    1098.57 48.9905 0.657027
ENST00000619216.1       68      27.431  0       0
ENST00000473358.1       712     459.699 0       0
ENST00000469289.1       535     283.053 0       0
ENST00000607096.1       138     65.0234 0       0
ENST00000417324.1       1187    934.569 0       0
ENST00000461467.1       590     337.928 99.0074 4.31661

批量运行所有的样本：

samples=("ctrl2" "ctrl3" "stroke1" "stroke2" "stroke3")

for id in ${samples[@]}; do
        echo $id
        trim1=trimmed/${id}_1.fastq.gz
        trim2=trimmed/${id}_2.fastq.gz

        /sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/kallisto quant \
        -t 6 \
        -b 100
        -i idx/kallisto/transcriptome.idx \
        -o kallisto/$id $trim1 $trim2
done

salmon

先建立index:

/sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/salmon-latest_linux_x86_64/bin/salmon index \
  -t idx/kallisto/gencode.v34.transcripts.fa \
  -i idx/salmon/transcriptome.idx \
  -k 31 --gencode

使用salmon定量：

/sibcb1/bioinformatics/biocourse/biocourse01/bulkRNAseq/software/salmon-latest_linux_x86_64/bin/salmon quant \
  --libType IU \
  --threads 12 \
  --numBootstraps 100 \
  --index idx/salmon/transcriptome.idx \
  --gcBias \
  --output salmon/ctrl1 \
  --mates1 trimmed/ctrl1_1.fastq.gz --mates2 trimmed/ctrl1_2.fastq.gz

其中libType的设置，参考其文档：