Bioinformatics training course: scRNA-seq: Alignment of scRNA-seq data with STAR and CellRanger

全局设定

我们将使用系统内安装的R：

# bash命令
export PATH="/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/envs/scRNA/bin:$PATH"
# source /sibcb/program/install/r-4.0/profile
R

建立数据软链接：

ln -s /sibcb1/bioinformatics/biocourse/biocourse01/STARsolo/Fastq .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/STARsolo/refdata-gex-mm10-2020-A .
ln -s /sibcb1/bioinformatics/training_shared/database/scRNA/Barcodes .
ln -s /sibcb1/bioinformatics/biocourse/biocourse01/STARsolo/GTF .

加载R包：

# .libPaths('/sibcb1/bioinformatics/training_shared/software/rlib/r-4.0_lib')
# library("reticulate")
library("dplyr")
library(Seurat)

测试数据

/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/bin/tree Fastq/
  Fastq/
  ├── 1k_mouse_kidney_CNIK_3pv3_S1_L004_I1_001.fastq.gz
  ├── 1k_mouse_kidney_CNIK_3pv3_S1_L004_I2_001.fastq.gz
  ├── 1k_mouse_kidney_CNIK_3pv3_S1_L004_R1_001.fastq.gz
  └── 1k_mouse_kidney_CNIK_3pv3_S1_L004_R2_001.fastq.gz

上图描述的是10x Reagent Kits v3.1的文库结构(Source:10X Genomics)：其中I1、I2分别是样本Index；R1包含了细胞的Barcode（16bp）和RNA分子的UMI（12bp）；R2是RNA分子序列。

CellRanger

构建参考基因组

使用CellRanger的mkref来构建参考基因组：

/sibcb1/bioinformatics/training_shared/software/cellranger-7.0.0/cellranger mkref \
  --genome=refdata-gex-mm10-2020-A \
  --fasta=/sibcb1/bioinformatics/training_shared/database/scRNA/refdata-gex-mm10-2020-A/fasta/genome.fa \
  --genes=/sibcb1/bioinformatics/training_shared/database/scRNA/refdata-gex-mm10-2020-A/genes/genes.gtf \
  --nthreads=24

产生的结果文件如下：

/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/bin/tree refdata-gex-mm10-2020-A

refdata-gex-mm10-2020-A
├── fasta
│   ├── genome.fa
│   └── genome.fa.fai
├── genes
│   └── genes.gtf.gz
├── reference.json
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab

4 directories, 19 files

比对

CellRanger的基本方法如下(Source: 10x Genomics)：

使用如下类似的命令进行比对：

/sibcb1/bioinformatics/training_shared/software/cellranger-7.0.0/cellranger count \
    --id=Mouse1k \
    --transcriptome=refdata-gex-mm10-2020-A/ \
    --fastqs=Fastq \
    --sample=1k_mouse_kidney_CNIK_3pv3 \
    --localcores=12 \
    --localmem=64 \
    --include-introns true # 10x公司的推荐设置

注意：具有相同BC和UMI的reads只计算一次 (Macosko et al., 2015 Cell)：

注意：从Cell Ranger v7.0开始，Intronicreads are counted by default，--include-intron设置为true。产生的结果文件如下：


参与后续分析的文件保存在`filtered_feature_bc_matrix`:

Mouse1k/outs/filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz

结果报告

除了上述的结果文件，CellRanger还会产生html形式的报告，比如结果报告。

STARsolo

构建索引

/sibcb/program/install/star-2.7.11b/STAR \
    --runThreadN 24 \
    --runMode genomeGenerate \
    --genomeFastaFiles refdata-gex-mm10-2020-A/fasta/genome.fa \
    --sjdbGTFfile refdata-gex-mm10-2020-A/genes/genes.gtf \
    --genomeDir star

比对

/sibcb/program/install/star-2.7.11b/STAR \
  --runThreadN 4 \
  --genomeDir star \
  --readFilesIn Fastq/1k_mouse_kidney_CNIK_3pv3_S1_L004_R2_001.fastq.gz \
    Fastq/1k_mouse_kidney_CNIK_3pv3_S1_L004_R1_001.fastq.gz \
  --readFilesCommand zcat \
  --soloFeatures GeneFull \
  --outSAMattributes CB UB \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix M1K_ \
  --outFilterScoreMin 30 \
  --soloType CB_UMI_Simple \
  --soloCBstart 1 --soloCBlen 16 \
  --soloUMIstart 17 --soloUMIlen 12 \
  --clipAdapterType CellRanger4 \
  --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
  --soloUMIdedup 1MM_CR \
  --soloUMIfiltering MultiGeneUMI_CR \
  --soloCBwhitelist Barcodes/10X_V3_3M-february-2018.txt \
  --soloCellFilter EmptyDrops_CR

经过测试，在sibcb1上运行的时候，不能设置太高的线程（比如24），否则可能返回无法创建文件的错误。其中的一些参数解释如下：

outSAMattributes定义保留到BAM文件中的属性，主要包括:

 NH (Number of Hits):表示一条读段在基因组上的比对位置数量,用于衡量reads的多重比对情况
 HI (Hit Index):当前比对结果是该reads的第几个比对位置,与NH 配合使用
 AS (Alignment Score):比对得分,反映比对质量的数值评分
 nM (Number of Mismatches):错配数量,用于评估比对准确性
 rG (Read Group):reads所属的分组信息,用于区分不同的样本或测序批次
 UQ (Unique Mapping Quality):unique mapping的质量分数,反映比对的唯一性程度
 GX (Gene ID):基因ID标识符,用于RNA-seq分析中的基因定量
 GN (Gene Name):基因名称,方便可读性的基因注释信息
 CB (Cell Barcode):单细胞测序中的细胞条形码序列,用于区分不同细胞来源的reads
 UB (UMI Barcode):UMI(Unique Molecular Identifier)序列,用于去除PCR重复

clipAdapterType表示依照什么规则去除adapters，这里选择CellRanger4规则，模拟CellRanger 4.x的接头剪切算法，专门针对10x Genomics数据优化。

soloUMIdedup用于设置UMI（Unique Molecular Identifier）的去重复策略，设定成1M_CR，模拟CellRanger的UMI去重复算法，允许1个碱基错配。

soloUMIfiltering设定UMI的过滤策略，这里设定为MultiGeneUMI_CR,带有N和homopolymers的UMI会被过滤掉(CellRanger 2.2.0)；另外，比对到多个基因的UMI也会被过滤掉。

/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/bin/tree runOut/M1K_Solo.out/

runOut/M1K_Solo.out/
├── Barcodes.stats
└── GeneFull
    ├── Features.stats
    ├── filtered
    │   ├── barcodes.tsv.gz
    │   ├── features.tsv.gz
    │   └── matrix.mtx.gz
    ├── raw
    │   ├── barcodes.tsv
    │   ├── features.tsv
    │   └── matrix.mtx
    ├── Summary.csv
    └── UMIperCellSorted.txt

4 directories, 10 files

比较两种方法的结果

d1 = Read10X(data.dir = "Mouse1k/outs/filtered_feature_bc_matrix")
d2 = Read10X(data.dir = "runOut/M1K_Solo.out/GeneFull/filtered")

# 检测到基因数和细胞数
dim(d1)

[1] 32285  1385

dim(d2)

[1] 32285  1385

可以看出，两种方法检测出了数目完全相同的基因数和细胞数。接下来看看检查出的Reads综述。

s1 = colSums(d1)
names(s1) = gsub('-1', '', names(s1))
s2 = colSums(d2)
common = intersect(names(s1), names(s2))
s1 = s1[common]
s2 = s2[common]
cor(s1, s2)

[1] 0.9999967

# 虽然高度相关，CellRanger得到的Reads略微少一点点。
summary(lm(s1~s2))


Call:
lm(formula = s1 ~ s2)

Residuals:
    Min      1Q  Median      3Q     Max 
-91.897  -9.610   0.202   9.934 249.006 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 4.968e+00  7.377e-01     6.735  2.4e-11 ***
s2          9.920e-01  6.822e-05 14541.122  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.74 on 1382 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 2.114e+08 on 1 and 1382 DF,  p-value: < 2.2e-16

使用velocyto

使用STARsolo自主分析的好处是能够产生方便的BAM文件，即使数据来源于10x以外的平台，方便其他分析工具的使用，比如velocyto。

安装软件


# 安装velocyto
# 寻求帮助请联系：support@sibcb.ac.cn
mamba create -n velocyto python=3.10 numpy=1.26 scipy cython numba matplotlib scikit-learn h5py click
mamba activate velocyto
pip install velocyto

# export PATH="/sibcb1/bioinformatics/biocourse/biocourse01/miniforge3/envs/velocyto/bin:$PATH"
velocyto --help
  Commands:
    run            Runs the velocity analysis outputting a loom file
    run10x         Runs the velocity analysis for a Chromium Sample
    run-dropest    Runs the velocity analysis on DropEst preprocessed data
    run-smartseq2  Runs the velocity analysis on SmartSeq2 data (independent bam file per cell)
    tools          helper tools for velocyto

下载Repeat的注释

可以从UCSC genome browser下载Repeat的注释，因为表达的重复序列是下游分析的混杂因素。

运行

velocyto run10x -m GTF/mm10_rmsk.gtf Mouse1k GTF/genes.gtf -@ 12 --dtype uint32

如果正确运行，会在10x的结果文件中产生一个loom文件Mouse1k/velocyto/Mouse1k.loom。