GATK

fastp

bwa: generating the SAM file

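bwa has three algorithms (BWA-backtrack, BWA-SW and BWA-MEM); mem is the most general-purpose of the three.
bwa index ref.fa
Build the index first, after downloading the FASTA file of the reference genome.
Open question: I am not sure whether a compressed reference file can be used. The tutorial decompressed it, and I also decompressed it when I ran this step.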
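A minimal sketch of the indexing step, assuming the classic bwa index layout in which `bwa index ref.fa` writes five files next to the reference (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). The helper names `bwa_index_exists` and `ensure_bwa_index` are hypothetical and only illustrate how to avoid rebuilding an index that is already there.

```shell
# Hypothetical helper: succeed only if all five bwa index files exist for a reference.
bwa_index_exists() {
    ref=$1
    for ext in amb ann bwt pac sa; do
        [ -f "$ref.$ext" ] || return 1
    done
    return 0
}

# Hypothetical helper: build the index only when it is missing and bwa is installed.
ensure_bwa_index() {
    ref=$1
    if bwa_index_exists "$ref"; then
        echo "index present: $ref"
    elif command -v bwa >/dev/null 2>&1; then
        bwa index "$ref"
    else
        echo "bwa not found; cannot index $ref" >&2
        return 1
    fi
}
```

Calling `ensure_bwa_index ref.fa` before the alignment step would then be enough to make sure the index is in place.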
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam # without the -R option
The .fq inputs here may be compressed files.
bwa mem -R '@RG\tID:group_n\tLB:library_n\tPL:illumina\tPU:unit1\tSM:sample_n' ref.fa read1.fq read2.fq > aln-pe.R.sam # with the -R option
Read group format required by GATK
A read group line starts with @RG.
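The `-R` string is easy to mistype, so one option is to assemble it from its parts. The sketch below is only an illustration: `make_rg` is a hypothetical helper, and the field values are the placeholders used in the command above. The tabs are deliberately kept as the two characters backslash and `t`, which bwa expands itself.

```shell
# Hypothetical helper: assemble a read group string for `bwa mem -R`.
# Arguments: $1=ID  $2=LB  $3=PL  $4=PU  $5=SM
make_rg() {
    # '\\t' in the printf format prints a literal backslash followed by t.
    printf '@RG\\tID:%s\\tLB:%s\\tPL:%s\\tPU:%s\\tSM:%s\n' "$1" "$2" "$3" "$4" "$5"
}

rg=$(make_rg group_n library_n illumina unit1 sample_n)
printf '%s\n' "$rg"

# Run the alignment only if bwa, the index and the reads are actually available.
if command -v bwa >/dev/null 2>&1 && [ -f ref.fa.bwt ] && [ -f read1.fq ] && [ -f read2.fq ]; then
    bwa mem -R "$rg" ref.fa read1.fq read2.fq > aln-pe.R.sam
fi
```

`printf '%s\n'` is used instead of `echo` because some shells let `echo` turn `\t` into a real tab, which would hide what is actually passed to bwa.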

ID = Read group identifier
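An ID that is unique to each read group.
In Illumina sequencing data, read group IDs are composed of the flowcell name and the lane number.
During base quality recalibration, read group IDs are needed to tell technical batch effects apart: reads in the same read group are assumed to share the same technical error model.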
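As a rough sketch, the flowcell and lane can be read from the first header line of the FASTQ file. This assumes the Illumina Casava 1.8+ header layout (`@instrument:run:flowcell:lane:tile:x:y`); the helper name `rgid_from_header` and the example header are made up for illustration.

```shell
# Hypothetical helper: derive a read group ID of the form FLOWCELL.LANE
# from an Illumina FASTQ header line (Casava 1.8+ layout assumed).
rgid_from_header() {
    # Colon-separated fields: $3 = flowcell, $4 = lane
    printf '%s\n' "$1" | awk -F: '{ print $3 "." $4 }'
}

# Made-up header; for a real file take the first line of the FASTQ instead.
header='@A00123:45:HXXXXXXXX:2:1101:1000:2000 1:N:0:ATCACG'
rgid_from_header "$header"
```

For a gzip-compressed file the header could be taken with something like `gzip -cd read1.fq.gz | head -n 1`.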

PU = Platform Unit
The Platform Unit has three parts: {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}
{FLOWCELL_BARCODE} is the unique identifier of a particular flow cell;
{LANE} is the lane of that flow cell;
{SAMPLE_BARCODE} is a sample/library-specific identifier.
GATK does not require PU, but when PU is present it takes precedence over ID for base recalibration.
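A possible way to fill in the three PU parts, assuming a Casava 1.8+ FASTQ header whose second token ends in the index sequence, and assuming that index sequence is what you want to use as the sample barcode. `pu_from_header` and the header below are hypothetical.

```shell
# Hypothetical helper: build FLOWCELL.LANE.BARCODE from an Illumina FASTQ header.
pu_from_header() {
    printf '%s\n' "$1" | awk '{
        split($1, a, ":")       # a[3] = flowcell, a[4] = lane
        n = split($2, b, ":")   # b[n] = index sequence (last field of the second token)
        print a[3] "." a[4] "." b[n]
    }'
}

# Made-up header for illustration only.
pu_from_header '@A00123:45:HXXXXXXXX:2:1101:1000:2000 1:N:0:ATCACG'
```

If the header has no index sequence, the barcode has to come from the sample sheet instead, and this sketch does not apply.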

SM = Sample
The name of the sample the reads belong to. SM must be set correctly, because the VCF files produced by GATK use this name as well.
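Since a wrong SM ends up in the VCF, it can be worth checking the SAM header before going further. The sketch below lists the distinct SM values found on `@RG` lines using only awk; `rg_samples` is a hypothetical helper and the header fed to it is a made-up example.

```shell
# Hypothetical helper: read SAM header lines on stdin, print the distinct SM values.
rg_samples() {
    awk -F'\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SM:/) print substr($i, 4)   # drop the "SM:" prefix
    }' | sort -u
}

# Made-up header with two read groups that belong to the same sample.
printf '@HD\tVN:1.6\tSO:unsorted\n@RG\tID:group_n\tSM:sample_n\n@RG\tID:group_m\tSM:sample_n\n' | rg_samples
```

On a real alignment the header lines could be selected with `grep '^@' aln-pe.R.sam | rg_samples`.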

PL = Platform/technology used to produce the read
The sequencing platform used: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.
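A small sanity check for the PL value, as a sketch only. It accepts just the five platforms listed in this note, which may be incomplete for newer instruments, and it compares case-insensitively because the example command above writes `illumina` in lower case. `valid_pl` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the argument is one of the platforms listed in this note.
valid_pl() {
    pl=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$pl" in
        ILLUMINA|SOLID|LS454|HELICOS|PACBIO) return 0 ;;
        *) return 1 ;;
    esac
}

if valid_pl illumina; then
    echo "PL ok"
else
    echo "unexpected PL value" >&2
fi
```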

LB = DNA preparation library identifier
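When duplicates are marked, LB is used to decide which read groups may contain duplicates of the same molecules, because the same library is sometimes sequenced on more than one lane. To keep lanes apart, reads produced on different lanes should each get their own read group ID, whether they come from the same library or from different ones.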
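To make the lane/library relationship concrete, here is a sketch for one library sequenced on two lanes: ID and PU change with the lane, while LB and SM stay the same. All values (flowcell, barcode, sample and library names) are made up, and `lane_rg` is a hypothetical helper.

```shell
# Made-up values: one sample, one library, sequenced on lanes 1 and 2 of one flow cell.
sample=sample_n
library=library_n
flowcell=HXXXXXXXX
barcode=ATCACG

# Hypothetical helper: print the -R string for one lane ($1 = lane number).
lane_rg() {
    printf '@RG\\tID:%s.%s\\tLB:%s\\tPL:ILLUMINA\\tPU:%s.%s.%s\\tSM:%s\n' \
        "$flowcell" "$1" "$library" "$flowcell" "$1" "$barcode" "$sample"
}

for lane in 1 2; do
    lane_rg "$lane"
done
```

Each printed string would be passed to a separate `bwa mem -R` run on that lane's reads; the shared LB is what lets the duplicate-marking step treat both lanes as the same library.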

Author: JeremyL
Link: https://www.jianshu.com/p/9a29bfc87a50
Source: Jianshu
