GATK之HaplotypeCaller

时间 2019-12-05

标签 gatk haplotypecaller 繁體版

原文原文链接

GATK的主要功能其实就是识别变异位点，其余功能都是锦上添花。因此这一次学习GATK寻找变异位点的工具。
在GATK的文档中，与变异位点识别相关的有9个工具，分别是：java

Name	Summary
ApplyRecalibration	Apply a score cutoff to filter variants based on a recalibration table
CalculateGenotypePosteriors	Calculate genotype posterior likelihoods given panel data
GATKPaperGenotyper	Simple Bayesian genotyper used in the original GATK paper
GenotypeGVCFs	Perform joint genotyping on gVCF files produced by HaplotypeCaller
HaplotypeCaller	Call germline SNPs and indels via local re-assembly of haplotypes
MuTect2	Call somatic SNPs and indels via local re-assembly of haplotypes
RegenotypeVariants	Regenotypes the variants from a VCF containing PLs or GLs.
UnifiedGenotyper	Call SNPs and indels on a per-locus basis
VariantRecalibrator	Build a recalibration model to score variant quality for filtering purposes

稍微看了一下GATK最佳实践文档，发现目前用的最多的是HaplotypeCaller，准确率高，可是比较吃配置，因此运行时间会比较久。不过因为HaplotypeCaller的工做原理，直接省去了BQSR和indel realignment步骤，因此对于一个variant calling流程而言，能够直接比对，去重复后运行HaplotypeCaller。app

简介

功能：经过局部单倍型重组装寻找体细胞的SNP和indel
分类：变异位点探索功能ide

HaplotypeCaller，简称HC，能过经过对活跃区域（也就是与参考基因组不一样处较多的区域）局部重组装，同时寻找SNP和INDEL。换句话说，当HC看到一个地方好活跃呀，他就无论以前的比对结果了，直接对这个地方进行从新组装，因此就比传统的基于位置（position-based)的工具，如前任UnifiedGenotyper好用多了。工具

HC能够处理非二倍体物种和混池实验数据，可是不推荐用于癌症或者体细胞。
HC还能够正确处理可变剪切，因此也能用于RNA-Seq数据。post

HC有一个GVCF模式，产生各个样本的中间基因组文件(gVCF),而后用于多样本的联合基因定型(joint genotyping)，效率很高呢。学习

工做原理

HaplotypeCaller的核心操做就是四步：ui

寻找活跃区域，就是和参考基因组不一样部分较多的区域spa
经过对该区域进行局部重组装，肯定单倍型（haplotypes）。就是这一步能够省去indel realignmentcode
在给定的read数据下，计算单倍型的可能性。orm
分配样本的基因型

使用说明

输入：BAM文件
输出：原始的VCF文件或者gVCF文件，未过滤SNP和INDEL。通常而言，在进行下游分析，须要对这些数据要么手动要么VQSR过滤。

案例:
DNA-seq variant-calling

  java -jar GenomeAnalysisTK.jar \
     -R reference.fasta \
     -T HaplotypeCaller \
     -I sample1.bam [-I sample2.bam ...] \
     [--dbsnp dbSNP.vcf] \
     [-stand_call_conf 30] \
     [-L targets.interval_list] \
     -o output.raw.snps.indels.vcf

RNA-seq variant-calling

  java -jar GenomeAnalysisTK.jar \
     -R reference.fasta \
     -T HaplotypeCaller \
     -I sample1.bam \
     [--dbsnp dbSNP.vcf] \
     -stand_call_conf 20 \
     -o output.raw.snps.indels.vcf

其余感受比较使用的参数：

参数名	默认值	概要
—genotyping_mode/-gt_mode	DISCOVERY	如何处理识别的等位基因
—min_base_quality_score/ -mbq	10	最低碱基质量
—sample_ploidy/-ploidy	2	样本是几倍体？
—standard_min_confidence_threshold_for_calling/-stand_call_conf	10.0	肯定variant的最低phred-scaled 置信阈值
—annotation/-A	-	variant注释内容
—pcr_indel_model/-pcrModel	CONSERVATIVE	PCR indel模式

注：对于我作mapping-by-sequencing而言，须要结果有ref和alt碱基的支持数，因此选项-A必定要跟上StrandAlleleCountsBySample。

这是我用于MBS的流程的GATK部分

##############################
# Vaiant Discovery with GATK #
##############################mkdir -p ${wd_dir}/analysis/variant_gatk
recal_files=$(ls *recal_reads.bam)
for file in $recal_files
do
    java -Xmx16g -jar $gatk -T HaplotypeCaller \
    -R $reference  -I $file --genotyping_mode DISCOVERY \
    --standard_min_confidence_threshold_for_calling 10 \
    -A StrandAlleleCountsBySample \
    -o ../variant_gatk/${file%%.*}_raw_variants.vcf
done