Variant filtering | Filtering and prioritizing genetic variants

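WGS and WES sequencing and analysis produce a large amount of variant data.

Clearly, analyzing every variant directly is not practical.

For disease studies, there are some commonly used filtering routines.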

 

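Variants affect gene expression in two broad ways:

1. Coding: these can directly affect the RNA that is produced, and the downstream folding and assembly of the protein;

2. Non-coding: the most popular mechanism at present is the enhancer as a mediator, and there are already fairly good results.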

 

Why filtering is necessary

First, a GWAS has already been done. We need to understand what results it produced and where its limitations lie.

Our previous meta-analysis of genome-wide association studies estimated that common variants together account for a small proportion of heritability estimated from family studies.4 Rare variants might therefore contribute significantly to the missing heritability.

Most of these variants (77.5%) were novel or rare (MAF < 1%).

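Common variants are easy to find through GWAS: because they occur at high frequency, a modest number of samples gives enough power to detect them. However, common variants mostly lie in non-coding regions and influence disease through very complex regulation; they also explain little of the heritability and are not the dominant factor in disease. The field has therefore shifted toward rare variants. Rare variants tend to lie in coding regions, change the protein directly, and so affect disease in a more direct way, but a very large sample size is clearly needed to have enough power to detect them.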

The analysis showed the strongest association of 328 variants with HSCR (P < 5 × 10⁻⁸), all of which mapped to the known disease susceptibility loci of RET and NRG1 (Figure 1A, upper panel).

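The GWAS directly found 328 significant variants, but they are clearly in strong LD with one another, and in the end they map to only two genes. Both genes were already known, so at the basic level this GWAS produced no new findings of value.

Among the 936 WGS samples, a total of 4985 protein-truncating URVs were detected. This is essentially the data I need to work with.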

 

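On errors introduced during PCR amplification and errors caused by sequencing quality.

Filtering on DP (read depth) and GQ (genotype quality) removes a large share of them, and BQSR (base quality score recalibration) can also correct part of them.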
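As a rough illustration of the DP/GQ filter mentioned above, here is a minimal Python sketch that checks one sample's genotype field from a VCF row. The thresholds (`MIN_DP = 10`, `MIN_GQ = 20`) and the function names are assumptions made for this example, not values taken from the study; suitable cutoffs depend on the sequencing depth.

```python
# Hypothetical thresholds; tune them to the sequencing depth of the study.
MIN_DP = 10
MIN_GQ = 20


def parse_sample(format_field, sample_field):
    """Pair the FORMAT keys of a VCF row with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def passes_quality(format_field, sample_field, min_dp=MIN_DP, min_gq=MIN_GQ):
    """Return True when the genotype call has enough depth and quality."""
    call = parse_sample(format_field, sample_field)
    try:
        dp = int(call.get("DP", "."))
        gq = float(call.get("GQ", "."))
    except ValueError:
        # Missing values ('.') are treated as failing the filter.
        return False
    return dp >= min_dp and gq >= min_gq


if __name__ == "__main__":
    print(passes_quality("GT:AD:DP:GQ", "0/1:12,10:22:99"))
    print(passes_quality("GT:AD:DP:GQ", "0/1:2,1:3:15"))
```

In practice this kind of filter is usually applied with an existing tool rather than hand-written code; the sketch only shows the logic.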

 

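Databases that may be useful:

1. 1000 Genomes: too few people were sequenced, only on the order of a thousand, and even fewer once you restrict to a specific population

2. gnomAD: 125,748 exome sequences and 15,708 whole-genome sequences, an impressive amount of sequencing

3. ExAC: exome sequencing, 60,706 unrelated individuals

4. Ensembl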

 

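Points to note:

1. The disease population: we focus on East Asians

2. Disease incidence: highest among Asians (2.8/10,000 live births); an allele-frequency cutoff of 0.005 (5 per 1,000) is generally a reasonable setting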

 

Useful variant annotation tools (results differ considerably between tools; see the paper)

ANNOVAR

gene-based annotation

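# Download the refGene gene annotation database for hg19
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
# Gene-based annotation of the bundled example input
annotate_variation.pl -out ex1 -buildver hg19 example/ex1.avinput humandb/
# Convert a VCF to ANNOVAR input format, then annotate it
convert2annovar.pl -format vcf4 HK152C.vcf > HK152C.avinput
annotate_variation.pl -out HK152C -buildver hg19 HK152C.avinput /home/lizhixin/softwares/annovar/humandb/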

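In practice, the VCF must first be converted to ANNOVAR's input format.

Functional consequences reported by the annotation: nonsynonymous SNV, synonymous SNV, frameshift insertion, frameshift deletion, nonframeshift insertion, nonframeshift deletion, frameshift block substitution, nonframeshift block substitution

What is a nonframeshift deletion? It is a deletion whose length is a multiple of 3 bases, so whole codons are removed and the reading frame is not shifted.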
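The multiple-of-three rule can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the indel lies entirely within coding sequence; it is not how ANNOVAR itself decides, since ANNOVAR works from transcript annotation. The function name `indel_frame_effect` is made up for the example.

```python
def indel_frame_effect(ref, alt):
    """Classify a coding indel by whether its length change is a multiple of 3.

    Assumes ref/alt are VCF-style alleles and that the whole change falls
    inside coding sequence (an assumption of this sketch).
    """
    diff = len(alt) - len(ref)
    if diff == 0:
        return "substitution"
    kind = "insertion" if diff > 0 else "deletion"
    prefix = "nonframeshift" if diff % 3 == 0 else "frameshift"
    return f"{prefix} {kind}"


if __name__ == "__main__":
    print(indel_frame_effect("ATCG", "A"))  # 3 bases removed
    print(indel_frame_effect("AT", "A"))    # 1 base removed
    print(indel_frame_effect("A", "ATTT"))  # 3 bases added
```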

ANNOVAR can also be used to filter variants

annotate_variation.pl -downdb -webfrom annovar -build hg19 gnomad211_exome humandb/
annotate_variation.pl -downdb -webfrom annovar -build hg19 gnomad211_genome humandb/

 

VEP

 

KGGSeq

java -jar /home/lizhixin/softwares/kggseqhg19/kggseq.jar --buildver hg19 --vcf-file HSCR.WGS.2_5.variants.vcf.gz --db-filter 1kgeas201305,gadexome,gadgenome --rare-allele-freq 0.005 --o-vcf

'--rare-allele-freq c' will exclude variants with alternative allele frequency EQUAL to or over c in the reference datasets
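To make the "equal to or over c" rule concrete, here is a small Python sketch of the same comparison. It is not KGGSeq code. The dataset labels are simply the ones passed to `--db-filter` above, the frequencies are made-up numbers, and treating a variant that is absent from a dataset (`None`) as passing is an assumption of this sketch.

```python
def is_rare(ref_freqs, cutoff=0.005):
    """Keep a variant only if its alt allele frequency is below `cutoff`
    in every reference dataset.

    `ref_freqs` maps a dataset label to a frequency, or to None when the
    variant is absent from that dataset (assumed here to pass the filter).
    """
    return all(af is None or af < cutoff for af in ref_freqs.values())


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    kept = {"1kgeas201305": 0.001, "gadexome": None, "gadgenome": 0.004}
    dropped = {"1kgeas201305": 0.001, "gadexome": 0.005, "gadgenome": 0.004}
    print(is_rare(kept))
    print(is_rare(dropped))
```

Note that a frequency exactly equal to the cutoff is excluded, matching the wording of the option.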

 

 

Filtering criteria

  • allele frequency, e.g. filter out variants with a frequency above 0.005
  • known gene sets
  • heterozygous vs. homozygous
  • protein-truncating (stopgain, splicing, or frameshift) 
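The four criteria can be combined into one predicate. The sketch below assumes each variant is a plain dict with hypothetical keys (`af`, `gene`, `effect`, `gt`); the effect labels follow the list above, and the gene set uses RET and NRG1 only because they are the known loci mentioned earlier. How zygosity should be used depends on the inheritance model being tested, so that part is left as a parameter.

```python
TRUNCATING = {"stopgain", "splicing", "frameshift"}


def zygosity(gt):
    """Turn a VCF GT string such as '0/1' or '1|1' into a zygosity label."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if all(a == "0" for a in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_alt"
    return "het"


def keep_variant(variant, gene_set, max_af=0.005, allowed=("het", "hom_alt")):
    """Apply the frequency, gene-set, zygosity, and truncating criteria."""
    return (
        variant["af"] < max_af
        and variant["gene"] in gene_set
        and variant["effect"] in TRUNCATING
        and zygosity(variant["gt"]) in allowed
    )


if __name__ == "__main__":
    genes = {"RET", "NRG1"}
    # Hypothetical record, for illustration only.
    v = {"af": 0.0002, "gene": "RET", "effect": "stopgain", "gt": "0/1"}
    print(keep_variant(v, genes))
```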

 

example: rs2435357

Looking at it in gnomAD: can this one still be filtered by allele frequency? It is a common variant in a non-coding region, and its effect size is very small.

 

Variant Annotation (see the paper)

Annotation was done using KGGseq for protein function against the RefGene, pathogenicity, and population frequencies.

We defined protein-truncating variants as those that lead to (1) gain of the stop codon, (2) frameshift and (3) alteration of the essential splice sites.

Damaging variants include all protein-truncating variants and missense or in-frame variants predicted to be deleterious by KGGseq. Benign variants are missense variants or in-frame variants predicted benign by KGGseq.

Finally, protein-altering variants comprise both damaging and benign variants. Rare variants are those whose minor allele frequency (MAF) is <0.01 in public databases. Ultra-rare variants (URVs) are defined as a singleton variant, that is, one that appeared only once in our whole data set, not present in dbSNP138 or public databases.
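These definitions map naturally onto two small functions. The following Python sketch encodes them as stated in the quoted text; the effect labels, the `predicted_deleterious` flag (standing in for the KGGseq prediction), and the use of `None` for "absent from public databases" are all assumptions of the example.

```python
TRUNCATING = {"stopgain", "frameshift", "splicing"}
MISSENSE_LIKE = {"missense", "inframe"}


def functional_class(effect, predicted_deleterious):
    """Return 'damaging', 'benign', or 'other' following the quoted definitions."""
    if effect in TRUNCATING:
        return "damaging"
    if effect in MISSENSE_LIKE:
        return "damaging" if predicted_deleterious else "benign"
    return "other"


def is_protein_altering(effect, predicted_deleterious):
    """Protein-altering variants comprise both damaging and benign variants."""
    return functional_class(effect, predicted_deleterious) in {"damaging", "benign"}


def frequency_class(public_maf, cohort_count, in_dbsnp):
    """Return 'ultra-rare', 'rare', or 'common'.

    public_maf is None when the variant is absent from public databases;
    cohort_count is how many times it was seen in the whole data set.
    """
    if cohort_count == 1 and not in_dbsnp and public_maf is None:
        return "ultra-rare"
    if public_maf is None or public_maf < 0.01:
        return "rare"
    return "common"


if __name__ == "__main__":
    print(functional_class("missense", predicted_deleterious=True))
    print(frequency_class(None, cohort_count=1, in_dbsnp=False))
```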

See this KGGSeq option: Gene feature filtering

 

Types of variants:

  • Putative LoF variants
  • Nonsynonymous and missense variants
  • Synonymous variants
  • Exonic variants

A frameshift mutation is a genetic mutation caused by a deletion or insertion in a DNA sequence that shifts the way the sequence is read.

a transcript is defined by its exons, introns and UTRs and their locations

 

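It is very important to keep the classic gene structure model firmly in mind:

To summarize:

On the genome there are promoters and enhancers, which initiate transcription under the action of transcription factors. Then comes the gene itself: UTRs (untranslated regions, which are transcribed but not translated into protein) sit at both ends of the transcript, and between them lies the core region of alternating exons and introns. Introns often contain many regulatory elements, such as enhancers.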

 

 

References:

KGGSeq: A biological Knowledge-based mining platform for Genomic and Genetic studies using Sequence data

A practical guide to filtering and prioritizing genetic variants

Choice of transcripts and software has a large effect on variant annotation

Gene Structure - how mRNA and protein are produced

Regulation of Gene Expression: Operons, Epigenetics, and Transcription Factors - how regulation works

Eukaryotic Gene Regulation part 1

 

Operational details:

Downloading and installing vcftools

Extract subset of samples from multigenome vcf file

Split out each sample and annotate it independently:

for i in HK152C HK154C HK162C HK175C HK180C; do
    echo "$i"
    # -c keeps only this sample's column; -e drops rows with no variant in it
    vcf-subset -e -c "$i" hscr2zxl.sel.vcf.gz > "${i}.vcf" # | bgzip -c
done

  

Nonsense-mediated mRNA decay (NMD)

Nonsense-mediated RNA decay in the brain: emerging modulator of neural development and disease