对测序返还结果进行处理，并进行批量blast

时间 2019-11-25

标签返还结果进行处理并进批量 blast 繁體版

原文原文链接

引言

背景介绍

由于实验室用的seqHunter年代太老了，并且最近须要进行不少扩增，测序，因此考虑搭建一个简单的批量化处理流程。python

软件

python
Windows下的cmd命令
ncbi的blast+

步骤

blast+的本地化搭建

在ncbi网站上下载blast+（好像blast+的运行须要JAVA环境）
安装
- 新建db文件夹做为数据库文件夹
- input和output做为输入输出文件夹
添加环境变量
- 变量名：BLASTDB ，变量值：\path\ncbi\blast-2.7.1\db
- 变量名：Path，变量值：\path\ncbi\blast-2.7.1\bin

构建数据库

数据准备
- ncbi上能够下载别人上传的
- 也能够选择把本身的几条须要变成一个fasta文件，从而做为建库文件
格式化数据库(这里我伪装有把我全部的481同源序列变成了一个fasta文件)

cd path\blast-2.7.1\db # 把当前目录切到db下，这些命令都是在cmd下敲的。    
makeblastdb -in 481.fasta -dbtype nucl -parse_seqids -hash_index -out 481_nucl

参数说明数据库

-in 须要建库的fasta序列  
-dbtype 若是是蛋白库则用prot，核酸库用nucl  
-out 所建数据库的名称  
-parse_seqids 和 -hash_index 通常都加上，解释以下:  
-parse_seqids => Parse Seq-ids in FASTA input         # 我也不懂这是干啥  
-hash_index => Create index of sequence hash values

对公司返还的测序数据进行整理

去除ab1结尾的文件

from path import Path
import os
a = input('input your path')
d = Path(a)
for f in d.walk():   # 遍历你选择的目录下全部的文件名
    if f.endswith('ab1'): # 若是你的文件名后缀是ab1
        os.remove(f) # 移除文件

将后缀名为seq的改成fasta（由于好像要比对的文件必须是fasta格式）

for f in d.walk():
    if f.endswith('seq'):
        portion = os.path.splitext(f) # 这一步会把文件名拆成后缀加文件名的列表
        newname = portion[0] + '.fasta' # 选择其中的文件名，而后加上fasta
        os.rename(f, newname)

利用cmd的for in循环批量进行blast

for %i in (*fasta) do blastn -query %i -db ..\\db\481_nucl -evalue 1e-3 -outfmt 7 -out ..\output\%i.txt  
# 通过实践，最好是cd 到input文件夹下，即%i产生的文件名不要带有路径，否则最后的output没法输出

参数说明ide

-query 输入文件名，也就是须要比对的序列文件  
-db 格式化后的数据库名称  
-evalue 设定输出结果中的e-value阈值  
-out 输出文件名  
-num_alignments 输出比对上的序列的最大值条目数  
-num_threads 线程数  
此外还有：  
-num_descriptions 对比对上序列的描述信息，通常跟tabular格式连用  
-outfmt      
 0 = pairwise,  
 1 = query-anchored showing identities,  
 2 = query-anchored no identities,  
 3 = flat query-anchored, show identities,  
 4 = flat query-anchored, no identities,  
 5 = XML Blast output,  
 6 = tabular,  
 7 = tabular with comment lines,  
 8 = Text ASN.1,  
 9 = Binary ASN.1  
 10 = Comma-separated values

利用python自动对没有比对上的输出文件进行剔除

利用文件中包含有'0 hits found'网站

c = Path('F:/科研经常使用软件/ncbi/blast-2.7.1+/output')
for f in c.walk():
    with open(f) as fh:
        a = fh.read()
        if '0 hits found' in a:
            fh.close()
            # with open 应该是等整个执行完后自动close
            # 但咱们这里因为程序在未结束的状况下，就要remove文件
            # 因此会出现程序被占用，没法结束的状况
            # f.close强制关闭文件，而后执行remove
            os.remove(f)

小结

听说有个叫ncbiwww的包能够进行批量化，下次能够试试
blast产生的结果仍是要好好研究下