以12SNP为阈值判断结核菌的传播链

在结核菌中,以12SNP为阈值判断菌株间是否为近期传播。利用生物信息学方法从原始测序结果获得各菌株的SNP文件,经过互相比较获得两个文件:(1)各菌株间相互比较的状况,格式为:菌株AVS菌株B  SNP距离,总行数为n*(n-1)/2 (2)利用文件1获得近期传播的菌株列表接下来要把这些菌株串起来,造成传播链,核心思想是:链内的任一菌株与至少1个链内其它菌株的SNP距离<=12;链内的任一菌株与链外全部菌株的距离都>12所以,核心思想也就是“迭代”,并不断删除数据,减小计算量import sysimport ref1=open(sys.argv[1]).readlines()   #strain distancef2=open(sys.argv[2]).readlines()   #distance <= 12 strain listf3=open("rough_transmission_chain","a")sum_chain_list=[i.rstrip().split()[0] for i in f1 if int(i.rstrip().split()[1])<=12]def find_transmission_strain(strain):        cycle_strain = []        match_strain = []        True_strain = [i for i in sum_chain_list if i.startswith(strain + "VS") or i.endswith("VS" + strain)]           for item in True_strain:                temp = re.sub("VS", "", item)                sum_chain_list.remove(item)                                     if temp[:len(strain)] == strain:                        match_strain.append(temp[len(strain):])                        cycle_strain.append(temp[len(strain):])                else:                        match_strain.append(temp[0:len(temp) - len(strain)])                        cycle_strain.append(temp[0:len(temp) - len(strain)])        if len(cycle_strain) > 0:                for a in range(len(cycle_strain)):                        for b in range(len(cycle_strain)):                                try:                                        sum_chain_list.remove(cycle_strain[a] + "VS" + cycle_strain[b])                                except ValueError as e:                                        pass                for i in cycle_strain:                        match_strain.extend(find_transmission_strain(i))                          return set(match_strain)temp = []for i in f2:        i = i.rstrip()if i not in temp:        a = list(find_transmission_strain(i))        a.append(i)        temp.extend(a)        for strain in a:                f3.write(strain + "\t")        f3.write("\n")else:        pass
相关文章
相关标签/搜索