Extracting the contiguous digits from a specified column

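In bioinformatics, a mutation site is usually written in a form such as c.110A->G. To convert it back to an absolute coordinate, the c.110 part is usually matched against refGene. Suppose the data looks like this: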

OTC     NM_000531       8.7Mb
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185dup67(described
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101delCTGAACGAGCGCAAGGCC
NAT2    NM_000015       c.857G>A

It needs to be converted into:

OTC     NM_000531       8.7
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101
NAT2    NM_000015       c.857

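-----------------------------------------------------------------------

In the third column I only want the contiguous run of digits ("-" and "_" are allowed; as the expected output shows, the "c." prefix and the decimal point are kept too). How do I extract it?
--------------------------------------------------------------------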

cat i
OTC     NM_000531       8.7Mb
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185dup67(described
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101delCTGAACGAGCGCAAGGCC
NAT2    NM_000015       c.857G>A
sed -r 's/(.*\s)(c?[0-9._-]*).*/\1\2/' i
OTC     NM_000531       8.7
OTC     NM_000531       9095
ASS1    NM_000050       c.1127-9_1185
CPS1    NM_001122633    35
RYR1    NM_000540       27
NAT1    NM_001160175    6
G6PD    NM_000402       c.1084_1101
NAT2    NM_000015       c.857
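
In the sed command above, `(.*\s)` greedily captures everything up to the last whitespace character, so it relies on column 3 being the last column and containing no whitespace. The following is only a sketch of a variant anchored at the third field. It assumes whitespace-separated columns and a sed that accepts `-E` (GNU and BSD sed both do); the function name `trim_col3` is hypothetical.

```shell
# Hypothetical helper: keep columns 1-2 and the leading numeric part of column 3.
# Assumes whitespace-separated columns; anything after the numeric part is dropped.
trim_col3() {
    # \1 = first two columns with their separators
    # \2 = optional "c." prefix (empty when absent)
    # \3 = run of digits, ".", "_" and "-"
    sed -E 's/^([^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+)(c\.)?([0-9._-]*).*/\1\2\3/' "$@"
}

# Usage with one of the sample rows
printf 'G6PD    NM_000402       c.1084_1101delCTGAACGAGCGCAAGGCC\n' | trim_col3
```

With no file argument the function reads standard input; `trim_col3 i` would process the file `i` directly.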
---------------------------------------------------------------------------------

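The same with awk (gawk is required, because the three-argument form of match() is a GNU extension):

awk '{match($3, /^(c\.)?[0-9._-]+/, a); print $1"\t"$2"\t"a[0]}' i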
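
The three-argument `match()` used above is a gawk extension. Where only a POSIX awk (for example mawk) is available, the two-argument form together with `RSTART` and `RLENGTH` should give the same result. The following is a minimal sketch under that assumption; the function name `extract_pos` is hypothetical, and the character class also includes "." so that a value such as 8.7Mb is reduced to 8.7.

```shell
# Hypothetical helper: keep columns 1-2 and the leading numeric part of column 3.
extract_pos() {
    awk '{
        # Two-argument match() sets RSTART and RLENGTH in any POSIX awk.
        if (match($3, /^(c\.)?[0-9._-]+/))
            $3 = substr($3, RSTART, RLENGTH)
        # Rows whose third column does not match are printed unchanged.
        print $1 "\t" $2 "\t" $3
    }' "$@"
}

# Usage with two of the sample rows
printf 'OTC\tNM_000531\t8.7Mb\nNAT2\tNM_000015\tc.857G>A\n' | extract_pos
```

As with the original awk command, the output is tab-separated regardless of how the input columns were spaced.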
----------------------------------------------------------------------------------
awk -F"\t" '{print $(NF-1)"\t"$NF"\tHet\t"$4}' $i".for_fr"|awk '{i=match($4, "(ins[a-z]*)|(del[a-z]*)|([A-Z]>)?[A-Z]*$", a); print $1"\t"$2"\t"$3"\t"a[0]}'|awk '{if(NF>3)print}' >$i".use.for_py"