Min Edit Distance

Min Edit Distance

————两字符串之间的最小距离

PPT原稿参见Stanford;http://www.stanford.edu/class/cs124/lec/med.pdfpython

Tips:因为本人水平有限,对MED的我的理解可能有纰漏之处,请勿尽信。算法

Edit:我的理解指编辑之意,也即对于两个字符串,对其中的一个进行各类编辑操做(插入、删除、替换)使其变为另外一个字符串。要解决的问题是求出最小的编辑操做次数是多少。编程

clip_image002[4]

基因系列比对数组

定义距离:

X,Y是大小分别为n,m的字符串。app

定义D(i,j)表示X[1..i],Y[1..j]两子字符串的距离。则:框架

X与Y的距离为D(n,m)函数

应用动态规划的方法:

clip_image012[4]

对于D(i,j)的计算结果作成表格(矩阵),D(i,j)的运算结果能够创建在以前的结果之上。post

对本算法的一点我的理解:测试

设 i:输出子字符串长度this

    j:输入子字符串长度

D(0,0)=0

D(i,0):输入0个字符转换到i个字符的输出,也即要插入i个字符,代价为insertCost*i

D(0,j):目标长度为0,输入长度为j,因此代价为:deletionCost*j

D(i,j)的上一步能够分为三种状况:

一、上一步输入长度为j,输出长度为i-1,那么如今这一步确定至少还要插入一个字符才能达到i长度输出:

D(i-1,j)+inseartCost*1

二、上一步输入长度为j-1,输出长度为i-1,那么如今这一步第j个输入只须要作替换处理(若是第j个输入与第i个输出不相等)或者保持不变(若是第j个输入与第i个输出相等):

D(i-1,j-1)+substituteCost*(source[j]==target[i] ? 0 : 1)

三、上一步输入长度为j-1,输出长度为i,那么如今这一步因为又多了一个字符,因此要把多的这个字符删除:

D(i,j-1)+deletionCost*1

因为咱们要求的是最小Edit Distance,天然,就是上述三种状况中最小值作为D(i,j)的值。具体算法以下:

最小编辑距离动态算法(Levenshtein):

clip_image014[4]

D(n,m)即为最小距离。

 

字符串对齐:

    每一次计算D(i,j) 时记录当前数据是由哪一个数据得的。当D(n,m) 的计算完成后,便可从D(n,m) 出发进行回溯(backtrace).获得和条从D(n,m) 到D(0,0) 的路径:

clip_image019[4]

clip_image021[4]

上图中从(0,0)到(n,m)的每个非递减路径即为两字符串的对齐关系。

附加回溯过程的MED:

clip_image023[4]

Weighted Edit Distance:(加权编辑距离)

为何要加权?

由于有些字符被写错的几率要大些(如搜索引擎中常常能自动搜索到相近的词句)

clip_image025[4]

算法:

clip_image027[4]

 


Levenshtein算法python实现:

#===============================================================================
# Using dynamic programming to realize the MED algorithm(Levenshtein)
# MED: short for Minimum Edit Distance
#===============================================================================

import types
import numpy as np

class MED:
    def __init__(self):
        self.insCost = 1    #insertion cost
        self.delCost = 1    #deletion cost
        self.subsCost = 1   #substitution cost
        self.D = 0

    def mDistance(self, word1, word2):
        assert type(word1) is types.StringType
        assert type(word2) is types.StringType
        size1 = len(word1)
        size2 = len(word2)
        if size1==0 or size2 ==0:
            return size1*self.delCost+size2*self.insCost
        D_mat = np.zeros((size1+1,size2+1))
        D_mat[:,0] = range(size1+1)
        D_mat[0,:] = range(size2+1)
        for i in range(1,size1+1):
            for j in range(1,size2+1):
                insert_cost = D_mat[i-1, j] + self.insCost*1
                delete_cost = D_mat[i, j-1] + self.delCost*1
                if word1[i-1]==word2[j-1]:
                    substitue_cost = D_mat[i-1, j-1]
                else:
                    substitue_cost = D_mat[i-1, j-1] + self.subsCost*1
                D_mat[i,j] = min(insert_cost, delete_cost, substitue_cost)
        self.D = D_mat
        return D_mat[size1, size2]

if __name__ == "__main__":
    word1 = "Function"
    word2 = "fanctional"
    med = MED()
    print "MED distance is :" ,med.mDistance(word1, word2)

输出结果:

MED distance is : 4.0

 

扩展应用1:利用回溯作字符对齐

对原ED计算函数作点更改(每次获得MED时记录该值的来源,而后从D(n,m)开始利用记录的来源回溯至起始位置,获得该MED的完整路径):

def computingAlignment(self, word1,  word2):
        assert type(word1) is types.StringType
        assert type(word2) is types.StringType
        size1 = len(word1)
        size2 = len(word2)
        if size1==0 or size2 ==0:
            return size1*self.delCost+size2*self.insCost
        D_mat = np.zeros((size1+1,size2+1))
        D_rec = np.zeros((size1+1, size2+1))
        D_mat[:,0] = range(size1+1)
        D_mat[0,:] = range(size2+1)
        for i in range(1,size1+1):
            for j in range(1,size2+1):
                insert_cost = D_mat[i-1, j] + self.insCost*1
                delete_cost = D_mat[i, j-1] + self.delCost*1
                if word1[i-1]==word2[j-1]:
                    substitue_cost = D_mat[i-1, j-1]
                else:
                    substitue_cost = D_mat[i-1, j-1] + self.subsCost*1
                D_mat[i,j] = min(insert_cost, delete_cost, substitue_cost)
                if D_mat[i,j] == insert_cost:#Record Where the min val comes from
                    D_rec[i,j] = 1
                elif D_mat[i,j]== delete_cost:
                    D_rec[i,j] = 2
                else:
                    D_rec[i,j] = 3
        self.D = D_mat
        self.D_rec = D_rec
        # BackTrace
        alignRevPath=[]#Get the reverse path
        j = size2
        i = size1
        while i!=0 or j!=0:#Be carefull of this row
            alignRevPath.append([i,j,D_rec[i,j]])
            if D_rec[i,j]==1:
                i -=1
            elif D_rec[i,j]==2:
                j -=1
            elif D_rec[i,j]==3:
                i -=1
                j-=1
            elif D_rec[i,j]==0:
                if i>0:
                    i -= 1
                if j>0:
                    j -= 1
        alignStr1 =[]
        alignStr2 =[]
        if alignRevPath[-1][0]!=0:#process the first postion of the path
            alignStr1.append(word1[alignRevPath[-1][0]-1])
        else:
            alignStr1.append('*')
        if alignRevPath[-1][1]!=0:
            alignStr2.append(word2[alignRevPath[-1][1]-1])
        else:
            alignStr2.append('*')
        
        for i in range(len(alignRevPath)-1, 0, -1): #process the rest of the path
            k = np.subtract(alignRevPath[i-1], alignRevPath[i])
            bK = k>0
            if bK[0]!=0:
                alignStr1.append(word1[alignRevPath[i-1][0]-1])
            else:
                alignStr1.append('*')
                
            if bK[1]!=0:
                alignStr2.append(word2[alignRevPath[i-1][1]-1])
            else:
                alignStr2.append('*')
        return (alignStr1, alignStr2)

上面的代码中alignRevPath用来保存路径上的每个位置,其每一个元素都为3元列表,前两维为路上的坐标,第3维取0、一、二、3四种值,0表示路径到达边界了,1表示当前的ED结果由前一个insert操做获得,2表示当前ED结果由前一个Delete获得,3表示当前ED结果由前一个substitue获得。

在主程序中加入以下测试代码:

word1 = "efnction"
word2 = "faunctional"
med = MED()
w1, w2 = med.computingAlignment(word1, word2)
print ' '.join(w1)
print '| '*len(w1)
print ' '.join(w2)

获得的输出:

clip_image002[6]

 


扩展应用2:匹配最长子字符串(不必定连续)

原理:

clip_image002[8]

clip_image004

图示:

clip_image006

从上图能够看出,匹配的并不必定是连续的子串.这是由于咱们的惩罚项设置为:

clip_image008

也即迭代过程clip_image010中的s(xi,yj)取-1(不匹配)或1(匹配)。当子串中有不匹配的字符出现时,将会对以前的匹配计数减1。

若是想匹配最长连续子串,能够令惩罚项F(i,j)为0(不匹配)或F(i-1,j-1)+1(匹配)

Python 实现(对computingAlignment作少量更改):

def longestSubstr(self, word1,  word2):
        assert type(word1) is types.StringType
        assert type(word2) is types.StringType
        size1 = len(word1)
        size2 = len(word2)
        if size1==0 or size2 ==0:
            return size1*self.delCost+size2*self.insCost
        D_mat = np.zeros((size1+1,size2+1))
        D_rec = np.zeros((size1+1, size2+1))
        D_mat[:,0] = 0
        D_mat[0,:] = 0
        for i in range(1,size1+1):
            for j in range(1,size2+1):
                insert_cost = D_mat[i-1, j] - self.insCost*1
                delete_cost = D_mat[i, j-1] - self.delCost*1
                if word1[i-1]==word2[j-1]:
                    substitue_cost = D_mat[i-1, j-1] +1
                else:
                    substitue_cost = D_mat[i-1, j-1] - self.subsCost*1
#                     substitue_cost = 0
                D_mat[i,j] = max(0,insert_cost, delete_cost, substitue_cost)
                if D_mat[i,j] == insert_cost:#Record Where the min val comes from
                    D_rec[i,j] = 1
                elif D_mat[i,j]== delete_cost:
                    D_rec[i,j] = 2
                elif D_mat[i,j]==substitue_cost:
                    D_rec[i,j] = 3
                    
        self.D = D_mat
        self.D_rec = D_rec
        maxVal = np.max(D_mat)
        maxBool = D_mat == maxVal
        numMax = np.sum(maxBool)
        alignRevPaths=[]
        for i in range(numMax):
            alignRevPaths.append([])
        pathStarts=[]
        for i in range(size1+1):
            for j in range(size2+1):
                if maxBool[i,j]:
                    pathStarts.append([i,j])
        
        for m in range(numMax):
            i = pathStarts[m][0]
            j = pathStarts[m][1]
            while i!=0 and j!=0:
                alignRevPaths[m].append([i,j,D_rec[i,j]])
                if D_rec[i,j]==1:
                    i -=1
                elif D_rec[i,j]==2:
                    j -=1
                elif D_rec[i,j]==3:
                    i -=1
                    j-=1
                elif D_rec[i,j]==0:
                    break
                if D_mat[i,j]==0:
                    break
        str1=[]
        str2=[]
        for m in range(numMax):
            str1.append([])
            str2.append([])
            p = alignRevPaths[m]
            for i in range(len(p)-1, -1,-1):
                str1[m].append(word1[p[i][0]-1])
                str2[m].append(word2[p[i][1]-1])
        return ([''.join(i) for i in str1],[''.join(i) for i in str2])

代码测试:

word1 = "ATCAT"
word2 = "ATTATC"
med = MED()
str1,str2=  med.longestSubstr(word1, word2)
print "Longest match substr:"
print "match substr in word1:", str1
print "match substr in word2:", str2

输出结果:

Longest match substr:

match substr in word1: ['ATC', 'ATCAT']

match substr in word2: ['ATC', 'ATTAT']

若是把longestSubstr中的

substitue_cost = D_mat[i-1, j-1] - self.subsCost*1

改成:

substitue_cost = 0

便可用来求最大连续子字符串,一样运行上述测试代码,可得:

Longest match substr:

match substr in word1: ['ATC']

match substr in word2: ['ATC']

改用另一组字符串进行实验进行进一步验证:

word1 = "fefnction"
word2 = "faunctional"
med = MED()
str1,str2=  med.longestSubstr(word1, word2)
print "Longest match substr:"
print "match substr in word1:", str1
print "match substr in word2:", str2

当求解的是最长子字符串(非连续)时,输出:

Longest match substr:

match substr in word1: ['nction']

match substr in word2: ['nction']

当求解的是最长连续子字符串时,输出:

Longest match substr:

match substr in word1: ['nction']

match substr in word2: ['nction']

 

在以上的算法中,MED的一个最大的特色就是利用了矩阵保存以前处理的结果数据,以作为下一次的输入。对于最长子字符串的查找,一样套用了MED的框架,但仔细一想咱们会发现,其实最长子串的查找并不必定须要记录路径alignRevPaths,有了alignRevPaths这个路径,咱们编程时是根据这个路径处理字符的。路径的设置主要是为了在computingAlignment中对字符进行对齐,由于字符对齐的状况下,字符不必定是连续,好比会有以下对齐的形式中的“*”号:

clip_image002[10]

因此咱们才想到要用路径作记录。

但在最长子串中,子串确定是连续的,天然路径也就不须要了。

下面对最长子串程序作简化:

def longestSubstr2(self,word1, word2):
        assert type(word1) is types.StringType
        assert type(word2) is types.StringType
        size1 = len(word1)
        size2 = len(word2)
        if size1==0 or size2 ==0:
            return size1*self.delCost+size2*self.insCost
        D_mat = np.zeros((size1+1,size2+1))
        D_mat[:,0] = 0
        D_mat[0,:] = 0
        for i in range(1,size1+1):
            for j in range(1,size2+1):
                insert_cost = D_mat[i-1, j] - self.insCost*1
                delete_cost = D_mat[i, j-1] - self.delCost*1
                if word1[i-1]==word2[j-1]:
                    substitue_cost = D_mat[i-1, j-1] +1
                else:
                    substitue_cost = D_mat[i-1, j-1] - self.subsCost*1
#                     substitue_cost = 0                D_mat[i,j] = max(0,insert_cost, delete_cost, substitue_cost)
        self.D = D_mat
        maxVal = np.max(D_mat)
        maxBool = D_mat == maxVal
        numMax = np.sum(maxBool)
        alignRevPaths=[]
        for i in range(numMax):
            alignRevPaths.append([])
        pathStarts=[]
        for i in range(size1+1):
            for j in range(size2+1):
                if maxBool[i,j]:
                    pathStarts.append([i,j])
        
        str1=[]
        str2=[]
        for m in range(numMax):
            str1.append([])
            str2.append([])
            i = pathStarts[m][0]
            j = pathStarts[m][1]
            s1Tmp = []
            s2Tmp = []
            while D_mat[i,j]!=0:
                s1Tmp.append(word1[i-1])
                s2Tmp.append(word2[j-1])
                i -= 1
                j -= 1
            str1[m]=[s1Tmp[len(s1Tmp)-i-1] for i in range(len(s1Tmp))]
            str2[m]=[s2Tmp[len(s2Tmp)-i-1] for i in range(len(s2Tmp))]
        return ([''.join(i) for i in str1],[''.join(i) for i in str2])

使用如下字符作实验:

word1 = "ATCAT"

word2 = "ATTATC"

输出结果一样为:

Longest match substr:

match substr in word1: ['ATC', 'ATCAT']

match substr in word2: ['ATC', 'ATTAT']

一点我的感悟:

虽然这里的动态规划算法看上去有点难以想象,但联想起线性代数中的矩阵运算也就不难理解了。

就拿最长子串的程序来讲,其实际过程仍可看作:对每word1中的每个字符在word2中进行查找,当匹配第一个字符后继续匹配第二个字符,而后第3、第四……个,直到有字符不匹配时,记录该匹配成功的串长度及字串始末位置。

而这里的Smith-waterman算法将这个匹配记录在一个2维矩阵(数组)中。咱们都知道,线性代数中的矩阵乘法一样是能够展开为各元素的乘与累加操做的。而这里,我认为,发明者正是利用了这一思想。

相关文章
相关标签/搜索