【Question title】: Levenshtein distance with weight/penalty for adjacency
【Posted】: 2014-06-24 07:37:52
【Question】:

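I am using string edit distance (Levenshtein distance) to compare scan paths from an eye-tracking experiment. (Currently I use the stringdist package in R.)

Basically, the letters of the strings refer to (gaze) positions in a 6x4 matrix. The matrix is configured as follows: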

     [,1] [,2] [,3] [,4]
[1,]  'a'  'g'  'm'  's' 
[2,]  'b'  'h'  'n'  't'
[3,]  'c'  'i'  'o'  'u'
[4,]  'd'  'j'  'p'  'v'
[5,]  'e'  'k'  'q'  'w'
[6,]  'f'  'l'  'r'  'x'

If I use the basic Levenshtein distance to compare strings, comparing a and g within a string gives the same estimate as comparing a and x.

For example:

'abc' compared to 'agc' -> 1
'abc' compared to 'axc' -> 1

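This means the strings are equally (dis)similar.

I would like to weight the string comparison in a way that takes adjacency in the matrix into account. For example, the distance between a and x should be larger than the distance between a and g.

One approach could be to count the "walk" (horizontal and vertical steps) from one letter to another in the matrix and divide it by the maximum "walk" distance (i.e. from a to x). For example, the "walk" distance from a to g is 1 and from a to x it is 8, giving weights of 1/8 and 1 respectively.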
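The walk-based weights described above can be computed along these lines. This is only a minimal sketch in plain Python; the names `GRID`, `POS`, `MAX_WALK`, `walk` and `weight` are made up for illustration and do not come from stringdist or any other package.

```python
# Layout of the fixation positions, copied row by row from the 6x4 matrix above.
GRID = [
    "agms",
    "bhnt",
    "ciou",
    "djpv",
    "ekqw",
    "flrx",
]

# Map each letter to its (row, column) position in the grid.
POS = {ch: (r, c) for r, row in enumerate(GRID) for c, ch in enumerate(row)}

# Largest possible walk: corner to corner (5 vertical + 3 horizontal steps).
MAX_WALK = (len(GRID) - 1) + (len(GRID[0]) - 1)


def walk(a, b):
    """Number of horizontal plus vertical steps between two letters."""
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return abs(r1 - r2) + abs(c1 - c2)


def weight(a, b):
    """Walk distance divided by the maximum walk, so it lies in [0, 1]."""
    return walk(a, b) / MAX_WALK


print(walk("a", "g"), weight("a", "g"))  # 1 0.125
print(walk("a", "x"), weight("a", "x"))  # 8 1.0
```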

Is there a way to implement this (in either R or Python)?

【Comments】:

Tags: python r levenshtein-distance edit-distance eye-tracking


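【Answer 1】:

You need a version of the Wagner-Fischer algorithm that uses non-unit costs in its inner loop. That is, where the usual algorithm has +1, use +del_cost(a[i]) and so on, and define del_cost, ins_cost and sub_cost as functions taking one or two symbols (probably just table lookups).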
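As a rough illustration of that idea, here is a small Python sketch of Wagner-Fischer with caller-supplied cost functions. The names `weighted_distance`, `unit`, `table_sub` and `SUB_TABLE` are hypothetical, and the table values are assumptions chosen for the example (0.25 corresponds to 2 steps out of a maximum of 8 in the question's layout).

```python
def weighted_distance(a, b, ins_cost, del_cost, sub_cost):
    """Wagner-Fischer edit distance with caller-supplied cost functions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    # First column: delete every symbol of a.
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])
    # First row: insert every symbol of b.
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])
    # Inner loop: the usual recurrence, with +1 replaced by the cost functions.
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost(a[i - 1]),
                d[i][j - 1] + ins_cost(b[j - 1]),
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
            )
    return d[-1][-1]


# Hypothetical lookup table: only the listed pairs are cheaper than 1.
SUB_TABLE = {("b", "g"): 0.25, ("g", "b"): 0.25}


def unit(_symbol):
    """Unit cost for every insertion or deletion."""
    return 1.0


def table_sub(x, y):
    """Substitution cost: free for a match, table lookup otherwise."""
    if x == y:
        return 0.0
    return SUB_TABLE.get((x, y), 1.0)


print(weighted_distance("abc", "agc", unit, unit, table_sub))  # 0.25
print(weighted_distance("abc", "axc", unit, unit, table_sub))  # 1.0
```

Passing the costs in as functions keeps the dynamic-programming loop unchanged, so a grid-based weight can be swapped in for `table_sub` without touching the rest.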

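【Comments】:

    【Answer 2】:

    In case anyone runs into the same "problem", here is my solution. I made an add-on to the Python implementation of the Wagner-Fischer algorithm written by Kyle Gorman.

    The add-on is the weight function and its use in the _dist function.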

    #!/usr/bin/env python
    # wagnerfischer.py: Dynamic programming Levenshtein distance function
    # Kyle Gorman <gormanky@ohsu.edu>
    # 
    # Based on:
    # 
    # Robert A. Wagner and Michael J. Fischer (1974). The string-to-string 
    # correction problem. Journal of the ACM 21(1):168-173.
    #
    # The thresholding function was inspired by BSD-licensed code from 
    # Babushka, a Ruby tool by Ben Hoskings and others.
    # 
    # Unlike many other Levenshtein distance functions out there, this works 
    # on arbitrary comparable Python objects, not just strings.
    
    
    try: # use numpy arrays if possible...
        from numpy import zeros
        def _zeros(*shape):
            """ like this syntax better...a la MATLAB """
            return zeros(shape)
    
    except ImportError: # otherwise do this cute solution
        def _zeros(*shape):
            if len(shape) == 0:
                return 0
            car = shape[0]
            cdr = shape[1:]
            return [_zeros(*cdr) for i in range(car)]
    
    def weight(A, B, weights):
        """Return the walk distance between A and B scaled to [0, 1], or 1 if weights is off."""
        if weights:
            from numpy import array, where
            # cost_weight defines the matrix structure of the AOI placement
            cost_weight = array([["a","b","c","d","e","f"],["g","h","i","j","k","l"],
            ["m","n","o","p","q","r"],["s","t","u","v","w","x"]])
    
            max_walk = 8.00 # the maximum possible distance between letters in
                            # the cost_weight matrix
    
            # row and column index of each letter
            indexA = where(cost_weight==A)
            indexB = where(cost_weight==B)
    
            # horizontal plus vertical steps between the two letters
            walk = abs(indexA[0][0]-indexB[0][0])+abs(indexA[1][0]-indexB[1][0])
    
            w = walk/max_walk
    
            return w
        else:
            return 1
    
    def _dist(A, B, insertion, deletion, substitution, weights=True):
        D = _zeros(len(A) + 1, len(B) + 1)
        for i in range(len(A)):
            D[i + 1][0] = D[i][0] + deletion * weight(A[i],B[0], weights)
        for j in range(len(B)):
            D[0][j + 1] = D[0][j] + insertion * weight(A[0],B[j], weights)
        for i in range(len(A)): # fill out middle of matrix
            for j in range(len(B)):
                if A[i] == B[j]:
                    D[i + 1][j + 1] = D[i][j] # aka, it's free. 
                else:
                    D[i + 1][j + 1] = min(D[i + 1][j] + insertion * weight(A[i],B[j], weights),
                                          D[i][j + 1] + deletion * weight(A[i],B[j], weights),
                                          D[i][j]     + substitution * weight(A[i],B[j], weights))
        return D
    
    def _dist_thresh(A, B, thresh, insertion, deletion, substitution):
        D = _zeros(len(A) + 1, len(B) + 1)
        for i in range(len(A)):
            D[i + 1][0] = D[i][0] + deletion
        for j in range(len(B)):
            D[0][j + 1] = D[0][j] + insertion
        for i in range(len(A)): # fill out middle of matrix
            for j in range(len(B)):
                if A[i] == B[j]:
                    D[i + 1][j + 1] = D[i][j] # aka, it's free. 
                else:
                    D[i + 1][j + 1] = min(D[i + 1][j] + insertion,
                                          D[i][j + 1] + deletion,
                                          D[i][j]     + substitution)
            if min(D[i + 1]) >= thresh:
                return
        return D
    
    def _levenshtein(A, B, insertion, deletion, substitution):
        return _dist(A, B, insertion, deletion, substitution)[len(A)][len(B)]
    
    def _levenshtein_ids(A, B, insertion, deletion, substitution):
        """
        Perform a backtrace to determine the optimal path. This was hard.
        """
        D = _dist(A, B, insertion, deletion, substitution)
        i = len(A) 
        j = len(B)
        ins_c = 0
        del_c = 0
        sub_c = 0
        while True:
            if i > 0:
                if j > 0:
                    if D[i - 1][j] <= D[i][j - 1]: # if ins < del
                        if D[i - 1][j] < D[i - 1][j - 1]: # if ins < m/s
                            ins_c += 1
                        else:
                            if D[i][j] != D[i - 1][j - 1]: # if not m
                                sub_c += 1
                            j -= 1
                        i -= 1
                    else:
                        if D[i][j - 1] <= D[i - 1][j - 1]: # if del < m/s
                            del_c += 1
                        else:
                            if D[i][j] != D[i - 1][j - 1]: # if not m
                                sub_c += 1
                            i -= 1
                        j -= 1
                else: # only insert
                    ins_c += 1
                    i -= 1
            elif j > 0: # only delete
                del_c += 1
                j -= 1
            else: 
                return (ins_c, del_c, sub_c)
    
    
    def _levenshtein_thresh(A, B, thresh, insertion, deletion, substitution):
        D = _dist_thresh(A, B, thresh, insertion, deletion, substitution)
        if D is not None: # identity check; != on a numpy array compares elementwise
            return D[len(A)][len(B)]
    
    def levenshtein(A, B, thresh=None, insertion=1, deletion=1, substitution=1):
        """
        Compute levenshtein distance between iterables A and B
        """
        # basic checks
        if len(A) == len(B) and A == B:
            return 0       
        if len(B) > len(A):
            (A, B) = (B, A)
        if len(A) == 0:
            return len(B)
        if thresh:
            if len(A) - len(B) > thresh:
                return
            return _levenshtein_thresh(A, B, thresh, insertion, deletion,
                                                                substitution)
        else: 
            return _levenshtein(A, B, insertion, deletion, substitution)
    
    def levenshtein_ids(A, B, insertion=1, deletion=1, substitution=1):
        """
        Compute number of insertions deletions, and substitutions for an 
        optimal alignment.
        There may be more than one, in which case we disfavor substitution.
        """
        # basic checks
        if len(A) == len(B) and A == B:
            return (0, 0, 0)
        if len(B) > len(A):
            (A, B) = (B, A)
        if len(A) == 0:
            return (0, 0, 0) # keep the return type consistent: a tuple of counts
        else: 
            return _levenshtein_ids(A, B, insertion, deletion, substitution)
    

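    【Comments】:

      【Answer 3】:

      Check out this library: https://github.com/infoscout/weighted-levenshtein (disclaimer: I am the author). It supports weighted Levenshtein distance, weighted Optimal String Alignment, and weighted Damerau-Levenshtein distance. It is written in Cython for maximum performance and can easily be installed with pip install weighted-levenshtein. Feedback and pull requests are welcome.

      Example usage: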

      import numpy as np
      from weighted_levenshtein import lev
      
      
      insert_costs = np.ones(128, dtype=np.float64)  # make an array of all 1's of size 128, the number of ASCII characters
      insert_costs[ord('D')] = 1.5  # make inserting the character 'D' have cost 1.5 (instead of 1)
      
      # you can just specify the insertion costs
      # delete_costs and substitute_costs default to 1 for all characters if unspecified
      print(lev('BANANAS', 'BANDANAS', insert_costs=insert_costs))  # prints '1.5'
      

      【Comments】:

        【Answer 4】:

        Another option that handles weights (Python 3.5), which I am not affiliated with, is https://github.com/luozhouyang/python-string-similarity

        pip install strsim
        

        【Comments】:
