[Question title]: Optimizing the Damerau version of the Levenshtein algorithm to better than O(n*m)
[Posted]: 2012-04-12 10:56:04
[Question]:

Here is the algorithm (in Ruby):

#http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
  def self.dameraulevenshtein(seq1, seq2)
      # Only three rows of the (seq1.size + 1) x (seq2.size + 1) table are kept.
      # Each row stores its left-border value in its last slot, so index -1
      # (reached as y - 1 when y == 0) reads that border.
      oneago = nil
      thisrow = (1..seq2.size).to_a + [0]
      seq1.size.times do |x|
          twoago, oneago, thisrow = oneago, thisrow, [0] * seq2.size + [x + 1]
          seq2.size.times do |y|
              delcost = oneago[y] + 1
              addcost = thisrow[y - 1] + 1
              subcost = oneago[y - 1] + ((seq1[x] != seq2[y]) ? 1 : 0)
              thisrow[y] = [delcost, addcost, subcost].min
              # Transposition of two adjacent elements
              if (x > 0 and y > 0 and seq1[x] == seq2[y-1] and seq1[x-1] == seq2[y] and seq1[x] != seq2[y])
                  thisrow[y] = [thisrow[y], twoago[y-2] + 1].min
              end
          end
      end
      return thisrow[seq2.size - 1]
  end

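My problem is that with a seq1 of length 780 and a seq2 of length 7238, this takes about 25 seconds to run on an i7 laptop. Ideally I'd like to get it down to about one second, since it runs as part of a web application.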

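I found that there is a way to optimize the vanilla Levenshtein distance so that the running time drops from O(n*m) to O(n + d^2), where n is the length of the longer string and d is the edit distance. So my question becomes: can the same optimization be applied to the Damerau version I have (above)?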

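[Comments]:

  • Have you looked at Levenshtein automata?
  • Do you need to know the exact distance, or only whether the distance is below some threshold? The former is harder than the latter.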
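
On the threshold variant raised in the last comment, here is a minimal Ruby sketch, assuming the same restricted (adjacent transposition) recurrence as the code in the question. The names `bounded_damerau_levenshtein` and `damerau_levenshtein_doubling` are hypothetical, not from the post or any library. The first fills only the cells within `max_dist` of the main diagonal and returns nil when the distance exceeds the limit; the second retries with a doubled limit until the band is wide enough, which yields the exact distance. This is the simpler band (cutoff) technique, not the O(n + d^2) method the question refers to.

```ruby
# Hypothetical helper: restricted Damerau-Levenshtein distance between two
# strings or arrays, or nil when the distance is greater than max_dist.
def bounded_damerau_levenshtein(seq1, seq2, max_dist)
  a = seq1.respond_to?(:chars) ? seq1.chars : seq1.to_a
  b = seq2.respond_to?(:chars) ? seq2.chars : seq2.to_a
  n, m = a.size, b.size
  # The length difference is a lower bound on the distance.
  return nil if (n - m).abs > max_dist

  inf = max_dist + 1                               # means "more than max_dist"
  two_ago = Array.new(m + 1, inf)
  one_ago = Array.new(m + 1) { |j| [j, inf].min }  # row 0: j insertions
  row     = Array.new(m + 1, inf)

  (1..n).each do |i|
    lo = [1, i - max_dist].max                     # band: |i - j| <= max_dist
    hi = [m, i + max_dist].min
    # Cells just outside the band act as sentinels; column 0 holds i deletions.
    row[lo - 1] = lo == 1 ? [i, inf].min : inf
    row[hi + 1] = inf if hi < m
    row_min = row[lo - 1]

    (lo..hi).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      best = one_ago[j - 1] + cost                 # match or substitution
      del  = one_ago[j] + 1                        # deletion
      ins  = row[j - 1] + 1                        # insertion
      best = del if del < best
      best = ins if ins < best
      if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
        swap = two_ago[j - 2] + 1                  # adjacent transposition
        best = swap if swap < best
      end
      best = inf if best > inf
      row[j] = best
      row_min = best if best < row_min
    end

    # Every cell in this row is already over the limit, so the result is too.
    return nil if row_min >= inf
    two_ago, one_ago, row = one_ago, row, two_ago  # rotate the three buffers
  end

  one_ago[m] > max_dist ? nil : one_ago[m]
end

# Hypothetical wrapper: exact distance, found by doubling the limit.
def damerau_levenshtein_doubling(seq1, seq2)
  limit = [1, (seq1.size - seq2.size).abs].max
  loop do
    d = bounded_damerau_levenshtein(seq1, seq2, limit)
    return d if d
    limit *= 2
  end
end
```

The band only saves work when the two sequences are similar and close in length. Since the length difference is a lower bound on the distance, inputs like the 780 and 7238 element case in the question have a distance of at least 6458, so the band would cover nearly the whole table there.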

Tags: ruby string algorithm optimization


[Answer 1]:

Yes, the optimization can be applied to the Damerau version. Here is Haskell code that does this (I don't know Ruby):

distd :: Eq a => [a] -> [a] -> Int
distd a b
    = last (if lab == 0 then mainDiag
            else if lab > 0 then lowers !! (lab - 1)
                 else{- < 0 -}   uppers !! (-1 - lab))
    where mainDiag = oneDiag a b (head uppers) (-1 : head lowers)
          uppers = eachDiag a b (mainDiag : uppers) -- upper diagonals
          lowers = eachDiag b a (mainDiag : lowers) -- lower diagonals
          eachDiag a [] diags = []
          eachDiag a (bch:bs) (lastDiag:diags) = oneDiag a bs nextDiag lastDiag : eachDiag a bs diags
              where nextDiag = head (tail diags)
          oneDiag a b diagAbove diagBelow = thisdiag
              where doDiag [_] b nw n w = []
                    doDiag a [_] nw n w = []
                    doDiag (apr:ach:as) (bpr:bch:bs) nw n w = me : (doDiag (ach:as) (bch:bs) me (tail n) (tail w))
                        where me = if ach == bch then nw else if ach == bpr && bch == apr then nw else 1 + min3 (head w) nw (head n)
                    firstelt = 1 + head diagBelow
                    thisdiag = firstelt : doDiag a b firstelt diagAbove (tail diagBelow)
          lab = length a - length b
          min3 x y z = if x < y then x else min y z

distance :: [Char] -> [Char] -> Int
distance a b = distd ('0':a) ('0':b)

The code above is an adaptation of this code.
