How can I find the best fit subsequences of a large string? Answers

【Question title】: How can I find the best fit subsequences of a large string?
【Posted】: 2017-08-31 21:15:01
【Question description】:

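Suppose I have a large string and an array of substrings that, when joined, equal the large string (with small differences).

For example (note the subtle differences between the strings):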

large_str = ("hello, this is a long string, that may be made up of multiple "
             "substrings that approximately match the original string")

sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
 "subsrings tat aproimately ", "match the orginal strng"]

How can I best align the strings to produce a new set of substrings taken from the original large_str? For example:

["hello, this is a long string", ", that may be made up of multiple",
 "substrings that approximately ", "match the original string"]

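Additional information

The use case is finding the page breaks of an original text from the existing page breaks in text extracted from a PDF document. The text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text has no page breaks. The goal is to paginate the original text accurately while avoiding the OCR errors of the PDF text.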

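【Comments】:

This could be a complex task. At least I don't know of any simple way to compare parts of strings. You could compare parts of the strings by percentage, gauging accuracy by comparing each character against a portion of large_str and seeing how many characters match consecutively.
Splitting the large string so that the individual substrings can be compared is the hard part. But if you manage that, you can use the Levenshtein distance to compare them. See en.wikipedia.org/wiki/Levenshtein_distance
One approach I can think of is based on the page-splitting algorithm (also known as the word-wrap problem). Usually, for page splitting, we define a function that computes the cost of splitting the text, and in that algorithm the function is based on the number of spaces in the text. I think we could take a similar approach, but instead of defining the splitting function on spaces alone, design it around string similarity combined with spaces. This could be one way to start building an efficient solution.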
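As a rough illustration of the percentage comparison suggested in the first comment, the sketch below scores a candidate slice of the original text against an OCR fragment with difflib's SequenceMatcher.ratio(). The helper name similarity is hypothetical, and the candidate slice is chosen by hand here; finding the right slice is the hard part the comments point out.

```python
from difflib import SequenceMatcher


def similarity(candidate, fragment):
    """Return a similarity score in [0, 1] between two strings (hypothetical helper)."""
    # autojunk=False disables difflib's "popular element" heuristic,
    # which can distort results on longer strings.
    return SequenceMatcher(None, candidate, fragment, autojunk=False).ratio()


# Hand-picked candidate slice versus the corresponding OCR fragment.
print(similarity("hello, this is a long string", "hello, ths is a lng strin"))
```

A score near 1 suggests the slice and the fragment cover the same text; a threshold for "close enough" would have to be tuned on real data.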

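Tags: python algorithm levenshtein-distance fuzzy-comparison lcs

【Solution 1】:

Concatenate the substrings
Align the concatenation with the original string
Keep track of which positions in the original string align with the boundaries between substrings
Split the original string at the positions aligned with those boundaries

An implementation using Python's difflib: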

from difflib import SequenceMatcher
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"

sub_strs = [
  "hello, ths is a lng strin",
  ", that ay be mad up of multiple",
  "subsrings tat aproimately ",
  "match the orginal strng"]

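# Cumulative end offset of each substring within the concatenation.
sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))

# Align the original string (a) against the concatenated substrings (b).
sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk=False)

match_index = 0  # index of the substring currently being rebuilt
matches = [''] * len(sub_strs)

for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
  if tag != 'equal':
    # delete/insert/replace: the a-span and b-span may differ in length, so
    # give the whole a-span to the current substring, then advance through b.
    matches[match_index] += large_str[i1:i2]
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        # j1 sits on a substring boundary: move on to the next substring.
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      j1 += submatch_len
  else:
    # equal: both spans have the same length, so split the a-span exactly
    # where the b-span crosses a substring boundary.
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      matches[match_index] += large_str[i1:i1+submatch_len]
      j1 += submatch_len
      i1 += submatch_len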

print(matches)

Output:

['hello, this is a long string', 
 ', that may be made up of multiple ', 
 'substrings that approximately ', 
 'match the original string']

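【Discussion】:

This seems to work when the substrings are shorter than the ones generated from the original large string, but if they are longer, the output duplicates part of the string. For example, "hello, ths is a lng strinzzzzz" becomes "hello, this is a long string, th".
@JoshVoigts Yes, I hadn't accounted for the fact that the source and target spans of a replace opcode can differ in length. I've updated the answer to address this.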
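The length mismatch mentioned in the reply can be seen directly in difflib's opcodes. The small sketch below (the helper name opcode_span_lengths and the toy strings are made up for illustration) lists, for each opcode, how many characters it covers in each input.

```python
from difflib import SequenceMatcher


def opcode_span_lengths(a, b):
    """Return (tag, length in a, length in b) for every opcode (hypothetical helper)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, i2 - i1, j2 - j1)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


# "bc" in the first string is replaced by "XYZ" in the second.
for tag, len_a, len_b in opcode_span_lengths("abcd", "aXYZd"):
    print(tag, len_a, len_b)
```

Here the replace opcode covers 2 characters of the first string and 3 of the second, so code that walks both strings in lockstep by a single length would copy the wrong amount of text.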

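【Solution 2】:

You are trying to solve a sequence alignment problem. In your case, it is a "local" sequence alignment, which can be solved with the Smith-Waterman method. One possible implementation is here. If you run it, you get: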

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng sin", ", that ay be md up of mulple", "susrings tat aproimately ", "manbvch the orhjgnal strng"]

for sbs in sub_strs:
    water(large_str, sbs)


 >>>

Identity = 85.185 percent
Score = 210
hello, this is a long strin
hello, th s is a l ng s  in
hello, th-s is a l-ng s--in

Identity = 84.848 percent
Score = 255
, that may be made up of multiple
, that  ay be m d  up of mul  ple
, that -ay be m-d- up of mul--ple

Identity = 83.333 percent
Score = 225
substrings that approximately 
su s rings t at a pro imately 
su-s-rings t-at a-pro-imately 

Identity = 75.000 percent
Score = 175
ma--tch the or-iginal string
ma   ch the or  g nal str ng
manbvch the orhjg-nal str-ng

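The middle line shows the matching characters. If you need the positions, look at the max_i value to get the ending position in the original string. The starting position is the value of i at the end of the water() function.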
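Since the linked water() implementation is not reproduced here, the following is a minimal, self-contained sketch of Smith-Waterman local alignment that returns the start and end positions directly. The function name smith_waterman_span and the scoring values (match=2, mismatch=-1, gap=-1) are assumptions for illustration, not the parameters of the linked code.

```python
def smith_waterman_span(text, pattern, match=2, mismatch=-1, gap=-1):
    """Return (score, start, end) so that text[start:end] is the best local match.

    Hypothetical helper; the scoring parameters are illustrative assumptions.
    """
    rows, cols = len(text) + 1, len(pattern) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_i, best_j = 0, 0, 0

    # Fill the score matrix; scores never drop below 0 (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if text[i - 1] == pattern[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + step,  # match or mismatch
                              score[i - 1][j] + gap,       # gap in pattern
                              score[i][j - 1] + gap)       # gap in text
            if score[i][j] > best:
                best, best_i, best_j = score[i][j], i, j

    # Trace back from the best cell until the score reaches 0.
    i, j = best_i, best_j
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if text[i - 1] == pattern[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, best_i


score, start, end = smith_waterman_span("xxhelloxx", "hello")
print(score, "xxhelloxx"[start:end])
```

The matrix takes time and memory proportional to len(text) * len(pattern), so for page-sized inputs it may be worth restricting the search to a window around the expected position.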

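【Discussion】:

If large_str or sub_strs contain a lot of repetition, I don't think this will work well, since it doesn't require each sub_str to be used exactly once, in order, and so on. Consider: large_str = "hi ha he"; sub_strs = ["hi", "hi", "hi"]; each sub_str will be aligned starting at index 0.
@AlexVarga No algorithm can guess what you have in mind. If you want ["hi", "hi", "hi"] to match "hi ha he", then you should either concatenate the substrings before matching, or match iteratively, removing each newly found match from the original string. Otherwise, I don't see how the first "hi" differs from the second and third, or why the second "hi" should match differently from the first.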

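【Solution 3】:

(The additional information makes much of the following unnecessary. It was written for the case where the substrings provided may be any permutation of the order in which they appear in the main string.)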

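There is a dynamic programming solution for a problem very close to this one. In the dynamic programming algorithm that gives you the edit distance, the state of the dynamic program is (a, b), where a is the offset into the first string and b is the offset into the second string. For each pair (a, b) you work out the smallest possible edit distance that matches the first a characters of the first string with the first b characters of the second string, computing (a, b) from (a-1, b-1), (a-1, b) and (a, b-1).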
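The recurrence described above is the standard Levenshtein edit distance. A minimal sketch, assuming unit cost for insertion, deletion and substitution (the function name edit_distance is chosen here for illustration):

```python
def edit_distance(first, second):
    """Smallest number of single-character edits turning first into second."""
    # dist[a][b] = edit distance between first[:a] and second[:b]
    dist = [[0] * (len(second) + 1) for _ in range(len(first) + 1)]
    for a in range(len(first) + 1):
        dist[a][0] = a  # delete all a characters
    for b in range(len(second) + 1):
        dist[0][b] = b  # insert all b characters

    for a in range(1, len(first) + 1):
        for b in range(1, len(second) + 1):
            substitution = dist[a - 1][b - 1] + (first[a - 1] != second[b - 1])
            dist[a][b] = min(substitution,         # from (a-1, b-1)
                             dist[a - 1][b] + 1,   # from (a-1, b)
                             dist[a][b - 1] + 1)   # from (a, b-1)
    return dist[-1][-1]


print(edit_distance("match the original string", "match the orginal strng"))
```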

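You can now write a similar algorithm with state (a, n, m, b), where a is the total number of characters consumed by substrings so far, n is the index of the current substring, m is the position within the current substring, and b is the number of characters matched in the second string. This solves the problem of matching the second string against a string made by pasting together any number of copies of any of the available substrings.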
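One way this extended dynamic program might look is sketched below. It is a simplification under stated assumptions, not the answerer's exact formulation: it drops the a component (not needed when only the distance is wanted), assumes every fragment is non-empty, uses unit edit costs, and returns only the minimum distance rather than the chosen fragments. The function name is hypothetical.

```python
def distance_to_fragment_concatenation(target, fragments):
    """Min edit distance between target and any concatenation of whole fragments.

    Fragments may be reused any number of times. Assumes non-empty fragments.
    """
    # start: cost of aligning target[:b] with whole fragments only.
    # prev[n][m]: cost of aligning target[:b] with whole fragments followed
    #             by the first m characters of fragments[n].
    start = 0
    prev = [list(range(len(f) + 1)) for f in fragments]

    for ch in target:
        cur = []
        for n, frag in enumerate(fragments):
            row = [prev[n][0] + 1]  # leave ch unmatched at a fragment boundary
            for m in range(1, len(frag) + 1):
                row.append(min(prev[n][m - 1] + (frag[m - 1] != ch),  # (mis)match
                               prev[n][m] + 1,      # ch has no fragment partner
                               row[m - 1] + 1))     # fragment char has no partner
            cur.append(row)
        # A new fragment may begin wherever some fragment has just ended.
        start = min(start + 1, min(row[-1] for row in cur))
        for row in cur:
            row[0] = start
            for m in range(1, len(row)):
                row[m] = min(row[m], row[m - 1] + 1)
        prev = cur
    return start


print(distance_to_fragment_concatenation("abXcd", ["ab", "cd"]))
```

Because fragments can be reused, a target such as "abab" matches the single fragment "ab" with distance 0, which is exactly the reuse behaviour the next paragraph warns about.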

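This is a different problem, because if you are trying to reconstitute a long string from fragments, you might get a solution that uses the same fragment more than once. If you do this anyway, you might hope that the answer is obvious enough that the collection of substrings it produces happens to be a permutation of the collection given to it.

Because the edit distance returned by this approach is always at least as good as the best edit distance when you force a permutation, you could also use it to compute a lower bound on the best possible edit distance for a permutation, and run a branch and bound algorithm to find the best permutation.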

【Discussion】:

Modify the longest common subsequence algorithm.