Getting incorrect score from fuzzywuzzy partial_ratio

【Question title】: Getting incorrect score from fuzzy wuzzy partial_ratio
【Posted】: 2016-09-27 15:56:17
【Question description】:

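I am fairly new to Python and am trying to use fuzzywuzzy for fuzzy matching. I believe the partial_ratio function is giving me an incorrect match score. Here is my exploratory code: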

>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Barbil')
50

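I believe this should return a score of 100, since the second string, 'Barbil', is contained in the first string. When I remove a few characters from the end or the beginning of the first string, I do get a match score of 100.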

>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear','Barbil')
100
>>> fuzz.partial_ratio('ect: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Orissa')
100

The score seems to change from 50 to 100 once the length of the first string drops to 199. Does anyone have any insight into what might be happening?

【Question discussion】:

Tags: python fuzzy-comparison fuzzywuzzy

【Solution 1】:

This is because when one of the strings is 200 characters or longer, an automatic junk heuristic gets turned on in Python's SequenceMatcher. This code should work for you:

from difflib import SequenceMatcher

def partial_ratio(s1, s2):
    """Return the ratio of the most similar substring
    as a number between 0 and 100."""

    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

    m = SequenceMatcher(None, shorter, longer, autojunk=False)
    blocks = m.get_matching_blocks()

    # each block represents a sequence of matching characters in a string
    # of the form (idx_1, idx_2, len)
    # the best partial match will block align with at least one of those blocks
    #   e.g. shorter = "abcd", longer = "XXXbcdeEEE"
    #   block = (1, 3, 3)
    #   best score == ratio("abcd", "Xbcd")
    scores = []
    for (short_start, long_start, _) in blocks:
        # shift the window left so it lines up with the start of `shorter`
        long_start = max(long_start - short_start, 0)
        long_end = long_start + len(shorter)
        long_substr = longer[long_start:long_end]

        m2 = SequenceMatcher(None, shorter, long_substr, autojunk=False)
        r = m2.ratio()
        if r > .995:
            return 100
        else:
            scores.append(r)

    return max(scores) * 100.0
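
To see the heuristic on its own, here is a rough sketch that uses only difflib. The helper name longest_block and the synthetic strings are made up for this illustration; they are not part of fuzzywuzzy. The assumption is that in a second sequence of 200 or more items, any item occurring more than roughly 1% of the time is treated as "popular" and ignored when SequenceMatcher looks for matches.

```python
from difflib import SequenceMatcher


def longest_block(needle, haystack, autojunk):
    """Return the size of the largest matching block SequenceMatcher finds.

    Hypothetical helper for this illustration only.
    """
    m = SequenceMatcher(None, needle, haystack, autojunk=autojunk)
    return max(block.size for block in m.get_matching_blocks())


needle = "la"
# 210 characters; each of the three characters occurs 70 times,
# so all of them count as "popular" once the heuristic is active.
haystack = " la" * 70
# Same text cut to 199 characters, just under the 200-item threshold.
short_haystack = haystack[:199]

print(longest_block(needle, haystack, autojunk=True))        # 0: "la" is not found
print(longest_block(needle, haystack, autojunk=False))       # 2: "la" is found
print(longest_block(needle, short_haystack, autojunk=True))  # 2: heuristic not triggered
```

The last line mirrors what the question observed: just below 200 characters the default settings behave as expected, so passing autojunk=False is only needed for the longer inputs.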

【Discussion】: