【Question title】: Getting incorrect score from fuzzywuzzy partial_ratio
【Posted】: 2016-09-27 15:56:17
【Question】:

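I'm fairly new to Python and I'm trying to use fuzzywuzzy for fuzzy matching. I believe the partial_ratio function is giving me an incorrect match score. Here is my exploratory code: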

>>> from fuzzywuzzy import fuzz
>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Barbil')
50

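I believe this should return a score of 100, since the second string, 'Barbil', is contained in the first string. When I remove a few characters from the end or the beginning of the first string, I get a match score of 100.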

>>> fuzz.partial_ratio('Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear','Barbil')
100
>>> fuzz.partial_ratio('ect: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance','Orissa')
100

The score seems to switch from 50 to 100 when the length of the first string drops to 199. Does anyone have insight into what might be happening?

【Question comments】:

    Tags: python fuzzy-comparison fuzzywuzzy


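    【Solution 1】:

    This is because an automatic junk heuristic gets turned on in Python's SequenceMatcher when one of the strings is 200 characters or longer. With the heuristic on, characters that occur very often in the longer string are treated as junk and ignored when matching blocks are searched for. This code should work for you: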

    from difflib import SequenceMatcher

    def partial_ratio(s1, s2):
        """Return the ratio of the most similar substring
        as a number between 0 and 100."""

        if len(s1) <= len(s2):
            shorter = s1
            longer = s2
        else:
            shorter = s2
            longer = s1

        # autojunk=False switches off the heuristic that kicks in once
        # the second sequence has 200 or more items
        m = SequenceMatcher(None, shorter, longer, autojunk=False)
        blocks = m.get_matching_blocks()

        # each block represents a sequence of matching characters in a string
        # of the form (idx_1, idx_2, len)
        # the best partial match will block align with at least one of those blocks
        #   e.g. shorter = "abcd", longer = "XXXbcdeEEE"
        #   block = (1,3,3)
        #   best score === ratio("abcd", "Xbcd")
        scores = []
        for (short_start, long_start, _) in blocks:
            # shift the window left by the block's offset in the shorter string,
            # so the window is block aligned as described above
            window_start = max(long_start - short_start, 0)
            window_end = window_start + len(shorter)
            long_substr = longer[window_start:window_end]

            m2 = SequenceMatcher(None, shorter, long_substr, autojunk=False)
            r = m2.ratio()
            if r > .995:
                return 100
            else:
                scores.append(r)

        # return an integer score, like the 50 and 100 shown in the question
        return int(round(100 * max(scores)))
    
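The 200-character threshold can be reproduced with difflib alone. The following is a minimal sketch, not part of fuzzywuzzy: the helper `longest_match_size` and the test strings are made up for illustration, and the strings are chosen so that every character of the longer one is frequent enough to be dropped by the heuristic.

```python
from difflib import SequenceMatcher


def longest_match_size(short, long_, autojunk=True):
    """Size of the largest matching block between the two strings."""
    m = SequenceMatcher(None, short, long_, autojunk=autojunk)
    return max(block.size for block in m.get_matching_blocks())


needle = "abcd"
text_200 = " abcd" * 40       # exactly 200 characters, each one repeated 40 times
text_199 = text_200[:199]     # one character short of the threshold

print(longest_match_size(needle, text_199))                  # 4: heuristic is off
print(longest_match_size(needle, text_200))                  # 0: all characters dropped
print(longest_match_size(needle, text_200, autojunk=False))  # 4: heuristic disabled
```

In the string from the question only some characters are frequent enough to be dropped, which would explain a partial score such as 50 rather than 0.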

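The "abcd" / "XXXbcdeEEE" example from the code comments can be worked through step by step. This is only a sketch of the block alignment idea; the variable names `first`, `window_start`, `window` and `score` are chosen here for illustration.

```python
from difflib import SequenceMatcher

shorter = "abcd"
longer = "XXXbcdeEEE"

m = SequenceMatcher(None, shorter, longer, autojunk=False)
first = m.get_matching_blocks()[0]    # Match(a=1, b=3, size=3), i.e. "bcd"

# Shift the window left by the block's offset in the shorter string,
# so that shorter[0] lines up with longer[window_start].
window_start = max(first.b - first.a, 0)
window = longer[window_start:window_start + len(shorter)]    # "Xbcd"

# ratio() is 2 * matches / total length = 2 * 3 / 8
score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
print(first, window, score)
```

Here three of the four characters line up, so the best partial score for this pair is 75 out of 100.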
    【Comments】:
