【Question Title】: How can I find a match for a string against a list of multiple strings
【Posted】: 2018-05-04 10:26:13
【Question】:

I have a predefined list of strings, for example [Intel, Windows, Google], and I want to find which of them each input string matches. The input strings look like this:

'Intel(R) software'

'Intel IT'

'IntelliCAD Technology Consortium'

'Huaian Ningda intelligence Project co.,Ltd'

'Intellon Corporation'

'INTEL\Giovanni'

'Internal - Intel® Identity Protection Technology Software'


'*.google.com'

'GoogleHit'

'http://www.google.com'

'Google Play - Olmsted County'

'Microsoft Windows Component Publisher'

'Microsoft Windows 2000 Publisher'

'Microsoft Windows XP Publisher'

'Windows Embedded Signer'

'Windows Corporation'

'Windows7-PC\Windows7'

Can someone suggest an ML algorithm, or some alternative approach, that achieves the highest match percentage? The preferred language is Python.

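【Comments】:

  • I don't know anything about machine learning, but you could use a regular expression for this.
  • Should all of these strings match? I mean, should 'intelligence' match 'Intel'?
  • Wait a moment…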
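
The two comments above point at a real design choice: a plain substring test treats 'intelligence' as a hit for 'Intel', while a word-boundary regex does not. The following is a minimal sketch of both behaviours, assuming case-insensitive matching is wanted; the helper names `matches_substring` and `matches_whole_word` are hypothetical and not from the original thread.

```python
import re

def matches_substring(keyword, text):
    """Case-insensitive substring test: 'Intel' hits 'intelligence'."""
    return keyword.lower() in text.lower()

def matches_whole_word(keyword, text):
    """Case-insensitive whole-word test using \\b word boundaries."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

sample = "Huaian Ningda intelligence Project co.,Ltd"
print(matches_substring("Intel", sample))    # substring match succeeds
print(matches_whole_word("Intel", sample))   # whole-word match fails
```

Note that the whole-word variant also rejects 'Windows7-PC\Windows7' for 'Windows', because the trailing digit counts as a word character. Which behaviour is right depends on the answer to the question asked in the comment.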

Tags: python string-matching


【Solution 1】:

You can use difflib for this:

import difflib

a = ['apple', 'ball', 'pen']
b = ['appel', 'blla', 'epn']

[(i, difflib.get_close_matches(i, a)[0]) for i in b]

Output:

[('appel', 'apple'), ('blla', 'ball'), ('epn', 'pen')]
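
One caveat with the list comprehension above: `difflib.get_close_matches` returns an empty list when nothing reaches the similarity cutoff (0.6 by default), so indexing with `[0]` raises an `IndexError` for such inputs. A minimal sketch of a safer wrapper follows; the helper name `closest_or_none` is hypothetical.

```python
import difflib

def closest_or_none(word, candidates, cutoff=0.6):
    """Return the single best close match, or None if nothing passes the cutoff."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

a = ['apple', 'ball', 'pen']
print(closest_or_none('appel', a))  # a close match exists
print(closest_or_none('xyz', a))    # no candidate is similar enough
```

Lowering `cutoff` makes the matching more permissive; raising it makes it stricter.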

To get a similarity score, you can use SequenceMatcher, whose ratio() method returns a float between 0 and 1:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

For example:

>>> similar("Apple","Appel")
0.8
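
Applied to the strings in the question, comparing a short keyword against a whole publisher name gives a low ratio, because most of the longer string is unrelated. One possible workaround, sketched here as an assumption rather than as part of the original answer, is to split each input into alphanumeric tokens and score every keyword against every token. The names `KEYWORDS` and `best_keyword` are hypothetical.

```python
import re
from difflib import SequenceMatcher

KEYWORDS = ["Intel", "Windows", "Google"]

def best_keyword(text, keywords=KEYWORDS):
    """Return (keyword, score) for the keyword closest to any token in text.

    Tokens are runs of letters and digits; comparison is case-insensitive.
    Returns (None, 0.0) if no token shares anything with any keyword.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    best = (None, 0.0)
    for kw in keywords:
        for tok in tokens:
            score = SequenceMatcher(None, kw.lower(), tok).ratio()
            if score > best[1]:
                best = (kw, score)
    return best

for s in ["Intel IT", "*.google.com", "IntelliCAD Technology Consortium"]:
    print(s, "->", best_keyword(s))
```

An exact token such as 'google' in '*.google.com' scores 1.0, while a longer token such as 'intellicad' gets a partial score, so a threshold on the score can separate exact hits from lookalikes.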

【Comments】:

    【Solution 2】:

    Use the re module:

    import re

    # raw strings keep the backslashes in the two Windows-style names literal
    love = [
        'Intel(R) software',
        'Intel IT',
        'IntelliCAD Technology Consortium',
        'Huaian Ningda intelligence Project co.,Ltd',
        'Intellon Corporation',
        r'INTEL\Giovanni',
        'Internal - Intel® Identity Protection Technology Software',
        '*.google.com',
        'GoogleHit',
        'http://www.google.com',
        'Google Play - Olmsted County',
        'Microsoft Windows Component Publisher',
        'Microsoft Windows 2000 Publisher',
        'Microsoft Windows XP Publisher',
        'Windows Embedded Signer',
        'Windows Corporation',
        r'Windows7-PC\Windows7',
    ]

    match = {}
    counts = {}

    regex_words = ['Intel', 'Windows', 'Google']
    no = 0

    # for each of the predefined words
    for x in regex_words:
        # stricter pattern for a closer match: the word followed by whitespace
        regex = r'\s?' + re.escape(x) + r'\s'

        # items we want to match (case-sensitive)
        for each in love:
            found = re.findall(re.escape(x), each)
            if found:
                # number of occurrences, used below for the bonus score
                counts[no] = len(found)

                # closer match: the word is followed by whitespace
                if re.findall(regex, each):
                    per = 0.5
                # exact match: the item is exactly the word
                elif each == x:
                    per = 0.75
                # ordinary match: the word appears somewhere in the item
                else:
                    per = 0.25
                match[each] = str(per)

                # advance only on a hit so counts stays aligned with match
                # (assumes each item matches at most one predefined word)
                no += 1

    # highest occurrence count across all matched items
    highest = max(counts.values(), default=0)

    # add a bonus of up to 0.25, proportional to count / highest
    for small_no, z in enumerate(match):
        # percentage of this item's count compared to the highest count
        per = counts[small_no] / highest * 100

        # share of the remaining 25 points that this item earns
        score = per / 100 * 25
        total_score = round(score / 100, 2)

        # update the score stored for the match
        match[z] = str(float(match[z]) + total_score)

    print(match)
    

    This outputs:

    {'Intel(R) software': '0.37', 'Intel IT': '0.62', 'IntelliCAD Technology Consortium': '0.37', 'Intellon Corporation': '0.37', 'Internal - Intel® Identity Protection Technology Software': '0.37', 'Microsoft Windows Component Publisher': '0.62', 'Microsoft Windows 2000 Publisher': '0.62', 'Microsoft Windows XP Publisher': '0.62', 'Windows Embedded Signer': '0.62', 'Windows Corporation': '0.62', 'Windows7-PC\\Windows7': '0.5', 'GoogleHit': '0.37', 'Google Play - Olmsted County': '0.62'}
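
The scoring above pairs occurrence counts with matches through a running integer index, which is easy to get out of step. A compact sketch of the same tiered idea, keyed by the string itself, is shown below. It assumes the same three tiers (0.25, 0.5, 0.75) and the same bonus of up to 0.25; the function name `tiered_scores` is hypothetical and this is one possible restructuring, not the answer's own code.

```python
import re

def tiered_scores(keywords, texts):
    """Score each matching text: a base tier plus a bonus of up to 0.25."""
    base, counts = {}, {}
    for kw in keywords:
        word = re.escape(kw)
        near = re.compile(r"\s?" + word + r"\s")  # keyword followed by whitespace
        for text in texts:
            n = len(re.findall(word, text))       # case-sensitive occurrence count
            if n == 0:
                continue
            counts[text] = n
            if near.search(text):
                base[text] = 0.5
            elif text == kw:
                base[text] = 0.75
            else:
                base[text] = 0.25
    highest = max(counts.values(), default=1)
    # the bonus is rounded first, then added to the base tier
    return {t: round(base[t] + round(0.25 * counts[t] / highest, 2), 2)
            for t in base}

texts = ["Intel IT", "GoogleHit", r"Windows7-PC\Windows7", "*.google.com"]
print(tiered_scores(["Intel", "Windows", "Google"], texts))
```

Strings that match no keyword, such as the lowercase '*.google.com', are simply absent from the result.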
    

    【Comments】:
