How to find a match for a string against a list of multiple strings: answers

【Question title】: How can I find a match for a string against a list of multiple strings
【Posted】: 2018-05-04 10:26:13
【Question】:

I have a predefined list of strings, for example [Intel, Windows, Google], and I want to find which of them each input string matches. The input strings look like this:

'Intel(R) software'

'Intel IT'

'IntelliCAD Technology Consortium'

'Huaian Ningda intelligence Project co.,Ltd'

'Intellon Corporation'

'INTEL\Giovanni'

'Internal - Intel® Identity Protection Technology Software'


'*.google.com'

'GoogleHit'

'http://www.google.com'

'Google Play - Olmsted County'

'Microsoft Windows Component Publisher'

'Microsoft Windows 2000 Publisher'

'Microsoft Windows XP Publisher'

'Windows Embedded Signer'

'Windows Corporation'

'Windows7-PC\Windows7'

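Can someone suggest an ML algorithm, or some alternative approach, that achieves the highest match percentage? The preferred language is Python.

【Question comments】:

I don't know machine learning at all, but you could use regular expressions.
Should all of these strings match? I mean, should 'intelligence' match 'Intel'?
Hold on...

Tags: python string-matching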

【Solution 1】:

You can use difflib for this:

import difflib

a = ['apple', 'ball', 'pen']
b = ['appel', 'blla', 'epn']

[(i, difflib.get_close_matches(i, a)[0]) for i in b]

Output:

[('appel', 'apple'), ('blla', 'ball'), ('epn', 'pen')]
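One caveat with the list comprehension above: `get_close_matches` returns an empty list when no candidate reaches the similarity cutoff (0.6 by default), so indexing with `[0]` raises `IndexError`. A minimal sketch of a safer variant follows; the helper name `closest_or_none` and the extra input `'xyz'` are illustrative assumptions, not part of the original answer.

```python
import difflib

def closest_or_none(word, candidates, cutoff=0.6):
    """Return the best candidate for word, or None if nothing reaches cutoff."""
    # n=1 asks for only the single best match; the result may be an empty list
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

a = ['apple', 'ball', 'pen']
b = ['appel', 'blla', 'epn', 'xyz']  # 'xyz' shares no characters with any candidate

pairs = [(i, closest_or_none(i, a)) for i in b]
print(pairs)
```

Raising `cutoff` makes the match stricter; lowering it returns looser matches instead of `None`.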

To get the similarity as a ratio between 0 and 1, you can use SequenceMatcher:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

For example:

>>> similar("Apple","Appel")
0.8
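Applied to the question's data, the same ratio can be computed between each keyword and each token of an input string, keeping the best score. This is only a rough sketch under some assumptions: comparison is case-insensitive, tokens are runs of letters and digits, and the function name `best_keyword` is made up for illustration.

```python
import re
from difflib import SequenceMatcher

KEYWORDS = ['Intel', 'Windows', 'Google']

def best_keyword(text, keywords=KEYWORDS):
    """Return (keyword, score) for the highest token-level similarity ratio."""
    # Split the lowercased input into alphanumeric tokens
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    best = (None, 0.0)
    for kw in keywords:
        for tok in tokens:
            score = SequenceMatcher(None, kw.lower(), tok).ratio()
            # Strict '>' keeps the first keyword that reached the best score
            if score > best[1]:
                best = (kw, score)
    return best

for s in ['Intel IT', 'IntelliCAD Technology Consortium', 'http://www.google.com']:
    print(s, '->', best_keyword(s))
```

An exact token such as 'Intel' in 'Intel IT' scores 1.0, while 'IntelliCAD' scores lower against 'Intel', so a threshold on the score can separate whole-word hits from look-alikes.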

【Discussion】:

【Solution 2】:

Use the re module:

import re

# raw strings are used where an item contains a backslash
love = ['Intel(R) software',
        'Intel IT',
        'IntelliCAD Technology Consortium',
        'Huaian Ningda intelligence Project co.,Ltd',
        'Intellon Corporation',
        r'INTEL\Giovanni',
        'Internal - Intel® Identity Protection Technology Software',
        '*.google.com',
        'GoogleHit',
        'http://www.google.com',
        'Google Play - Olmsted County',
        'Microsoft Windows Component Publisher',
        'Microsoft Windows 2000 Publisher',
        'Microsoft Windows XP Publisher',
        'Windows Embedded Signer',
        'Windows Corporation',
        r'Windows7-PC\Windows7']

match = {}
counts = {}

regex_words = ['Intel', 'Windows', 'Google']

# for each of the predefined words
for x in regex_words:
    # stricter regex for a closer match: the word followed by whitespace
    regex = r'\s?' + re.escape(x) + r'\s'

    # items we want to match
    for each in love:
        found = re.findall(re.escape(x), each)
        if found:
            # number of occurrences, keyed by the item itself
            # so it can be looked up again when scoring
            counts[each] = len(found)

            # closer match: the word is followed by whitespace
            if re.findall(regex, each):
                match[each] = 0.5

            # exact match: the item is the word itself
            elif each == x:
                match[each] = 0.75

            # ordinary substring match
            else:
                match[each] = 0.25

# The remaining 0.25 of the score is awarded in proportion to how often
# the word repeats in the item, relative to the highest count seen.
highest = max(counts.values(), default=0)

for z in match:
    # share of the highest count, as a percentage
    per = counts[z] / highest * 100

    # portion of the remaining 25 points that this item earns
    score = per / 100 * 25
    total_score = round(score / 100, 2)

    # add the bonus to the base score
    match[z] = str(match[z] + total_score)

print(match)

This outputs:

{'Intel(R) software': '0.37', 'Intel IT': '0.62', 'IntelliCAD Technology Consortium': '0.37', 'Intellon Corporation': '0.37', 'Internal - Intel® Identity Protection Technology Software': '0.37', 'Microsoft Windows Component Publisher': '0.62', 'Microsoft Windows 2000 Publisher': '0.62', 'Microsoft Windows XP Publisher': '0.62', 'Windows Embedded Signer': '0.62', 'Windows Corporation': '0.62', 'Windows7-PC\\Windows7': '0.5', 'GoogleHit': '0.37', 'Google Play - Olmsted County': '0.62'}
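Note that the output above has no entry for 'INTEL\Giovanni' or 'http://www.google.com', because the search is case-sensitive, while 'IntelliCAD' and 'GoogleHit' do get a score. If whole-word, case-insensitive matching is what is wanted (the question does not say), one possible variant uses `\b` word boundaries with `re.IGNORECASE`. The names `PATTERNS` and `whole_word_keywords` are illustrative assumptions.

```python
import re

KEYWORDS = ['Intel', 'Windows', 'Google']

# One compiled pattern per keyword; \b stops 'Intel' from matching inside
# 'IntelliCAD' or 'intelligence', and IGNORECASE lets 'INTEL' and 'google' match.
PATTERNS = {
    kw: re.compile(r'\b' + re.escape(kw) + r'\b', re.IGNORECASE)
    for kw in KEYWORDS
}

def whole_word_keywords(text):
    """Return the keywords that occur in text as whole words."""
    return [kw for kw, pat in PATTERNS.items() if pat.search(text)]

for s in ['Intel(R) software', r'INTEL\Giovanni', 'GoogleHit', 'http://www.google.com']:
    print(s, '->', whole_word_keywords(s))
```

A trade-off of this sketch: digits count as word characters, so 'Windows7-PC\Windows7' is not matched by `\bWindows\b`. Whether that item should count as a hit depends on the matching rules you decide on.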

【Discussion】: