First, you need a proper way to find similar-sounding words, i.e. a phonetic form of string similarity. I would suggest:
Using jellyfish:
from jellyfish import soundex
print(soundex("two"))
print(soundex("to"))
OUTPUT:
T000
T000
Now, maybe, create a function that handles the whole list, then sort the result to bring equal codes together:
def getSoundexList(dList):
    res = [soundex(x) for x in dList]  # encode each element of the list
    # print(res)  # ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']
    return res

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
print(sorted(getSoundexList(dataList)))
OUTPUT:
['D263', 'D263', 'F630', 'F630', 'T000', 'T000']
EDIT:
Another way could be:
Using fuzzy:
import fuzzy
soundex = fuzzy.Soundex(4)
print(soundex("to"))
print(soundex("two"))
OUTPUT:
T000
T000
EDIT 2:
If you want them grouped, you could use groupby:
from itertools import groupby

def getSoundexList(dList):
    return sorted(soundex(x) for x in dList)  # groupby needs sorted input

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
print([list(g) for _, g in groupby(getSoundexList(dataList))])
OUTPUT:
[['D263', 'D263'], ['F630', 'F630'], ['T000', 'T000']]
EDIT 3:
This is for @Eric Duminil. Say you want both the names and their respective values:
Using a dict along with itemgetter:
from itertools import groupby
from operator import itemgetter

def getSoundexDict(dList):
    dict_ = {x: soundex(x) for x in dList}  # dict_ with k, v as name/val
    return sorted(dict_.items(), key=itemgetter(1))  # sort the pairs by val

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
print([list(g) for _, g in groupby(getSoundexDict(dataList), itemgetter(1))])
OUTPUT:
[[('dessert', 'D263'), ('desert', 'D263')], [('fourth', 'F630'), ('forth', 'F630')], [('two', 'T000'), ('to', 'T000')]]
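If the sort is only there to make groupby work, a plain dict of lists may be simpler. The following is a minimal sketch, not part of the original answer: `group_by_code` and the `codes` lookup table are hypothetical names, and `codes` just hard-codes the Soundex values printed above so the snippet runs without any third-party package. In practice you would pass `soundex` from jellyfish as the `encode` argument.

```python
from collections import defaultdict

def group_by_code(words, encode):
    """Group words by whatever code `encode` returns for them (no sorting needed)."""
    groups = defaultdict(list)
    for word in words:
        groups[encode(word)].append(word)
    return dict(groups)

# Assumption: stand-in lookup built from the codes shown in the outputs above.
# With jellyfish installed, use group_by_code(dataList, soundex) instead.
codes = {'two': 'T000', 'fourth': 'F630', 'forth': 'F630',
         'dessert': 'D263', 'to': 'T000', 'desert': 'D263'}

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
groups = group_by_code(dataList, codes.get)
print(groups)
```

The groups come out in order of first appearance rather than sorted by code, so sort the keys if you need the same ordering as the groupby version.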
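EDIT 4 (for OP):
Soundex:
Soundex is a system whereby values are assigned to names in such a
manner that similar-sounding names get the same value. These values
are known as soundex encodings. A search application based on soundex
will not search for a name directly but will instead search for the
soundex encoding. By doing so, it obtains all names that sound
like the name being sought.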
read more..
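To make the description above concrete, here is a rough pure-Python sketch of the common American Soundex rules, together with a tiny search that compares encodings instead of names. It is an illustration only and not the jellyfish or fuzzy implementation: `simple_soundex` and `search` are hypothetical names, and the code assumes plain ASCII words.

```python
# Digit assigned to each consonant; vowels, y, h and w get no digit.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def simple_soundex(word):
    """Return a 4-character Soundex code such as 'T000' (sketch, ASCII only)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()        # the first letter is kept as-is
    prev = CODES.get(letters[0], "")   # its digit still suppresses a repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue                   # h and w do not break a run of equal digits
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                    # a vowel resets prev, so repeats count again
    return (result + "000")[:4]        # pad with zeros, truncate to 4 characters

def search(names, query):
    """Return every name whose encoding matches the query's encoding."""
    target = simple_soundex(query)
    return [name for name in names if simple_soundex(name) == target]

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
print(search(dataList, "too"))  # ['two', 'to']
```

Note that "too" is not in the list at all; it matches because its encoding, T000, is the same as that of "two" and "to".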