How to get similar-sounding words together

【Question title】: How to get the similar-sounding words together
【Posted】: 2019-08-15 08:03:15
【Question description】:

I am trying to get all the similar-sounding words from a list.

I tried using cosine similarity to find them, but that does not serve my purpose.

from sklearn.metrics.pairwise import cosine_similarity
dataList = ['two','fourth','forth','dessert','to','desert']
cosine_similarity(dataList)

I know this is not the right approach, and I can't seem to get a result like the following:

result = ['xx', 'xx', 'yy', 'yy', 'zz', 'zz']

Here, identical entries stand for words that sound alike.

【Discussion】:

Tags: python python-3.x list

【Solution 1】:

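First, you need the right method for finding similar-sounding words, i.e. string similarity based on a phonetic encoding. I would suggest:

Using jellyfish: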

from jellyfish import soundex

print(soundex("two"))
print(soundex("to"))

Output:

T000
T000

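Now, perhaps, create a function that handles the whole list, and then sort its result so that equal codes end up next to each other: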

def getSoundexList(dList):
    res = [soundex(x) for x in dList]   # iterate over each elem in the dataList
    # print(res)     # ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']
    return res

dataList = ['two','fourth','forth','dessert','to','desert']
print(sorted(getSoundexList(dataList)))

Output:

['D263', 'D263', 'F630', 'F630', 'T000', 'T000']

Edit:

Another way could be:

Using fuzzy:

import fuzzy
soundex = fuzzy.Soundex(4)

print(soundex("to"))
print(soundex("two"))

Output:

T000
T000

Edit 2:

If you want them grouped, you can use groupby:

from itertools import groupby

def getSoundexList(dList):
    return sorted([soundex(x) for x in dList])

dataList = ['two','fourth','forth','dessert','to','desert']    
print([list(g) for _, g in groupby(getSoundexList(dataList), lambda x: x)])

Output:

[['D263', 'D263'], ['F630', 'F630'], ['T000', 'T000']]
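
A side note on why the list is sorted before grouping: itertools.groupby only merges consecutive equal items, so unsorted input yields fragmented groups. The sketch below illustrates this with the codes hard-coded from the commented output earlier in this answer, so it runs without any third-party package. The variable names are made up for illustration.

```python
from itertools import groupby

# Soundex codes in the original list order, copied from the answer's commented output.
codes = ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']

# Without sorting, equal codes that are not adjacent land in separate groups.
unsorted_groups = [list(g) for _, g in groupby(codes)]

# Sorting first makes equal codes adjacent, so each code forms exactly one group.
sorted_groups = [list(g) for _, g in groupby(sorted(codes))]

print(unsorted_groups)  # [['T000'], ['F630', 'F630'], ['D263'], ['T000'], ['D263']]
print(sorted_groups)    # [['D263', 'D263'], ['F630', 'F630'], ['T000', 'T000']]
```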

Edit 3:

This one is for @Eric Duminil, assuming you want the names together with their respective values:

Using a dict with itemgetter:

from itertools import groupby
from operator import itemgetter

def getSoundexDict(dList):
    # build the name -> soundex value mapping from the argument, not from a global
    dict_ = {x: soundex(x) for x in dList}
    return sorted(dict_.items(), key=itemgetter(1))  # sort the (name, val) pairs on val

dataList = ['two','fourth','forth','dessert','to','desert']

# group the sorted (name, val) pairs by their val
print([list(g) for _, g in groupby(getSoundexDict(dataList), key=itemgetter(1))])

Output:

[[('dessert', 'D263'), ('desert', 'D263')], [('fourth', 'F630'), ('forth', 'F630')], [('two', 'T000'), ('to', 'T000')]]
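
If only the words are wanted per group, a plain dict keyed by code may be simpler than sort plus groupby, since it needs no sorting. This is a minimal sketch, not part of the original answer: `group_words_by_code` is a hypothetical helper, and the codes are hard-coded from the output above so the snippet runs without jellyfish or fuzzy.

```python
from collections import defaultdict

# Word -> Soundex code, copied from the output shown above.
codes = {'two': 'T000', 'fourth': 'F630', 'forth': 'F630',
         'dessert': 'D263', 'to': 'T000', 'desert': 'D263'}

def group_words_by_code(word_to_code):
    """Collect the words that share a code, keeping first-seen order."""
    groups = defaultdict(list)
    for word, code in word_to_code.items():
        groups[code].append(word)
    return dict(groups)

grouped = group_words_by_code(codes)
print(grouped)
# {'T000': ['two', 'to'], 'F630': ['fourth', 'forth'], 'D263': ['dessert', 'desert']}
```

In real use the `codes` dict would be built with whichever encoder you picked, e.g. `{w: soundex(w) for w in dataList}`.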

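Edit 4 (for the OP):

Soundex:

Soundex is a system whereby values are assigned to names in such a manner that similar-sounding names get the same value. These values are known as soundex encodings. A search application based on soundex will not search for a name directly; it will search for the soundex encoding instead. By doing so, it obtains all the names that sound like the name being sought.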
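
To make the idea concrete, here is a minimal sketch of the classic American Soundex rules in plain Python. `simple_soundex` is a hypothetical helper written for illustration only. It is not the code used by jellyfish or fuzzy, and it may differ from them on edge cases such as non-ASCII letters. The rules assumed here: keep the first letter, map the remaining consonants to digits, skip vowels, collapse adjacent letters with the same digit (also across 'h' and 'w'), and pad or truncate to four characters.

```python
def simple_soundex(word):
    """Simplified American Soundex encoder (illustrative sketch)."""
    # Build the letter -> digit table for the six consonant classes.
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    encoded = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if digit and digit != prev:
            encoded += digit
        # 'h' and 'w' do not separate same-coded letters; vowels do.
        if ch not in "hw":
            prev = digit
    # Pad with zeros and cut to the usual length of four.
    return (encoded + "000")[:4]

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
print([simple_soundex(w) for w in dataList])
# ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']
```

Words that collapse to the same four-character code, such as 'fourth' and 'forth', are the ones a soundex-based search would treat as matches.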