Distance between strings by similarity of sound

【Question title】: Distance between strings by similarity of sound
【Posted】: 2021-06-17 06:57:45
【Question】:

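Is there a quantitative descriptor of the similarity between two words based on how they sound or are pronounced, analogous to the Levenshtein distance?

I know that soundex gives the same id to similar-sounding words, but as far as I understand it is not a quantitative descriptor of the difference between the words.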

from jellyfish import soundex

print(soundex("two"))
print(soundex("to"))

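【Comments】:

What does this question have to do with Python? It sounds like a very broad, general theoretical question that is unrelated to code or implementation.
I need a Python or mathematical implementation.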

Tags: python audio nlp linguistics

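【Solution 1】:

You can combine a phonetic encoding with a string comparison algorithm. In fact, jellyfish provides both.

Setting up the libraries and the example data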
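As a minimal sketch of that two-step idea without any third-party dependency, the block below encodes each word and then compares the codes. `soundex_code` is a hand-written, simplified American Soundex standing in for jellyfish's `soundex`, and `soundex_distance` is a hypothetical helper name; the comparison step is a plain position-wise (Hamming) count, which works here because Soundex codes always have four characters.

```python
def soundex_code(word):
    """Simplified American Soundex (illustrative stand-in, not jellyfish's implementation)."""
    digit_for = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            digit_for[ch] = digit

    letters_only = "".join(ch for ch in word.lower() if ch.isalpha())
    if not letters_only:
        return ""

    code = letters_only[0].upper()
    previous = digit_for.get(letters_only[0], "")
    for ch in letters_only[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same digit
        digit = digit_for.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # vowels reset the "previous digit" to ""
    return (code + "000")[:4]  # pad or truncate to four characters


def soundex_distance(word_a, word_b):
    """Number of positions at which the two four-character codes differ (0 to 4)."""
    code_a, code_b = soundex_code(word_a), soundex_code(word_b)
    return sum(ch_a != ch_b for ch_a, ch_b in zip(code_a, code_b))


for pair in [("two", "to"), ("Catherine", "Kathryn"), ("Smith", "desert")]:
    print(pair, soundex_distance(*pair))
```

With jellyfish installed, the same composition would be `hamming_distance(soundex(a), soundex(b))`, or `levenshtein_distance` over the codes when the encoding produces variable-length output such as metaphone.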

from jellyfish import soundex, metaphone, nysiis, match_rating_codex,\
    levenshtein_distance, damerau_levenshtein_distance, hamming_distance,\
    jaro_similarity
from itertools import groupby
import pandas as pd
import numpy as np


dataList = ['two','too','to','fourth','forth','dessert',
            'desert','Byrne','Boern','Smith','Smyth','Catherine','Kathryn']

sounds_encoding_methods = [soundex, metaphone, nysiis, match_rating_codex]

Let's compare the different phonetic encodings

report = pd.DataFrame([dataList]).T
report.columns = ['word']
for i in sounds_encoding_methods:
    print(i.__name__)
    report[i.__name__]= report['word'].apply(lambda x: i(x))
print(report)
          soundex metaphone   nysiis match_rating_codex
word                                                   
two          T000        TW       TW                 TW
too          T000         T        T                  T
to           T000         T        T                  T
fourth       F630       FR0     FART               FRTH
forth        F630       FR0     FART               FRTH
dessert      D263      TSRT    DASAD               DSRT
desert       D263      TSRT    DASAD               DSRT
Byrne        B650       BRN     BYRN               BYRN
Boern        B650       BRN     BARN                BRN
Smith        S530       SM0     SNAT               SMTH
Smyth        S530       SM0     SNYT              SMYTH
Catherine    C365      K0RN  CATARAN              CTHRN
Kathryn      K365      K0RN   CATRYN             KTHRYN

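You can see that the phonetic encodings do a good job of making the words comparable. The encodings disagree on some pairs (for example, metaphone gives Catherine and Kathryn the same code, while the other three do not), so choose one or another depending on your data.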
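To see which words a given encoding treats as identical, one option is to group the words by their code. The sketch below uses a few soundex codes copied from the table above rather than calling jellyfish, and `group_by_code` is a hypothetical helper name. It also illustrates the limitation raised in the question: grouping only says "same code" or "different code", not how far apart two words are.

```python
from itertools import groupby

# Soundex codes copied from the report table above
soundex_codes = {
    "two": "T000", "too": "T000", "to": "T000",
    "fourth": "F630", "forth": "F630",
    "Catherine": "C365", "Kathryn": "K365",
}


def group_by_code(codes):
    """Map each phonetic code to the list of words that share it."""
    # groupby only merges adjacent items, so sort by code first
    ordered = sorted(codes.items(), key=lambda item: item[1])
    return {code: [word for word, _ in items]
            for code, items in groupby(ordered, key=lambda item: item[1])}


groups = group_by_code(soundex_codes)
for code, words in groups.items():
    print(code, words)
```

Catherine and Kathryn end up in different groups here only because Soundex keeps the first letter, which is one reason to compare the codes with a distance rather than test them for equality.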

Now I will take the codes above and find the closest match for each word using levenshtein_distance, though any of the other comparison functions could be used instead.

# For each encoding, select the word whose code is closest,
# here measured with levenshtein_distance
report.set_index('word', inplace=True)
report2 = report.copy()
for sounds_encoding in sounds_encoding_methods:
    name = sounds_encoding.__name__
    for word in dataList:
        # Distance from this word's code to the code of every other word
        distances = {word_2: levenshtein_distance(report.loc[word, name],
                                                  report.loc[word_2, name])
                     for word_2 in dataList if word_2 != word}
        # On ties, min() keeps the first candidate in dataList order
        report2.loc[word, name] = min(distances, key=distances.get)

print(report2)
             soundex  metaphone     nysiis match_rating_codex
word                                                         
two              too        too        too                too
too              two         to         to                 to
to               two        too        too                too
fourth         forth      forth      forth              forth
forth         fourth     fourth     fourth             fourth
dessert       desert     desert     desert             desert
desert       dessert    dessert    dessert            dessert
Byrne          Boern      Boern      Boern              Boern
Boern          Byrne      Byrne      Byrne              Byrne
Smith          Smyth      Smyth      Smyth              Smyth
Smyth          Smith      Smith      Smith              Smith
Catherine    Kathryn    Kathryn    Kathryn            Kathryn
Kathryn    Catherine  Catherine  Catherine          Catherine

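From the above you can clearly see that combining a phonetic encoding with a string comparison algorithm can be quite straightforward.

【Comments】:

But is the distance you built quantitative? In the sense that, given three words a, b, c, does similarity(a,b) > similarity(a,c) correspond to your_dist(a,b) > your_dist(a,c)?
Yes, I am using levenshtein_distance, as suggested in the wording of the question.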
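To make the quantitative aspect concrete, here is a minimal sketch that turns the edit distance between two phonetic codes into a number between 0 and 1, where smaller means the codes are closer. The `levenshtein` function is a plain-Python stand-in for jellyfish's `levenshtein_distance`, `normalized_distance` is a hypothetical helper, and dividing by the longer code length is just one possible normalization. The codes are copied from the report table in the answer.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                   # deletion
                               current[j - 1] + 1,                # insertion
                               previous[j - 1] + (ch_a != ch_b))) # substitution
        previous = current
    return previous[-1]


def normalized_distance(code_a, code_b):
    """Edit distance scaled to [0, 1] by the length of the longer code."""
    longest = max(len(code_a), len(code_b))
    if longest == 0:
        return 0.0
    return levenshtein(code_a, code_b) / longest


# nysiis codes from the table: Catherine, Kathryn, fourth, dessert
pairs = [("CATARAN", "CATRYN"), ("FART", "DASAD"), ("CATARAN", "CATARAN")]
for code_a, code_b in pairs:
    print(code_a, code_b, round(normalized_distance(code_a, code_b), 3))
```

Under this scaling the Catherine/Kathryn codes come out much closer to each other than the fourth/dessert codes, so the values can be ranked and thresholded rather than only tested for equality.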