Create clusters based on string similarity in pandas: answers

【Question title】: Create clusters based on string similarity in pandas
【Posted】: 2021-04-08 11:54:11
【Question】:

I have a list of about 200-300k names.

For example:

Names1	Names2
Mr.Reven	Alex
Freddie	Keven
Miss.Grey	Moeen
James	Shayne
Neoveeen	Frey
Boult	mcKay
Dr.Alen	Adames
Alsray	Miss. Slout

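Each value in Names1 should be compared with every value in Names2, and my pandas code should then create separate clusters such as Cluster-1, Cluster-2, Cluster-3, and so on. Each cluster should contain a list of names (with honorific prefixes or suffixes removed) whose similarity to one another is greater than 90%.

For example: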

Cluster-1	Cluster-2	Cluster-3
Frey	Reven	Moeen
Grey	Keven	Neoveeen

Is there a way to do this in pandas?

【Comments】:

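Have you tried fuzzy matching?
What is your exact definition of similar names, and how do you want the similarity value to be computed? It looks like you want names that share the same characters in the same order, but how do you define two names being more than 90% similar?
@Ukrainian-serge I have tried it, but not for this logic. I only tried it on a very small example, just to understand how it works.
@Mr. For Example The order does not have to match. As for what a 90% match means: there are many algorithms, such as Levenshtein distance, Jaro-Winkler, FuzzyWuzzy, and so on. Whichever works best for this logic should be used. It gives me a precise score, and based on that score I collect the names with the higher match percentage. Is my explanation clear now? Please let me know.
@RedVibes So, I just want to know: did my answer solve the problem, or do you think I missed something?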
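
To make the "what does 90% similar mean" point in the comments concrete, here is a minimal sketch that scores a few pairs from the question's expected clusters. It assumes the standard-library difflib.SequenceMatcher ratio as the metric, which is only one of the candidates mentioned above; the helper name `ratio` and the `pairs` list are illustrative, not from the thread.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    """Return 2*M/T: M = number of matched characters, T = combined length of both strings."""
    return SequenceMatcher(None, a, b).ratio()

# Pairs taken from the expected clusters in the question (honorifics already removed)
pairs = [("Grey", "Frey"), ("Reven", "Keven"), ("Neoveeen", "Moeen")]
scores = {pair: ratio(*pair) for pair in pairs}

for (a, b), s in scores.items():
    print(f"{a} vs {b}: {s:.2f}")
```

Under this particular metric the three pairs score 0.75, 0.80 and about 0.62, so none of them would pass a 0.9 cutoff. A different metric (Levenshtein, Jaro-Winkler) would give different numbers, which is why the choice of metric and threshold has to be settled before clustering.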

Tags: pandas dataframe cluster-analysis

【Solution 1】:

Sample code based on this similarity metric (difflib's SequenceMatcher.ratio()):

import pandas as pd
from difflib import SequenceMatcher
import numpy as np
import re

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def remove_prefix(s):
    # Keep only the last token after splitting on '.', space, '_' or '-'
    return re.split(r'\.| |_|-', s)[-1]

# Mimic dataframe
d = {'Names1': ['Mr.Reven', 'Freddie', 'Miss.Grey', 'James', 'Neoveeen', 'Boult', 'Dr.Alen', 'Alsray'], 
     'Names2': ['Alex', 'Keven', 'Moeen', 'Shayne', 'Frey', 'mcKay', 'Adames', 'Miss. Slout']}
df = pd.DataFrame(d)

# Get both name lists with the prefixes removed
remove_prefix_fv = np.vectorize(remove_prefix)
names1 = remove_prefix_fv(df['Names1'].to_numpy())
names2 = remove_prefix_fv(df['Names2'].to_numpy())

# Get the similarity score for every (Names1, Names2) pair via broadcasting
similar_fv = np.vectorize(similar)
scores = similar_fv(names1[:, np.newaxis], names2)

# Keep the pairs at or above the threshold (0.7 here; the question asks for 0.9)
threshold = 0.7
ind = np.where(scores >= threshold)

# Cluster the Names2 elements that match the same Names1 element
uc = np.unique(ind[0])
cd = {f"Cluster-{i}": [names1[r]] + list(names2[ind[1][ind[0] == r]]) for i, r in enumerate(uc)}

# Build the dataframe; wrapping each list in pd.Series pads clusters of unequal length with NaN
cdf = pd.DataFrame({k: pd.Series(v) for k, v in cd.items()})
print(cdf)

Output:

  Cluster-0 Cluster-1 Cluster-2 Cluster-3
0     Reven      Grey     James      Alen
1     Keven      Frey    Adames      Alex
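
The code above anchors every cluster on a single Names1 entry, so two Names1 values that both resemble the same Names2 value end up in separate clusters. One possible variation, sketched below, pools both columns and merges names transitively with a small union-find. This is an assumption about what "cluster" should mean, not the answer's method; `strip_honorific` and `cluster_names` are hypothetical helper names, and the 0.7 threshold is simply carried over from the answer.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

def strip_honorific(name: str) -> str:
    """Keep the last token after splitting on '.', space, '_' or '-' (same rule as the answer)."""
    return re.split(r"\.| |_|-", name)[-1]

def cluster_names(names, threshold=0.7):
    """Group names that are linked, directly or through a chain, by a score >= threshold."""
    cleaned = sorted({strip_honorific(n) for n in names})
    parent = {n: n for n in cleaned}  # union-find forest: every name starts as its own root

    def find(n):
        # Walk up to the root, halving the path along the way
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Compare every pair once and merge the two sets when the pair is similar enough
    for a, b in combinations(cleaned, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in cleaned:
        groups.setdefault(find(n), []).append(n)
    # Singletons are not clusters in the sense of the question
    return [g for g in groups.values() if len(g) > 1]

names1 = ['Mr.Reven', 'Freddie', 'Miss.Grey', 'James', 'Neoveeen', 'Boult', 'Dr.Alen', 'Alsray']
names2 = ['Alex', 'Keven', 'Moeen', 'Shayne', 'Frey', 'mcKay', 'Adames', 'Miss. Slout']
clusters = cluster_names(names1 + names2)
print(clusters)
```

On the sample data this yields the same four pairs as the output above. Note that both this sketch and the original code compare every pair of names; with 200-300k names that is on the order of tens of billions of comparisons, so some form of pre-grouping (for example by first letter or name length) would likely be needed before it becomes practical.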

【Comments】: