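[Posted]: 2021-02-17 07:46:46
[Question]:
I have a DataFrame with two columns of string data. I need to compare the columns 'NameTest' and 'Name': each name in 'NameTest' should be compared against all names in 'Name'. If the similarity is more than 80%, print the closest matching name.
My DataFrame: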
| | NameTest | Name |
|---|---|---|
| 0 | john carry | john carrt |
| 1 | alex midlane | john crat |
| 2 | robert patt | alex mid |
| 3 | david baker | alex |
| 4 | NaN | patt |
| 5 | NaN | robert |
| 6 | NaN | david baker |
My code:
from fuzzywuzzy import fuzz, process
import pandas as pd
import numpy as np
import difflib

cols = ["Name", "NameTest"]
df = pd.read_excel(
    r'D:\FFOutput\name.xlsx', usecols=cols,)  # Read Excel

for i, row in df.iterrows():
    na = row.Name
    ne = row.NameTest
    print([ne, na])
    for i in na:
        c = difflib.SequenceMatcher(isjunk=None, a=ne, b=na)
        diff = c.ratio()*100
        diff = round(diff, 1)
        if diff >= 80:
            print(na, diff)
Any suggestions?
Thanks for your help.
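
For reference, the comparison described above (each 'NameTest' value against every 'Name' value, keeping the closest match when it scores at least 80) could be sketched roughly as follows. This is only a minimal sketch: `best_matches` is a hypothetical helper name, the inputs are plain lists copied from the table rather than the Excel file, and it uses `difflib.SequenceMatcher` from the code above instead of fuzzywuzzy, so the scores may differ from what fuzzywuzzy would report.

```python
import difflib


def best_matches(test_names, names, threshold=80.0):
    """Return (test_name, closest_name, score) for every test name whose
    closest candidate scores at least `threshold` (0-100 scale)."""
    results = []
    for test_name in test_names:
        best_name, best_score = None, 0.0
        # Compare this test name against every candidate, not just the same row.
        for name in names:
            score = difflib.SequenceMatcher(None, test_name, name).ratio() * 100
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= threshold:
            results.append((test_name, best_name, round(best_score, 1)))
    return results


# Sample values copied from the table in the question (NaN rows left out).
# Assumption: with the real DataFrame, the inputs would be
# df["NameTest"].dropna() and df["Name"].dropna().
name_test = ["john carry", "alex midlane", "robert patt", "david baker"]
name = ["john carrt", "john crat", "alex mid", "alex", "patt", "robert", "david baker"]

for test_name, match, score in best_matches(name_test, name):
    print(test_name, "->", match, score)
```

Dropping the NaN values first matters because `SequenceMatcher` expects sequences such as strings, not floats. If fuzzywuzzy is preferred, `process.extractOne(query, choices, score_cutoff=80)` is one way to get a similar "best match above a cutoff" lookup, though its default scorer is not the same as `SequenceMatcher.ratio()`.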
[Comments]:
Tags: python pandas dataframe fuzzywuzzy