【Question title】: Pandas affects results of rapidfuzz match?
【Posted】: 2021-07-29 06:13:04
【Question description】:

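I've hit a wall with this. Rapidfuzz gives a different string similarity score when I run it inside a pandas dataframe than when I run it on its own. Why does Address Similarity 2 differ from the result of the last line?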

from rapidfuzz import process, utils, fuzz
import pandas as pd
import numpy as np

address_a = 'high new technology development zones huainan city anhui province china anhui anhui any city'
address_b = 'industrial park of funan city'

test_anui_data = {'Processed Client Name': ['anhui jinhan clothing co ltd'], 'Processed Aruvio Name': ['anhui jinhan clothing co ltd'], 'Processed Client Address': [address_a], 'Processed Aruvio Address': [address_b],  'Name Similarity': [89.2857142857142],  'Address Similarity': [np.nan]}  
  
# Create DataFrame  
test_anui = pd.DataFrame(test_anui_data)  
test_anui

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui['Processed Client Address']), str(test_anui['Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

【Comments】:

  • Question: where did you get 'Name Similarity': [89.2857142857142], 'Address Similarity': [np.nan] from?
  • I created them as an example.
  • Did you get different results too? How is that possible??

Tags: python pandas


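【Solution 1】:

The error comes from passing the whole column to fuzz. str(test_anui['Processed Client Address']) converts the entire Series to text, including the index label and the Name/dtype footer, so rapidfuzz compares those printed representations rather than the addresses themselves. If you do the following, i.e. apply fuzz to a single row, you get the same result: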

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.at[0,'Processed Client Address']), str(test_anui.at[0,'Processed Aruvio Address']))

print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

Alternatively, use .loc:

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.loc[0,'Processed Client Address']), str(test_anui.loc[0,'Processed Aruvio Address']))

print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

The output in the dataframe is:

    Processed Client Name         Processed Aruvio Name  \
0  anhui jinhan clothing co ltd  anhui jinhan clothing co ltd   

                            Processed Client Address  \
0  high new technology development zones huainan ...   

        Processed Aruvio Address  Name Similarity  Address Similarity  \
0  industrial park of funan city        89.285714                 NaN   

   Address Similarity 2  
0             28.099174  

fuzz.token_sort_ratio(address_a, address_b)
28.099173553719012
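To see why passing the whole column changes the score, it may help to look at what str() actually returns for a Series. The following is a minimal sketch using only pandas; the one-column frame df is a stand-in for the question's dataframe, and the exact printed layout can vary with pandas version and display options.

```python
import pandas as pd

address_a = ('high new technology development zones huainan city '
             'anhui province china anhui anhui any city')

# Hypothetical one-row frame mirroring the question's column name.
df = pd.DataFrame({'Processed Client Address': [address_a]})

# str() on the whole column renders the Series itself: the index label,
# the (possibly truncated) value, and a Name/dtype footer.
column_as_text = str(df['Processed Client Address'])
print(column_as_text)

# Selecting a single cell returns the plain Python string instead.
cell = df.loc[0, 'Processed Client Address']
print(cell == address_a)
```

Any similarity score computed on column_as_text is therefore a score for that printed representation, not for the address.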

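In other words, you need to specify which row you intend to take the strings from. I assume your dataframe consists of several rows, which means you have to do this for each row:

# Iterate over the index labels (.loc is label-based) and score one row at a time
for i in test_anui.index:
    test_anui.loc[i, 'Address Similarity 2'] = fuzz.token_sort_ratio(
        str(test_anui.loc[i, 'Processed Client Address']),
        str(test_anui.loc[i, 'Processed Aruvio Address']))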

【Discussion】:

  • Serge, you legend. How is it different from this solution? # for i, row in test_anui.iterrows(): # if pd.isnull(row['Address Similarity']) == True: # potentialMatchesc.loc[i, 'Address Similarity'] = fuzz.token_sort_ratio(str(row['Processed Client Address']), str(row['Processed Aruvio Address'])) # if pd.isnull(row['Name Similarity']) == True: # potentialMatchesc.loc[i, 'Name Similarity'] = fuzz.token_sort_ratio(str(row['Processed Client Name']), str(row['Processed Aruvio Name']))
  • You also gave a solution for the whole dataset, but I can't see it anymore?
  • It is exactly the same. Well done. If my answer is what you were looking for, please mark it as accepted. Have a nice day! As for your second comment about the solution for the whole dataset: I didn't change anything. It should still be there.