Comparing strings within two columns in pandas with SequenceMatcher

【Question title】: Comparing strings within two columns in pandas with SequenceMatcher
【Posted】: 2020-08-12 18:30:48
【Question】:

I am trying to determine the similarity between two columns in a pandas DataFrame:

Text1                                                                             All
Performance results achieved by the approaches submitted to this Challenge.       The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist.                             Where am I?

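I want to compare 'Performance results ...' with 'The six ...', and 'Accuracy is one ...' with 'Where am I?'. The first row should show a higher similarity between the two columns because they share some words; the second should equal 0 because the two columns have no words in common.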

To compare the two columns I used SequenceMatcher as follows:

from difflib import SequenceMatcher

ratio = SequenceMatcher(None, df.Text1, df.All).ratio()

But passing df.Text1 and df.All directly seems to be wrong.

Can you tell me why?

【Comments】:

Tags: python pandas nlp sequencematcher

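【Solution 1】:

SequenceMatcher is not designed to work on a pandas Series. Given two whole columns, it treats each one as a single sequence of elements instead of comparing the strings row by row.
You can .apply the function to each row instead.
See SequenceMatcher Examples in the difflib documentation.
- isjunk=None: nothing is treated as junk, not even spaces.
- isjunk=lambda y: y == " ": spaces are treated as junk.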
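To see why passing whole columns does not give a per-row result, here is a minimal sketch. It uses plain lists as stand-ins for the two DataFrame columns (an assumption made so the example runs without pandas), and the short strings are made up for illustration.

```python
from difflib import SequenceMatcher

# Plain lists stand in for the two DataFrame columns (illustrative data,
# not taken from the question).
text1 = ["ab", "cd"]
all_ = ["ab", "xy"]

# Whole columns: every string is one element of the sequence, so a single
# ratio comes back for the pair of columns.
whole = SequenceMatcher(None, text1, all_).ratio()

# One pair of strings at a time: characters are compared, and every row
# gets its own ratio.
per_row = [SequenceMatcher(None, a, b).ratio() for a, b in zip(text1, all_)]

print(whole)    # 0.5
print(per_row)  # [1.0, 0.0]
```

The row-wise .apply call does the same thing as the list comprehension here: it hands SequenceMatcher one pair of strings per row.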

from difflib import SequenceMatcher
import pandas as pd

data = {'Text1': ['Performance results achieved by the approaches submitted to this Challenge.', 'Accuracy is one of the basic principles of perfectionist.'],
        'All': ['The six top approaches and three others outperform the strong baseline.', 'Where am I?']}

df = pd.DataFrame(data)

# isjunk=lambda y: y == " "
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x['Text1'], x['All']).ratio(), axis=1)

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.356164
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.088235

# isjunk=None
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(None, x['Text1'], x['All']).ratio(), axis=1)

# display(df)
                                                                         Text1                                                                      All     ratio
0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.410959
1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.117647
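The ratios above are computed over characters, which is why the second row is not 0 even though the two strings share no words. If word-level similarity is what is wanted, as the question describes, one option is to hand SequenceMatcher lists of words instead of raw strings. The following is only a sketch: word_ratio is a hypothetical helper name, and lowercasing plus whitespace splitting are assumptions about how words should be matched (punctuation stays attached to the words).

```python
from difflib import SequenceMatcher

def word_ratio(a: str, b: str) -> float:
    """Similarity of two strings measured over lowercase, whitespace-split words."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

# The two rows from the question, as (Text1, All) pairs.
rows = [
    ("Performance results achieved by the approaches submitted to this Challenge.",
     "The six top approaches and three others outperform the strong baseline."),
    ("Accuracy is one of the basic principles of perfectionist.",
     "Where am I?"),
]

ratios = [word_ratio(a, b) for a, b in rows]
print(ratios)  # the second value is 0.0 because no word is shared
```

With the DataFrame from this answer, the same helper could be applied per row with df.apply(lambda x: word_ratio(x['Text1'], x['All']), axis=1).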

【Comments】:

Thank you very much, your solution works perfectly and saved me 2 hours.