How to find the closest words in a string in Python: Answers

【Question title】: How to find closest words in a string in Python
【Posted】: 2021-11-12 02:17:51
【Question description】:

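I am scraping the text of a long document in which the same ticker is repeated many times throughout. I am trying to find the specific occurrence of the ticker that is closest to another word (let's call it the base word). Here is an example in code: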

ticker = 'TRBCX'
base_word = 'goal'

string = 'TRBCX fund is up 2% today. TRBCX investment goal is to beat the S&P 500. TRBCX is managed by investment manager John Smith'

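I am trying to find a way to grab the text around/between the second TRBCX and the base word 'goal'. So basically I want to grab a phrase that looks like this and give it a name: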

identifier = ...  # placeholder: code to find the words around ticker and base_word
print(identifier)
# desired output: 'TRBCX investment goal' or 'today. TRBCX investment goal is'

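I will use the block of text (the identifier) to identify a new section. The position of the ticker I am interested in is different every time. Thank you very much for your help. I know this may seem confusing.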

【Comments】:

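A very brute-force approach would be to compute the Levenshtein distance for every word in the string and then sort the results from smallest to largest.
FuzzyWuzzy seems to be the de facto choice until you get into larger projects or more specific needs, at which point nltk may be your go-to, depending on the project's requirements. But those are just two of several libraries. Perhaps looking through Awesome Python would help.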
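
The brute-force idea in the first comment can be sketched as follows. This is a minimal illustration, not code from the thread: `levenshtein` and `rank_by_distance` are hypothetical helper names, and the distance is a plain dynamic-programming implementation so that the snippet has no third-party dependency.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rank_by_distance(target, text):
    """Return (word, distance) pairs for the unique words, closest first."""
    words = set(text.split())
    return sorted(((w, levenshtein(target, w)) for w in words),
                  key=lambda pair: (pair[1], pair[0]))


text = 'TRBCX fund is up 2% today. TRBCX investment goal is to beat the S&P 500.'
ranked = rank_by_distance('TRBCX', text)
print(ranked[0])   # ('TRBCX', 0)
print(ranked[-1])  # ('investment', 10)
```

Sorting by edit distance ranks words by how similar their spelling is to the target, so it says nothing about where in the text a word appears.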

Tags: python match scrape closest

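【Solution 1】:

I believe the best approach is to use the Levenshtein distance, which lets us compare strings and measure how similar they are. This gives you an objective metric for finding which words are closest to each other.

Based on your example: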

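# Install first: pip install python-Levenshtein (prefix with ! inside a notebook)
from pprint import pprint
from Levenshtein import distance as lev

ticker = 'TRBCX'
string = 'TRBCX fund is up 2% today. TRBCX investment goal is to beat the S&P 500. TRBCX is managed by investment manager John Smith'
# Map each distinct word to its edit distance from the ticker
distances = {key: lev(ticker, key) for key in string.split()}
pprint(distances)  # prints the dictionary sorted by key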

{'2%': 5,
 '500.': 5,
 'John': 5,
 'S&P': 5,
 'Smith': 5,
 'TRBCX': 0,
 'beat': 5,
 'by': 5,
 'fund': 5,
 'goal': 5,
 'investment': 10,
 'is': 5,
 'managed': 7,
 'manager': 7,
 'the': 5,
 'to': 5,
 'today.': 6,
 'up': 5}
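
The distances above measure how similar each word's spelling is to the ticker, which is a different notion of "closest" from the one in the question, namely which occurrence of the ticker sits nearest to the base word in the text. A minimal sketch of that positional reading follows. It is an illustration under assumptions rather than part of the original answer: `closest_occurrence` and its `padding` parameter are hypothetical, tokens are split on whitespace, and matching is exact.

```python
def closest_occurrence(text, ticker, base_word, padding=0):
    """Return the phrase spanning base_word and the nearest ticker occurrence.

    Assumes whitespace tokenization and exact token matches; returns None
    if either word is missing. `padding` adds extra tokens on each side.
    """
    tokens = text.split()
    ticker_positions = [i for i, tok in enumerate(tokens) if tok == ticker]
    base_positions = [i for i, tok in enumerate(tokens) if tok == base_word]
    if not ticker_positions or not base_positions:
        return None
    # Pick the (ticker, base_word) pair with the smallest token gap.
    t, b = min(((t, b) for t in ticker_positions for b in base_positions),
               key=lambda pair: abs(pair[0] - pair[1]))
    start = max(min(t, b) - padding, 0)
    end = min(max(t, b) + padding, len(tokens) - 1)
    return ' '.join(tokens[start:end + 1])


ticker = 'TRBCX'
base_word = 'goal'
string = ('TRBCX fund is up 2% today. TRBCX investment goal is to beat '
          'the S&P 500. TRBCX is managed by investment manager John Smith')

identifier = closest_occurrence(string, ticker, base_word)
print(identifier)  # TRBCX investment goal
print(closest_occurrence(string, ticker, base_word, padding=1))
# today. TRBCX investment goal is
```

Because matching is exact, a base word with attached punctuation (for example 'goal.') would not be found; stripping punctuation from the tokens before comparing is one way to relax that assumption.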

【Comments】: