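You can't. At least, not in a simple or standardized way.
Even the basics, such as how you define a "word", are far more complicated than you might think. A lot more. Both word parsing and lexical proximity (e.g. "are two words within distance D of each other in sentence s?") fall under natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this in a general way, but that's a whole other story.
Let's look at a simple approach. First, you need a way to parse a string into words. The following is crude by NLP standards, but works for simpler cases: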
import re

def parse_words(s):
    """
    Simple parser to grab English words from string.
    CAUTION: A simplistic solution to a hard problem.
    Many possibly-important edge- and corner-cases
    not handled. Just one example: Hyphenated words.
    """
    # a run of word characters, optionally followed by 's or 't
    return re.findall(r"\w+(?:'[st])?", s, re.I)
For example:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
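To make the caution in the docstring concrete, here is a small sketch of inputs where this regex may not split the way you'd expect. The sample strings are my own illustrations, not taken from the question.

```python
import re

def parse_words(s):
    # same regex as the parser above
    return re.findall(r"\w+(?:'[st])?", s, re.I)

# hyphenated words are split at each hyphen
print(parse_words("a state-of-the-art parser"))
# ['a', 'state', 'of', 'the', 'art', 'parser']

# only 's and 't suffixes are kept attached; 'll is cut off
print(parse_words("we'll see"))
# ['we', 'll', 'see']

# \w also matches digits and underscores
print(parse_words("version 2 of file_name"))
# ['version', '2', 'of', 'file_name']
```

Whether any of this matters depends on your data; for short free-text descriptions it may well be good enough.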
Next, you need a way to find every index at which a target word occurs in the list:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
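If the `index`/`cursor` loop feels heavy, the same result can be had in one pass with `enumerate`. This is just an equivalent sketch; `list_indices_enumerate` is a name I made up for illustration.

```python
def list_indices_enumerate(target, seq):
    """Hypothetical alternative: collect matching positions in one pass."""
    return [i for i, item in enumerate(seq) if item == target]

words = ['at', 'last', 'the', 'last', 'word']
print(list_indices_enumerate('last', words))   # [1, 3]
print(list_indices_enumerate('first', words))  # []
```

Either version returns the positions in ascending order, and an empty list when the word is absent.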
Finally, a decision wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')
    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`
    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]
    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False
    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s);
    # abs() so the order in which the two words appear does not matter
    actual_distances = [abs(i2 - i1) for i2 in target_indices[1] for i1 in target_indices[0]]
    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
Then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
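The wrapper compares every pair of positions, which is fine for short strings. For long texts with frequently repeated words, a two-pointer pass over the two index lists should give the smallest absolute gap in linear time, since `list_indices` returns positions in ascending order. A minimal sketch, where `min_gap` is a hypothetical helper and both lists are assumed sorted and non-empty:

```python
def min_gap(a, b):
    """
    Smallest absolute difference between any element of a and any
    element of b. Assumes both lists are sorted ascending and non-empty.
    """
    i = j = 0
    best = abs(a[0] - b[0])
    while i < len(a) and j < len(b):
        best = min(best, abs(a[i] - b[j]))
        # advance whichever pointer sits at the smaller position
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best

print(min_gap([3], [5, 10]))    # 2
print(min_gap([0, 9], [4, 7]))  # 2
```

For the sentence lengths you'd typically store in a DataFrame column, the all-pairs version is unlikely to be a bottleneck, so treat this as optional.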
All that's left is to map this back onto Pandas:
import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})
# one boolean column per distance threshold
df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
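That's basically how you'd solve the problem. Bear in mind that this is a crude and simple solution. Some simply posed questions are not simply answered. NLP questions are often among them.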