[Posted]: 2017-02-28 18:57:07
[Question]:
I am using Python 3.6 to find every occurrence of "as" + word(s) + "as" in a text, with three words of context on either side.
For example, if I run my program on
"The dog was as wildly energetic as the old one. It was as bright as it has ever been."
the ideal output would be
"The dog was as wildly energetic as the old one"
"one. It was as bright as it has ever"
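This should be easy, but I can't figure it out. (I'm very new to programming.) At first I tried doing this on a word-tokenized version of the text, but then thought it might be easier to use a regular expression on the raw string.
The best I have come up with is: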
#FINDING __ AS __ AS __ PATTERNS
raw = "The dog was as wildly energetic as the old one. It was as bright as it has ever been."
import re
pattern_find = re.compile(r'\w* as \w* as \w*') #Here we specify the regex code.
results = pattern_find.findall(raw) #Here we do the search and put the results in a list.
print(results)
which outputs
['was as bright as it']
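This completely misses the case where there are two words between the two occurrences of "as". That surprised me, because I thought that putting the asterisk * on \w would capture an arbitrarily long sequence of words. (What seems to be happening is that \w* captures an arbitrarily long run of consecutive word characters, not words.)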
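The behaviour described above can be checked directly. The following is a minimal sketch, not a complete solution: it shows that \w* stops at the first space, and that repeating a "word plus space" group between the two "as" tokens is one way to allow several words there. The names `multi_word` and `matches` are illustrative only, and the pattern does not yet add the three words of context.

```python
import re

raw = "The dog was as wildly energetic as the old one. It was as bright as it has ever been."

# \w matches letters, digits and underscore, but not whitespace,
# so \w* covers a single word at most.
print(re.match(r'\w*', "wildly energetic").group())  # wildly

# (?:\w+ )+? lazily repeats "word followed by a space", which lets one
# or more words sit between the opening and closing "as".
multi_word = re.compile(r'\w+ as (?:\w+ )+?as \w+')
matches = multi_word.findall(raw)
print(matches)
```

Because the group is non-capturing, findall() returns the whole matched text rather than only the group's contents.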
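My questions are:
- How can I get what I want using regular expressions?
- Is there a better way to achieve the result I want?
Note: I know I can use NLTK's concordance() to find a single word in context, but it does not let the user specify a pattern of words. An alternative to regular expressions might involve building a function on top of concordance().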
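One possible non-regex route, going back to the word-tokenized idea mentioned earlier, is sketched below. It is only an illustration under stated assumptions: the function name `as_x_as_in_context` and the parameters `context` and `max_gap` are hypothetical, and `max_gap` (the largest number of words allowed between the two "as" tokens) is an assumption the question does not specify. Tokens come from a plain whitespace split, so punctuation stays attached to words. Working on tokens also lets the context windows of neighbouring hits overlap, which non-overlapping regex matches would not allow.

```python
def as_x_as_in_context(text, context=3, max_gap=3):
    """Return each "as ... as" span with up to `context` tokens on either side.

    Hypothetical helper: `max_gap` limits how many words may sit between
    the opening and closing "as" (an assumption, not from the question).
    """
    tokens = text.split()  # whitespace tokens; punctuation stays attached
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() != "as":
            continue
        # Search for the closing "as" with 1..max_gap words in between.
        for j in range(i + 2, min(i + 2 + max_gap, len(tokens))):
            if tokens[j].lower() == "as":
                start = max(0, i - context)
                hits.append(" ".join(tokens[start:j + context + 1]))
                break
    return hits


raw = "The dog was as wildly energetic as the old one. It was as bright as it has ever been."
for hit in as_x_as_in_context(raw):
    print(hit)
```

With the example sentence this prints the two spans asked for, except that the first one ends in "one." with its full stop, since no punctuation is stripped.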
[Comments]:
Tags: regex python-3.x nltk