Can a Python regex negate a list of words? Answers

【Question title】: Can python regex negate a list of words?
【Posted】: 2011-11-30 09:04:12
【Question】:

I have to match all the alphanumeric words in a text.

>>> import re
>>> text = "hello world!! how are you?"
>>> final_list = re.findall(r"[a-zA-Z0-9]+", text)
>>> final_list
['hello', 'world', 'how', 'are', 'you']
>>>

This works fine, but I also have a few words to negate, i.e. words that should not appear in my final list.

>>> negate_words = ['world', 'other', 'words']

A bad way to do it:

>>> negate_str = '|'.join(negate_words)
>>> filter(lambda x: not re.match(negate_str, x), final_list)
['hello', 'how', 'are', 'you']

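But if my first regex pattern could be changed to take the negated words into account, I could save a loop. I found ways to negate characters, but I have whole words to negate. I also found regex-lookbehind in other questions, but that did not help either.

Can this be done with Python's re?

Update

My text can span a few hundred lines. The negate_words list can also be long.

Given that, is using regex for this kind of task the right approach in the first place? Any suggestions?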

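【Question comments】:

Are there many negate_words?
@bitsMiz Yes, there can be many negate words. And the text can span a few hundred lines.

Tags: python regex regex-negation

【Solution 1】:

I don't think there is a clean way to do this with a regular expression. The closest I could find is a bit ugly and not exactly what you asked for: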

>>> re.findall(r"\b(?:world|other|words)|([a-zA-Z0-9]+)\b", text)
['hello', '', 'how', 'are', 'you']
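
For comparison, a negative lookahead can fold the stop list into the pattern itself. This is only a sketch, assuming Python 3; the names `alternation`, `pattern`, and `kept` are illustrative and not part of the original answer. With a very long stop list the alternation grows accordingly, so filtering against a set may still be the simpler choice.

```python
import re

text = "hello world!! how are you?"
negate_words = ['world', 'other', 'words']

# Escape each word so regex metacharacters in the stop list are taken literally.
alternation = "|".join(re.escape(w) for w in negate_words)

# (?<![a-zA-Z0-9])             only start a match at the beginning of a word
# (?!(?:...)(?![a-zA-Z0-9]))   reject the position if a whole stop word starts here
# [a-zA-Z0-9]+                 otherwise consume the word as before
pattern = re.compile(
    r"(?<![a-zA-Z0-9])(?!(?:" + alternation + r")(?![a-zA-Z0-9]))[a-zA-Z0-9]+"
)

kept = pattern.findall(text)
print(kept)  # ['hello', 'how', 'are', 'you']
```

Because the lookahead requires the stop word to end at a word boundary, a longer word such as "worlds" is still kept.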

Why not use Python's sets? They are very fast:

>>> list(set(final_list) - set(negate_words))
['hello', 'how', 'are', 'you']

If order matters, see @glglgl's reply below. His list comprehension version is very readable. Here is a fast but less readable equivalent using itertools:

>>> import itertools
>>> negate_words_set = set(negate_words)
>>> list(itertools.ifilterfalse(negate_words_set.__contains__, final_list))
['hello', 'how', 'are', 'you']
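
The session above is Python 2. In Python 3 the same function is named `itertools.filterfalse`. A minimal sketch of the order-preserving filter under that assumption follows; the names `kept_lazy` and `kept_comp` are illustrative.

```python
from itertools import filterfalse

final_list = ['hello', 'world', 'how', 'are', 'you']
negate_words_set = {'world', 'other', 'words'}

# filterfalse keeps the items for which the predicate is false,
# i.e. the words that are NOT in the stop set; input order is preserved.
kept_lazy = list(filterfalse(negate_words_set.__contains__, final_list))

# The equivalent list comprehension.
kept_comp = [w for w in final_list if w not in negate_words_set]

print(kept_lazy)  # ['hello', 'how', 'are', 'you']
```

Unlike the set difference, both forms keep duplicates and the original word order.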

Another option is to use re.finditer to build the word list in a single pass:

>>> negate_words_set = set(negate_words)
>>> result = []
>>> for mo in re.finditer(r"[a-zA-Z0-9]+", text):
...     word = mo.group()
...     if word not in negate_words_set:
...         result.append(word)

>>> result
['hello', 'how', 'are', 'you']

【Comments】:

It is worth mentioning that word order is lost.
[i for i in final_list if i not in negate_words_set]
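@raymond, ah!! Are you sure about re? Anyway, +1; I can definitely replace my filter function with the sets you mentioned.
@simplyharsh I added a variant that uses a regex, but it does not fully solve the problem and is not elegant. The idea is to let the stop-list words match before the group that describes the words to include.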

【Solution 2】:

It may be worth trying pyparsing for this:

>>> from pyparsing import *

>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(Suppress(oneOf(negate_words)) ^ Word(alphanums)).ignore(CharsNotIn(alphanums))
>>> parser.parseString('hello world!! how are you?').asList()
['hello', 'how', 'are', 'you']

Note that oneOf(negate_words) must come before Word(alphanums) so that it is matched first.

Edit: just for fun, I repeated the exercise with lepl (another interesting parsing library):

>>> from lepl import *

>>> negate_words = ['world', 'other', 'words']
>>> parser = OneOrMore(~Or(*negate_words) | Word(Letter() | Digit()) | ~Any())
>>> parser.parse('hello world!! how are you?')
['hello', 'how', 'are', 'you']

【Comments】:

【Solution 3】:

Don't ask too much of regular expressions.
Consider generators instead.

import re

unwanted = ('world', 'other', 'words')

text = "hello world!! how are you?"

gen = (m.group() for m in re.finditer("[a-zA-Z0-9]+",text))
li = [ w for w in gen if w not in unwanted ]

A generator could also be created in place of the list li.

【Comments】:
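
A minimal sketch of that generator variant, assuming Python 3; the function name `iter_wanted_words` and the constant `WORD_RE` are hypothetical and not from the original answer. It yields words lazily, which suits text spanning many lines, and converts the stop list to a frozenset so each membership test does not scan the whole list.

```python
import re

WORD_RE = re.compile(r"[a-zA-Z0-9]+")

def iter_wanted_words(text, unwanted):
    """Yield the alphanumeric words of text that are not in unwanted."""
    unwanted = frozenset(unwanted)  # constant-time membership tests
    for m in WORD_RE.finditer(text):
        word = m.group()
        if word not in unwanted:
            yield word

words = iter_wanted_words("hello world!! how are you?", ('world', 'other', 'words'))
print(next(words))  # hello
print(list(words))  # ['how', 'are', 'you']
```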