Missing last word in a sentence when using regular expression: answers

【Question title】: Missing last word in a sentence when using regular expression
【Posted】: 2018-09-16 17:53:12
【Question description】:

Code:

import re

def main():
    a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
    b=word_find(a)
    print(b)

def word_find(sentence_list):
    word_list=[]
    word_reg=re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")
    for i in range(len(sentence_list)):
        words=re.findall(word_reg,sentence_list[i])
        word_list.append(words)
    return word_list

main()

I need to break each sentence into words, with every word as a separate list element.

The output currently looks like this:

[['the', 'mississippi', 'is', 'well', 'worth', 'reading'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways']]

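The last word of each sentence is missing: 'about' from the first sentence and 'remarkable' from the second.

There may be something wrong with my regular expression: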

word_reg=re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")

But if I add a question mark to the last part of this regular expression, like this:

[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?")

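the result becomes many single letters instead of words. What can I do about it?

Edit:

The reason I am not using string.split is that people may separate words in many different ways.

For example, when someone types a--b there is no space, but we still have to break it into 'a', 'b'.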

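【Question comments】:

Is there a reason you don't want to split the string on spaces, as in string.split(' ')?
I edited the question to explain why not string.split(" ").

Tags: python regex

【Solution 1】:

Using the right tool is always a winning strategy. In your case, the right tool is the NLTK word tokenizer, because it is designed to do exactly this: break sentences into words.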

import nltk
a = ['the mississippi is well worth reading about', 
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
nltk.word_tokenize(a[1])
#['it', 'is', 'not', 'a', 'commonplace', 'river', ',', 'but', 
# 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']
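Note that the tokenizer output above keeps ',' as a token of its own. If only words are wanted, one possible follow-up step is to drop tokens that contain no letters or digits. The sketch below is an assumption about what "words only" means here; it works on a copy of the token list shown above, so it does not need NLTK installed.

```python
# Token list copied from the word_tokenize output shown above.
tokens = ['it', 'is', 'not', 'a', 'commonplace', 'river', ',', 'but',
          'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']

# Keep a token only if at least one of its characters is a letter or digit.
# This drops pure punctuation such as ',' (an assumed definition of "word").
words = [t for t in tokens if any(ch.isalnum() for ch in t)]
print(words)
```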

【Comments】:

【Solution 2】:

I suggest a simpler solution:

# a is a list of sentences, so split each sentence separately
b = [re.split(r"[\W_]", s) for s in a]

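The regex [\W_] matches any single non-word character (anything that is not a letter, a digit, or an underscore) plus the underscore itself, which is enough in practice.

Your current regex requires each word to be followed by one of the characters in the list, but it does not accept the end of the line, which can be matched with $. That is why the last word of each sentence is dropped.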
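The point about $ can be sketched as follows. The delimiter class here is a simplified stand-in for the one in the question (an assumption, not the asker's exact pattern), and the three compiled names are hypothetical. The sketch compares a required trailing delimiter, an optional one, and "delimiter or end of line".

```python
import re

# Simplified, assumed delimiter set; adjust it to the real input.
DELIM = r"""[(),'":\[\]{} \t;-]"""

# 1. Trailing delimiter required: the last word has nothing after it, so it is lost.
required_tail = re.compile(DELIM + r"?(.+?)" + DELIM)
# 2. Trailing delimiter optional: the lazy (.+?) is satisfied after one character.
optional_tail = re.compile(DELIM + r"?(.+?)" + DELIM + r"?")
# 3. Trailing delimiter OR end of line: the last word is kept.
tail_or_end = re.compile(DELIM + r"?(.+?)(?:" + DELIM + r"|$)")

sentence = "the mississippi is well worth reading about"
print(required_tail.findall(sentence))       # stops at 'reading'
print(optional_tail.findall(sentence)[:3])   # ['t', 'h', 'e']
print(tail_or_end.findall(sentence))         # includes 'about'
print(tail_or_end.findall("a--b"))           # ['a', 'b']
```

This only handles one delimiter between words plus an optional leading one; longer runs of mixed delimiters are better served by the re.split approaches in the other answers.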

【Comments】:

【Solution 3】:

You can use re.split and filter:

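# a is a list of sentences; list() is needed in Python 3, where filter returns an iterator
[list(filter(None, re.split(r"[, \-!?:]+", s))) for s in a]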

Where I put the string "[, \-!?:]+", you should put whatever characters act as delimiters. filter removes any empty strings caused by leading or trailing delimiters.
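To make the role of filter concrete, here is a small sketch using the same illustrative delimiter class. The helper name split_words and the sample string are made up for this example. It shows the empty strings that re.split produces for a leading space and a trailing '?', and the result once they are removed.

```python
import re

# Illustrative delimiter class from the answer; note it also splits on a single '-'.
DELIMITERS = r"[, \-!?:]+"

def split_words(sentence):
    """Split on runs of delimiters and drop the empty strings at the edges."""
    parts = re.split(DELIMITERS, sentence)
    return [p for p in parts if p]  # same effect as list(filter(None, parts))

sample = " it is not a--b, ok?"
raw = re.split(DELIMITERS, sample)
print(raw)                  # ['', 'it', 'is', 'not', 'a', 'b', 'ok', '']
print(split_words(sample))  # ['it', 'is', 'not', 'a', 'b', 'ok']
```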

【Comments】:

【Solution 4】:

You can match what you don't want and split on it:

>>> a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
>>> [re.split(r'\W+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['', 'it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]

(You may need to filter out the '' elements that re.split produces.)

Or use re.findall to capture what you do want and keep those elements:

>>> [re.findall(r'\b\w+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
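As a quick check against the a--b case from the question, and to show how \w treats underscores, here is a minimal sketch. The helper find_words and its keep_underscore flag are hypothetical names introduced only for this illustration.

```python
import re

def find_words(text, keep_underscore=True):
    """Return runs of word characters; optionally treat '_' as a separator."""
    # \w matches letters, digits and '_'; [^\W_] is the same set without '_'.
    pattern = r"\w+" if keep_underscore else r"[^\W_]+"
    return re.findall(pattern, text)

print(find_words("a--b"))                                    # ['a', 'b']
print(find_words("snake_case word"))                         # ['snake_case', 'word']
print(find_words("snake_case word", keep_underscore=False))  # ['snake', 'case', 'word']
```

Because findall collects matches instead of splitting, no empty strings appear for leading or trailing delimiters.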

【Comments】:

【Solution 5】:

Thanks, everyone.

Judging from the other answers, the solution is to use re.split().

And the top answer features a SUPER STAR: NLTK.

import re

def word_find(sentence_list):
    word_list=[]
    for sentence in sentence_list:
        # raw string so the backslash escapes reach the regex engine unchanged
        word_list.append(re.split(r'\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;', sentence))
    return word_list

【Comments】:

There is no need for so many | (or) operators; try this: [(),'":\[\]{} \t;-]+
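The two patterns are not quite equivalent, which the following sketch tries to show. ALTERNATION is the pattern from the answer, CHAR_CLASS is the one suggested in the comment, and KEEP_SINGLE_HYPHEN is a hypothetical variant (not from the thread) that still requires two or more hyphens. The sample text is made up.

```python
import re

ALTERNATION = r"""\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;"""
CHAR_CLASS = r"""[(),'":\[\]{} \t;-]+"""
# Assumed variant: same characters, but a hyphen only counts when doubled.
KEEP_SINGLE_HYPHEN = r"""(?:[(),'":\[\]{} \t;]|--+)+"""

text = "river, but well-known a--b"

# One delimiter per match, so ', ' leaves an empty string between the two matches.
print(re.split(ALTERNATION, text))
# Runs of delimiters collapse, but a single '-' now splits 'well-known' too.
print(re.split(CHAR_CLASS, text))
# Runs collapse and 'well-known' stays intact.
print(re.split(KEEP_SINGLE_HYPHEN, text))
```

Which behavior is right depends on whether single hyphens should separate words in the input at hand.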