【Question title】: Find sentence containing certain expression with regex
【Posted】: 2018-11-23 10:59:05
【Question description】:

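This is for a school programming project, and I am only allowed to use import re.

I am trying to find every sentence in a text file that contains a specific expression, passed in as a parameter, and collect those sentences into a list. From other posts I learned to use the dots at the start and end of a sentence as delimiters, but a number with a dot in it breaks the result.

Suppose I have this txt: This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working.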

search = re.findall(r"([^.]*?" + expression + r"[^.]*)\.", txt)

The result I get is ['576, I want to extract the phrase with this expression']

The result I want is ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']

I am still a beginner. Any help?

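【Question comments】:

  • First search for the dots between digits and replace them with commas. Then split your text and, in the resulting phrases, look for the numbers with commas again and turn each comma back into a dot.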
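The steps in the comment above can be sketched with re alone. This is only a sketch under a few assumptions: the helper name find_sentences is made up, sentences are assumed to end with a dot only, and a NUL placeholder is used instead of the comma the comment suggests, so that genuine commas such as 1,000 are not turned into dots when restoring.

```python
import re

PLACEHOLDER = "\x00"  # assumption: this character never occurs in the input


def find_sentences(txt, expression):
    """Hypothetical helper: return the sentences of txt containing expression."""
    # 1. Hide dots that sit between two digits, e.g. 990.576
    masked = re.sub(r"(?<=\d)\.(?=\d)", PLACEHOLDER, txt)
    # 2. Each run of non-dot characters followed by a dot is one sentence
    sentences = re.findall(r"[^.]+\.", masked)
    # 3. Put the hidden dots back and trim the leading space
    restored = [s.replace(PLACEHOLDER, ".").strip() for s in sentences]
    # 4. Keep only the sentences that contain the expression
    return [s for s in restored if expression in s]


txt = ("This is a text. I dont want for the result to stop in the number "
       "990.576, I want to extract the phrase with this expression. "
       "Its not working.")
print(find_sentences(txt, "expression"))
# ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']
```

Sentences ending in ? or ! are not handled here; the character class in step 2 would need to be widened for that.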

Tags: python regex findall


【Solution 1】:

If I am not mistaken, you want to split the text into sentences. A suitable regex for that is:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', txt)

If that does not work, you can first replace the extra dots inside the sentences (the ones in numbers) with commas using this regex:

txt = re.sub(r'(\d*)\.(\d+)', r'\1,\2', txt)

【Comments】:

    【Solution 2】:

    Tokenize the text into sentences with NLTK, then use a whole-word search or a plain substring check.

    Whole-word search example:

    import nltk, re
    text = "This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working."
    sentences = nltk.sent_tokenize(text)
    word = "expression"
    print([sent for sent in sentences if re.search(r'\b{}\b'.format(word), sent)])
    # => ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']
    

    If you do not need a whole-word search, replace if re.search(r'\b{}\b'.format(word), sent) with if word in sent
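To make the difference between the two checks concrete, here is a small sketch that needs only re. The two sample sentences and the search word are made up for illustration, and re.escape is my addition, there in case the word contains regex metacharacters.

```python
import re

# Made-up sample sentences, not from the question
sentences = [
    "I want to extract the phrase with this expression.",
    "Now press the button.",
]
word = "press"

# Substring check: "press" is also found inside "expression"
substring_hits = [s for s in sentences if word in s]

# Whole-word check: \b demands a word boundary on both sides of the word
pattern = re.compile(r"\b{}\b".format(re.escape(word)))
whole_word_hits = [s for s in sentences if pattern.search(s)]

print(substring_hits)
# ['I want to extract the phrase with this expression.', 'Now press the button.']
print(whole_word_hits)
# ['Now press the button.']
```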

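    【Comments】:

      【Solution 3】:

      Maybe not the best solution, but you can split the text into sentences first and then look for the expression in each one, like this: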

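      import re

      # Sample text from the question
      text = "This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working."

      # Split on whitespace that follows a sentence-ending . or ?
      sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

      # Keep the sentences that contain the expression
      matching = [s for s in sentences if "I want to extract the phrase with this expression" in s]

      print(matching)

      # Result:
      # ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']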
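For anyone wondering what the lookbehinds in that pattern do, the sketch below runs it on a made-up sample (my own sentence, not from the question). As I read the pattern: it splits on a whitespace character preceded by . or ?, unless the preceding characters look like an abbreviation of the form "i.e." or a title of the form "Mr.". The dot in 990.576 is never a split point because no whitespace follows it.

```python
import re

# Pattern copied from the answer above
pattern = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"

# Made-up sample for illustration
sample = "Mr. Smith paid 990.576 dollars, i.e. a lot. Did it work? Yes."

# (?<!\w\.\w.)      blocks the split after "i.e."
# (?<![A-Z][a-z]\.) blocks the split after "Mr."
# (?<=\.|\?)        requires a . or ? right before the whitespace
parts = re.split(pattern, sample)
print(parts)
# ['Mr. Smith paid 990.576 dollars, i.e. a lot.', 'Did it work?', 'Yes.']
```

Note that ! is not in the final lookbehind, so a sentence ending in ! stays attached to the next one.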
      

      Hope this helps!

      【Comments】:
