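How to remove a custom word pattern from a text using NLTK with Python: answers

【Question title】: How to remove a custom word pattern from a text using NLTK with Python
【Posted】: 2015-08-22 00:07:08
【Question description】:

I am currently working on a project that analyzes the quality of exam paper questions. I am using Python 3.4 and NLTK.
First, I want to extract each question separately from the text. The exam paper has the following format.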

 (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........

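Now I want to extract the questions one by one without their question numbers (the question number format is always the same as shown above). So my result should look like this.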

 What is web 3.0?
 Explain about blogs.
 What is mean by semantic web?

How can I solve this using Python 3.4 with NLTK?
Thanks.

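【Question discussion】:

Why do you need NLTK? It looks like you can remove this with a simple regular expression.
Yes, sir. I am using NLTK for further analysis. I do not know whether NLTK is needed for this task. In any case, could you tell me how to do it with a regular expression?
Use re.sub: docs.python.org/2/library/re.html#re.sub

Tags: python regex nlp nltk tokenize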

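【Solution 1】:

You probably need to detect the lines that contain a question, then extract the question and drop the question number. A regular expression that detects the question label is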

import re
qnum_pattern = r"^\s*\(Q\d+\)\.\s+"

You can use it to pull out the questions like this:

questions = [re.sub(qnum_pattern, "", line) for line in text
             if re.search(qnum_pattern, line)]

Obviously, text must be a list of lines or a file opened for reading.
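To make the snippet above concrete, here is a minimal self-contained sketch for Python 3. The sample string `paper` and the helper name `extract_questions` are assumptions made for illustration and are not part of the original answer; `splitlines()` supplies the list of lines that the answer calls for.

```python
import re

# Same pattern as in the answer: optional leading whitespace,
# the label "(Q<digits>).", then at least one whitespace character.
QNUM_PATTERN = re.compile(r"^\s*\(Q\d+\)\.\s+")


def extract_questions(lines):
    """Return the question text of every line that starts with a (Qn). label."""
    questions = []
    for line in lines:
        if QNUM_PATTERN.search(line):
            # Drop the label, then trim any trailing newline or whitespace.
            questions.append(QNUM_PATTERN.sub("", line).strip())
    return questions


# Hypothetical sample input that mirrors the exam format in the question.
paper = """ (Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?
       and so on ........
"""

for question in extract_questions(paper.splitlines()):
    print(question)
```

Lines without a label, such as the trailing "and so on" line, are skipped rather than passed through. The added `strip()` is a choice of this sketch: it matters when `lines` is an open file, because each line then still ends with a newline.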

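However, if you did not know how to approach this, the rest of the assignment will be a lot of work for you. I recommend spending some time on the Python tutorial or other introductory material.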

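【Discussion】:

Thank you very much for your kind reply.

【Solution 2】:

If every sentence starts with this pattern, what you are asking for is easy to parse, and you can use split to strip the prefix: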

sentences = [ "(Q1). What is web 3.0?",
              "(Q2). Explain about blogs.",
              "(Q3). What is mean by semantic web?"]
for sen in sentences:
    print(sen.split('). ', 1)[1])

This prints:

What is web 3.0?
Explain about blogs.
What is mean by semantic web?

【Discussion】:

【Solution 3】:

If (QX) is always separated from the text by a space, you can do this:

>>> text = """(Q1). What is web 3.0?
...  (Q2). Explain about blogs.
...  (Q3). What is mean by semantic web?"""
>>> for line in text.split('\n'):
...     print(line.strip().partition(' ')[2])
... 
What is web 3.0?
Explain about blogs.
What is mean by semantic web?
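For reference, partition always returns a three-element tuple (head, separator, tail), which is why index 2 holds the question text. The following script version is a rough sketch of the same idea for Python 3; the helper name `drop_label` is hypothetical.

```python
text = """(Q1). What is web 3.0?
 (Q2). Explain about blogs.
 (Q3). What is mean by semantic web?"""


def drop_label(line):
    """Return the text after the first space of the stripped line."""
    # partition always returns a 3-tuple: (head, separator, tail).
    label, separator, question = line.strip().partition(" ")
    # If the line contains no space, separator and tail are empty strings,
    # so an empty string is returned rather than an exception being raised.
    return question


questions = [drop_label(line) for line in text.splitlines()]
print(questions)
```

Because this approach cuts at the first space regardless of what precedes it, it would also trim the first word of any line that has no label.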

【Discussion】: