How to tokenize a paragraph which has a numbered list into multiple sentences using Python? (Answers)

【Question title】: How to tokenize a paragraph which has a numbered list into multiple sentences using Python?
【Posted】: 2018-10-17 08:40:39
【Question description】:

I want to split a paragraph into multiple sentences. The paragraphs contain numbered sentences, as in these two samples:

Hello, How are you? Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John. 

Product is good but the managemnt is very lazy very bad. I dont like company service. They are giving fake promises. Next time i will not take any product. For Amazon service i will give 5 star dey give awsome service. But for sony company i will give 0 star... 1. Doesn't support all file formats when you connect USB 2. No other apps than YouTube and Netflix (requires subscription) 3. Screen mirroring is not up to the mark ( getting connected after once in 10 attempts 4. Good screen quality 5. Audio is very good 6. Bulky compared to other similar range 7. Price bit high due to brand value 8. its 1/4 smart TV. Not a full smart TV 9. Bad customer support 10. Remote control is very horrible to operate. it might be good for non smart TV 11. See the exchange value on amazon itself. LG gets 2ooo/- more than TV's 12. Also it was mentioned like 1+1 year warranty. But either support or Amazon support aren't clear about it. 13. Product information isn't up to 30% at least.There no installation. While I contact costumer Care.

I use the following code to split the sentences:

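import nltk

# Punkt sentence tokenizer, constructed without any training text
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read the whole input file; the with-block closes it automatically
with open("/Users/Desktop/sample.txt", encoding='utf-8') as fp:
    data = fp.read()

# Append the sentences to the output file, one per line
with open("/Users/Desktop/output.txt", 'a', encoding='utf-8') as f:
    f.write('\n'.join(tokenizer.tokenize(data)))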

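This code splits the paragraph on periods. The numbered sentences cause a problem, though: each number is followed by a period, so the text is split in the wrong places.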
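
To make the failure mode concrete, here is a minimal sketch using a naive period-based rule written with the standard `re` module. It is not Punkt, and the `sample` string is a shortened, hypothetical variant of the review text above; it only illustrates why a period after a list number is ambiguous.

```python
import re

# Shortened, hypothetical sample in the style of the review above.
sample = "I'm fine. 1. Good screen quality 2. Audio is very good."

# Naive rule: break at any whitespace that follows a period.
naive = re.split(r'(?<=\.)\s+', sample)
print(naive)
```

This prints `["I'm fine.", '1.', 'Good screen quality 2.', 'Audio is very good.']`: the marker `1.` is cut off from its item, and `2.` ends up glued to the previous item.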

Can anyone give me some advice?

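【Question comments】:

Yes. I have tried sent_tokenize. I can split the paragraph into sentences, but I still run into the problem with the numbered list.
sent_tokenize works fine. See my answer.
For paragraph 1 it splits correctly, but for paragraph 2 it does not.

Tags: python nlp nltk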

【Solution 1】:

You need sent_tokenize:

from nltk.tokenize import sent_tokenize

text = "Hello, How are you? Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John."

print(sent_tokenize(text))

Output

['Hello, How are you?', 'Hope everything is good.', "I'm fine.", '1.Hello World.', '2.Good Morning John.']

【Comments】:

I have added one more paragraph to the question. For that particular paragraph, the split does not work.

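【Solution 2】:

@AkshayNevrekar @fervent sent_tokenize uses PunktSentenceTokenizer by default, so you should both get the same result. https://www.nltk.org/api/nltk.tokenize.html

nltk.tokenize.sent_tokenize(text, language='english'): Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).

Maybe the two of you have different versions of NLTK?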

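According to https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English.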

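This module uses a machine-learning algorithm to cut your text, and you are using a tokenizer that has already been trained. If you are not satisfied with the result, you need to train the tokenizer yourself on a collection of texts similar to the ones you want to split. Splitting text into sentences is not an easy task, and you will probably never be 100% satisfied with this algorithm. You have to accept some errors, because its behavior is hard to predict.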
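
Training on your own texts could look roughly like the sketch below. It assumes NLTK is installed and relies on `PunktSentenceTokenizer` accepting training text in its constructor. The `training_text` corpus is a hypothetical toy stand-in: a real corpus would have to be far larger, and no particular split is guaranteed for numbered lists.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical toy corpus; in practice, use a large collection of texts
# that resemble the reviews you want to split.
training_text = " ".join([
    "Product is good but the service is slow.",
    "Audio is very good. Screen quality is good.",
    "Remote control is hard to operate. Price is a bit high.",
    "Customer support did not answer. I will not buy again.",
])

# Passing text to the constructor trains the Punkt parameters on it.
tokenizer = PunktSentenceTokenizer(training_text)

text = "Audio is very good. Remote control is hard to operate. Price is a bit high."
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
```

Whether the retrained model handles markers such as `1.` better depends entirely on the training data, so compare its output on a few representative paragraphs before relying on it.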

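You can try to implement your own algorithm based on rules you define yourself. For example (not perfect, but it gives you the expected number of sentences):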

import re
text = "Hello, How are you? Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John. Product is good but the managemnt is very lazy very bad. I dont like company service. They are giving fake promises. Next time i will not take any product. For Amazon service i will give 5 star dey give awsome service. But for sony company i will give 0 star... 1. Doesn't support all file formats when you connect USB 2. No other apps than YouTube and Netflix (requires subscription) 3. Screen mirroring is not up to the mark ( getting connected after once in 10 attempts 4. Good screen quality 5. Audio is very good 6. Bulky compared to other similar range 7. Price bit high due to brand value 8. its 1/4 smart TV. Not a full smart TV 9. Bad customer support 10. Remote control is very horrible to operate. it might be good for non smart TV 11. See the exchange value on amazon itself. LG gets 2ooo/- more than TV's 12. Also it was mentioned like 1+1 year warranty. But either support or Amazon support aren't clear about it. 13. Product information isn't up to 30% at least.There no installation. While I contact costumer Care."
# Raw string avoids invalid-escape warnings; re.findall already returns a list
print(re.findall(r'.*?[a-z].*?[0-9a-z][?.!]+', text))

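With this kind of algorithm it is easier to get predictable results. But it does not work well on unexpected text, because it is hard to find rules that fit every sentence.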
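
As a further illustration of the rule-based route, here is a sketch with the rules spelled out one by one. It is a different rule set from the one-line regex above, not a replacement endorsed by NLTK. The function `split_sentences`, the three patterns and the `sample` string are all hypothetical names and data, and the rules assume that a list marker is one to three digits, a period, and then a letter.

```python
import re

# Whitespace right before a list marker such as "1." or "12.",
# where the marker is followed (optionally after spaces) by a letter.
BEFORE_ITEM = re.compile(r'\s+(?=\d{1,3}\.\s*[A-Za-z])')
# A list marker at the very start of a chunk.
LEADING_MARKER = re.compile(r'\d{1,3}\.\s*')
# Whitespace right after sentence-ending punctuation.
AFTER_STOP = re.compile(r'(?<=[.?!])\s+')


def split_sentences(text):
    """Split text into sentences, keeping each list marker with its item."""
    sentences = []
    for chunk in BEFORE_ITEM.split(text.strip()):
        # Set the marker aside so its period is not treated as a sentence end.
        match = LEADING_MARKER.match(chunk)
        marker = match.group(0) if match else ''
        body = chunk[len(marker):]
        parts = [p for p in AFTER_STOP.split(body) if p] or ['']
        parts[0] = marker + parts[0]
        sentences.extend(p for p in parts if p)
    return sentences


sample = ("I'm fine. 1.Hello World. 2.Good Morning John. "
          "But i will give 0 star... 1. Good screen quality "
          "2. its 1/4 smart TV. Not a full smart TV")
result = split_sentences(sample)
for sentence in result:
    print(sentence)
```

On this sample the function returns seven sentences, with `1.Hello World.` and `2. its 1/4 smart TV.` each kept intact. The rules are still easy to fool: a sentence that ends in a number followed by a new sentence (for example "once in 10. Then ...") would be mistaken for a list marker.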

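To help you choose a solution:

You know the input: try writing your own rule-based algorithm, and keep adding rules until you are satisfied with the result.
You receive unexpected input: the NLTK algorithm will probably do better, but you cannot be sure how it will split your text.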
