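【Posted】:2018-10-17 08:40:39
【Question】:
I want to split a paragraph into individual sentences. The paragraph contains numbered sentences, as in these two samples: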
Hello, How are you? Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John.
Product is good but the managemnt is very lazy very bad. I dont like company service. They are giving fake promises. Next time i will not take any product. For Amazon service i will give 5 star dey give awsome service. But for sony company i will give 0 star... 1. Doesn't support all file formats when you connect USB 2. No other apps than YouTube and Netflix (requires subscription) 3. Screen mirroring is not up to the mark ( getting connected after once in 10 attempts 4. Good screen quality 5. Audio is very good 6. Bulky compared to other similar range 7. Price bit high due to brand value 8. its 1/4 smart TV. Not a full smart TV 9. Bad customer support 10. Remote control is very horrible to operate. it might be good for non smart TV 11. See the exchange value on amazon itself. LG gets 2ooo/- more than TV's 12. Also it was mentioned like 1+1 year warranty. But either support or Amazon support aren't clear about it. 13. Product information isn't up to 30% at least.There no installation. While I contact costumer Care.
I use the following code to split the sentences:
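import nltk

# Punkt sentence tokenizer with default (untrained) parameters
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read the whole input file; the with-block closes it automatically
with open("/Users/Desktop/sample.txt", encoding='utf-8') as fp:
    data = fp.read()

# Append the sentences to the output file, one per line
with open("/Users/Desktop/output.txt", 'a', encoding='utf-8') as f:
    f.write('\n'.join(tokenizer.tokenize(data)))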
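This code splits the paragraph at periods, but the numbered sentences cause a problem: each number is followed by a period, so the text is split in the wrong places.
Can anyone suggest a fix?
【Comments】: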
-
Yes, I have tried sent_tokenize. It splits the paragraph into sentences, but the numbered list is still a problem.
-
sent_tokenize works fine. See my answer.
-
For paragraph 1 it splits correctly, but for paragraph 2 it does not.
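One possible workaround, sketched here and not taken from any answer in this thread, is to cut the text at the list markers first and only then run a sentence splitter on each piece, so the period after a number never reaches the tokenizer. The names `LIST_MARKER`, `naive_sentences`, and `split_with_numbered_items` are made up for illustration. The sketch assumes a marker is one to three digits plus a period, preceded by whitespace and followed by a letter; a sentence that merely ends in a number ("I rate it 5. Then ...") would be misread as a marker. The built-in splitter is a plain regex stand-in so the snippet runs without NLTK.

```python
import re

# Assumed marker shape: 1-3 digits and a period, preceded by whitespace (or the
# start of the text) and followed by optional spaces and a letter.
LIST_MARKER = re.compile(r"(?<!\S)(\d{1,3})\.(?=\s*[A-Za-z])")


def naive_sentences(text):
    """Stand-in splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_with_numbered_items(text, sentence_splitter=naive_sentences):
    """Split text into sentences, keeping each 'N.' marker with its item."""
    # With one capturing group, re.split returns
    # [text_before, number, body, number, body, ...].
    parts = LIST_MARKER.split(text)
    result = list(sentence_splitter(parts[0]))
    for number, body in zip(parts[1::2], parts[2::2]):
        sentences = list(sentence_splitter(body))
        if sentences:
            # Re-attach the marker to the first sentence of the item.
            sentences[0] = f"{number}. {sentences[0]}"
        result.extend(sentences)
    return result


if __name__ == "__main__":
    sample = ("Hello, How are you? Hope everything is good. I'm fine. "
              "1.Hello World. 2.Good Morning John.")
    for sentence in split_with_numbered_items(sample):
        print(sentence)
```

Note that the marker is rewritten with a space after it ("1.Hello" becomes "1. Hello"). To use NLTK for the actual sentence splitting, pass it in as the second argument, for example `split_with_numbered_items(data, nltk.sent_tokenize)`, assuming the Punkt model data has been downloaded.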