【Posted】: 2020-04-02 11:30:25
【Question】:
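I could not find a solution anywhere online, so I am asking my question here.
I want to split a given text at every punctuation mark: not only after each sentence, but also after, for example, a comma. So far I have come across the Natural Language Toolkit (nltk) and regular expressions, but I have not had any success.
This works quite well, but it does not quite do what I want: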
import nltk  # sent_tokenize may also require the Punkt tokenizer data to be downloaded

sample_text = """With this example I wanna make the point clear... I hope you get it! There are many coding
languages out there, but which is the best? I would say there's no best. Change my mind - if you can!"""
split_text = nltk.tokenize.sent_tokenize(sample_text)
print(split_text)
# Output: ['With this example I wanna make the point clear...', 'I hope you get it!', 'There are many coding languages out there, but which is the best?', "I would say there's no best.", 'Change my mind - if you can!']
This is already quite good, but ideally I would like an output that also splits the text at commas and hyphens. So the output would look like this:
[
'With this example I wanna make the point clear...',
'I hope you get it!',
'There are many coding languages out there,',
'but which is the best?',
"I would say there's no best.",
'Change my mind -',
'if you can!'
]
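It would probably be better to use a regular expression, wouldn't it? But somehow I could not get it to work. Thanks in advance, any help is appreciated!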
【Comments】:
- Try re.findall(r"\w['\w\s]*[^'\w\s]*", sample_text). I think you need to exclude ' and _ from the punctuation characters. See the Python demo.
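For anyone who wants to try the suggested pattern, here is a minimal sketch of how it could be applied to the sample text from the question. The function name split_on_punctuation and the whitespace clean-up step are my own additions for illustration, not part of the comment; the clean-up is there only because the sample text contains a line break, which the pattern would otherwise keep inside a chunk.

```python
import re

# Pattern from the comment: one word character, then any run of word
# characters, whitespace, or apostrophes, then any trailing punctuation.
CHUNK_PATTERN = re.compile(r"\w['\w\s]*[^'\w\s]*")


def split_on_punctuation(text):
    """Return chunks of `text`, each ending with its run of punctuation.

    Hypothetical helper name, chosen for this example only.
    """
    chunks = CHUNK_PATTERN.findall(text)
    # Collapse internal whitespace (including the line break in the sample)
    # and drop the space that is left dangling after a trailing word.
    return [" ".join(chunk.split()) for chunk in chunks]


sample_text = """With this example I wanna make the point clear... I hope you get it! There are many coding
languages out there, but which is the best? I would say there's no best. Change my mind - if you can!"""

for chunk in split_on_punctuation(sample_text):
    print(chunk)
```

Two things to be aware of with this approach: every match has to start with a word character, so punctuation at the very start of the text (an opening quote, for example) is dropped, and a hyphen inside a word such as "well-known" is treated as a split point just like the free-standing hyphen in the sample.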