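【Posted】: 2013-11-24 20:37:01
【Question】:
I'm using nltk to split text into sentence units. However, I need any sentence that contains a quotation to be extracted as a single unit. Right now every sentence, even one inside quotation marks, is extracted as a separate piece.
Here is an example of what I'm trying to extract as a single unit: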
"This is a sentence. This is also a sentence," said the cat.
Right now I have this code:
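import nltk.data
# Load the pre-trained Punkt sentence tokenizer for English.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = 'This is a sentence. This is also a sentence," said the cat.'
# Print each detected sentence, separated by a divider line.
print('\n-----\n'.join(tokenizer.tokenize(text, realign_boundaries=True)))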
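This works well, but I want a sentence that includes a quotation to stay together even when the quotation itself contains several sentences.
The code above produces: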
This is a sentence.
-----
This is also a sentence," said the cat.
I'm trying to extract the whole text as a single unit:
"This is a sentence. This is also a sentence," said the cat.
Is there a simple way to do this with nltk, or should I use a regular expression? I've been impressed by how easy it was to get started with nltk, but now I'm stuck.
【Comments】:
-
Which tokenizer are you using?
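-
One possible direction, offered only as a minimal sketch: keep the tokenizer as it is and post-process its output, rejoining consecutive pieces for as long as a double quotation mark is still open. The helper name `merge_quoted_sentences` is hypothetical and not part of nltk. It assumes straight double quotes that open and close in pairs, so curly quotes and nested quotations are not handled. It also relies on the opening quotation mark being present in the input, which the `text` literal in the posted code appears to lack.

```python
def merge_quoted_sentences(sentences, quote='"'):
    """Rejoin sentence pieces that were split inside a quotation.

    Hypothetical helper (not an nltk API). Assumes straight double
    quotes that open and close in pairs.
    """
    merged = []
    buffer = []
    open_quote = False
    for sentence in sentences:
        buffer.append(sentence)
        # An odd number of quote characters flips the "inside a quotation" state.
        if sentence.count(quote) % 2 == 1:
            open_quote = not open_quote
        if not open_quote:
            merged.append(' '.join(buffer))
            buffer = []
    # Unbalanced quotes: emit the leftover pieces instead of dropping them.
    if buffer:
        merged.append(' '.join(buffer))
    return merged


# Example input shaped like tokenizer output, with the opening quote present.
pieces = ['"This is a sentence.', 'This is also a sentence," said the cat.']
print(merge_quoted_sentences(pieces))
```

In practice the list passed in would be the result of `tokenizer.tokenize(text)`. Pieces are rejoined with a single space, so the original whitespace between sentences is not preserved.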
Tags: python regex python-2.7 nltk