First, if the file is encoded as UTF-8 and you are using Python 2, it is best to pass encoding='utf8' to io.open():
import io
from nltk import word_tokenize, sent_tokenize
with io.open('file.txt', 'r', encoding='utf8') as fin:
    document = []
    for line in fin:
        # Split the line into sentences, then each sentence into tokens.
        document += [word_tokenize(sent) for sent in sent_tokenize(line)]
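On Python 3, the built-in open() is enough (it accepts the same encoding argument):
from nltk import word_tokenize, sent_tokenize
with open('file.txt', 'r', encoding='utf8') as fin:
    document = []
    for line in fin:
        # Split the line into sentences, then each sentence into tokens.
        document += [word_tokenize(sent) for sent in sent_tokenize(line)]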
For background on handling Unicode in Python, take a look at http://nedbatchelder.com/text/unipain.html
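As for tokenization, if we assume that each line holds a paragraph made up of one or more sentences, we first initialize a list to store the whole document: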
document = []
Then we iterate over the lines and split each line into sentences:
for line in fin:
    sentences = sent_tokenize(line)
Then we split each sentence into tokens:
tokens = [word_tokenize(sent) for sent in sent_tokenize(line)]
Since we want the document list to accumulate the tokenized sentences, we extend it on every iteration:
document = []
for line in fin:
    document += [word_tokenize(sent) for sent in sent_tokenize(line)]
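The steps above can be wrapped in a small reusable function. This is only a sketch: tokenize_document and its sent_split / word_split parameters are hypothetical names introduced here, not part of NLTK. When no tokenizers are passed in, it falls back to NLTK's sent_tokenize and word_tokenize, which assumes the punkt models are installed.

```python
import io


def tokenize_document(path, sent_split=None, word_split=None, encoding='utf8'):
    """Return a list of sentences, each a list of tokens (hypothetical helper)."""
    if sent_split is None or word_split is None:
        # Import lazily so that custom tokenizers work without NLTK installed.
        from nltk import sent_tokenize, word_tokenize
        sent_split = sent_split or sent_tokenize
        word_split = word_split or word_tokenize

    document = []
    # io.open accepts encoding= on both Python 2 and Python 3.
    with io.open(path, 'r', encoding=encoding) as fin:
        for line in fin:
            # += extends the flat list of sentences;
            # append() would instead add one nested list per line.
            document += [word_split(sent) for sent in sent_split(line)]
    return document
```

Calling tokenize_document('file.txt') returns one flat list of tokenized sentences for the whole file, regardless of how many sentences each line contains.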
Not recommended!!! (but it can still be done in one line):
alvas@ubi:~$ cat file.txt
this is a paragph. with many sentences.
yes, hahaah.. wahahha...
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from itertools import chain
>>> from nltk import sent_tokenize, word_tokenize
>>> list(chain(*[[word_tokenize(sent) for sent in sent_tokenize(line)] for line in io.open('file.txt', 'r', encoding='utf8')]))
[[u'this', u'is', u'a', u'paragph', u'.'], [u'with', u'many', u'sentences', u'.'], [u'yes', u',', u'hahaah..', u'wahahha', u'...']]
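If the one-liner is used anyway, itertools.chain.from_iterable does the same flattening without unpacking the outer list with *. The sketch below shows only the flattening step; per_line is a hypothetical stand-in for the per-line tokenizer results, filled in by hand from the output above.

```python
from itertools import chain

# Hypothetical per-line results: each line yields a list of tokenized sentences.
per_line = [
    [['this', 'is', 'a', 'paragph', '.'], ['with', 'many', 'sentences', '.']],
    [['yes', ',', 'hahaah..', 'wahahha', '...']],
]

# chain.from_iterable flattens exactly one level, so the result is a
# list of sentences rather than a list of lines.
document = list(chain.from_iterable(per_line))
```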