【Question title】: NLTK tokenize - faster way?
【Posted】: 2017-06-14 04:53:51
【Question】:

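I have a method that takes a string parameter and uses NLTK to break the string down into sentences, then into words. It then converts each word to lowercase and finally builds a dictionary of the frequency of each word.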

import nltk
from collections import Counter

def freq(string):
    f = Counter()
    sentence_list = nltk.tokenize.sent_tokenize(string)
    for sentence in sentence_list:
        words = nltk.word_tokenize(sentence)
        words = [word.lower() for word in words]
        for word in words:
            f[word] += 1
    return f

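I am supposed to optimise the above code further to get a faster preprocessing time, but I am unsure how to do so. The return value should obviously be exactly the same as above, so I am expected to use nltk, although that is not explicitly required.

Is there any way to speed up the above code? Thanks.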

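【Comments】:

Questions about improving working code are generally considered off-topic on Stack Overflow. You can get help on Code Review.
Oh, OK. Good to know. I will try asking there.
I'm voting to close this question because "this question belongs on another site in the SE network": codereview.stackexchange.com
@lenz Questions about performance are on-topic on either site. SO tends to give higher quality answers.

Tags: python time-complexity nltk tokenize frequency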

【Answer 1】:

If you just want a flat list of tokens, note that word_tokenize calls sent_tokenize implicitly, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L98

_treebank_word_tokenize = TreebankWordTokenizer().tokenize
def word_tokenize(text, language='english'):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus
    """
    return [token for sent in sent_tokenize(text, language)
            for token in _treebank_word_tokenize(sent)]
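
Because word_tokenize already performs the sentence split, the function in the question arguably does not need its own sent_tokenize loop. A minimal sketch of that simplification follows. The tokenize parameter and the str.split stand-in are assumptions made so the snippet runs without NLTK data; with NLTK installed one would pass nltk.word_tokenize instead. Whether the counts match the original function on every input has not been verified here.

```python
from collections import Counter

def freq(string, tokenize):
    """Count lowercased tokens produced by `tokenize` (any callable: str -> iterable of str)."""
    # A generator expression feeds Counter directly, so no intermediate list is built.
    return Counter(token.lower() for token in tokenize(string))

# Stand-in tokenizer for illustration only; swap in nltk.word_tokenize for real use.
counts = freq("The cat saw the dog. The dog ran.", str.split)
print(counts.most_common(1))
```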

Using the Brown corpus as an example, with Counter(word_tokenize(string_corpus)):

>>> import time
>>> from collections import Counter
>>> from nltk.corpus import brown
>>> from nltk import sent_tokenize, word_tokenize
>>> string_corpus = brown.raw() # Plaintext, str type.
>>> start = time.time(); fdist = Counter(word_tokenize(string_corpus)); end = time.time() - start
>>> end
12.662328958511353
>>> fdist.most_common(5)
[(u',', 116672), (u'/', 89031), (u'the/at', 62288), (u'.', 60646), (u'./', 48812)]
>>> sum(fdist.values())
1423314

~1.4 million tokens took about 12.7 seconds on my machine (without saving the tokenized corpus):

alvas@ubi:~$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 69
model name  : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
stepping    : 1
microcode   : 0x17
cpu MHz     : 1600.027
cache size  : 3072 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 2

$ cat /proc/meminfo
MemTotal:       12004468 kB
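
The timings in this answer each come from a single time.time() measurement. A slightly more robust way to compare tokenizers is to repeat the run and keep the best time. The sketch below uses a hypothetical helper name, best_time, and str.split as a stand-in workload rather than a real NLTK tokenizer.

```python
import time
from collections import Counter

def best_time(func, arg, repeats=3):
    """Return (best elapsed seconds, last result) over `repeats` calls of func(arg)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(arg)
        best = min(best, time.perf_counter() - start)
    return best, result

# Synthetic input: 9 words repeated 10,000 times.
text = "the quick brown fox jumps over the lazy dog " * 10000
elapsed, fdist = best_time(lambda s: Counter(s.split()), text)
print("%.4f s, %d tokens" % (elapsed, sum(fdist.values())))
```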

Saving the tokenized corpus first, with tokenized_corpus = [word_tokenize(sent) for sent in sent_tokenize(string_corpus)], and then using Counter(chain(*tokenized_corpus)):

>>> from itertools import chain
>>> start = time.time(); tokenized_corpus = [word_tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start
>>> end
16.421464920043945
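
chain(*tokenized_corpus) unpacks the whole outer list into call arguments first. itertools.chain.from_iterable flattens the same structure without that unpacking step. A small sketch with a made-up tokenized corpus (whether it changes the timing above was not measured):

```python
from collections import Counter
from itertools import chain

# Toy stand-in for a list of tokenized sentences.
tokenized_corpus = [["the", "cat", "sat", "."], ["the", "dog", "ran", "."]]

# Both forms iterate over every token of every sentence in order.
fdist_unpacked = Counter(chain(*tokenized_corpus))
fdist_lazy = Counter(chain.from_iterable(tokenized_corpus))

print(fdist_lazy.most_common(2))
```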

Using ToktokTokenizer():

>>> from collections import Counter
>>> import time
>>> from itertools import chain
>>> from nltk.corpus import brown
>>> from nltk import sent_tokenize, word_tokenize
>>> from nltk.tokenize import ToktokTokenizer
>>> toktok = ToktokTokenizer()
>>> string_corpus = brown.raw()

>>> start = time.time(); tokenized_corpus = [toktok.tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start 
>>> end
10.00472116470337

Using MosesTokenizer():

>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer()
>>> start = time.time(); tokenized_corpus = [moses.tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start 
>>> end
30.783339023590088
>>> start = time.time(); tokenized_corpus = [moses.tokenize(sent) for sent in sent_tokenize(string_corpus)]; fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start 
>>> end
30.559681177139282

Why use MosesTokenizer?

It was implemented in such a way that the tokens can be reversed back into a string, i.e. "detokenized".

>>> from nltk.tokenize.moses import MosesTokenizer, MosesDetokenizer
>>> t, d = MosesTokenizer(), MosesDetokenizer()
>>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
>>> expected_tokens = [u'This', u'ain', u'&apos;t', u'funny.', u'It', u'&apos;s', u'actually', u'hillarious', u',', u'yet', u'double', u'Ls.', u'&#124;', u'&#91;', u'&#93;', u'&lt;', u'&gt;', u'&#91;', u'&#93;', u'&amp;', u'You', u'&apos;re', u'gonna', u'shake', u'it', u'off', u'?', u'Don', u'&apos;t', u'?']
>>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
>>> tokens = t.tokenize(sent)
>>> tokens == expected_tokens
True
>>> detokens = d.detokenize(tokens)
>>> " ".join(detokens) == expected_detokens
True
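
The tokens in expected_tokens are XML-escaped (&apos;, &#124;, &lt; and so on), which matters if you count or display them without detokenizing. As a sketch, and assuming you only need the literal characters back, the standard library's html.unescape reverses those entities. This is not part of the Moses API shown above.

```python
from html import unescape

# A few escaped tokens copied from expected_tokens above.
escaped_tokens = ["ain", "&apos;t", "&#124;", "&#91;", "&#93;", "&lt;", "&gt;", "&amp;"]

# html.unescape converts named and numeric character references back to characters.
plain_tokens = [unescape(token) for token in escaped_tokens]
print(plain_tokens)
```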

Using ReppTokenizer():

>>> from nltk.tokenize.repp import ReppTokenizer
>>> repp = ReppTokenizer('/home/alvas/repp')
>>> start = time.time(); sentences = sent_tokenize(string_corpus); tokenized_corpus = repp.tokenize_sents(sentences); fdist = Counter(chain(*tokenized_corpus)); end = time.time() - start
>>> end
76.44129395484924

Why use ReppTokenizer?

It returns the offsets of the tokens in the original string.

>>> sents = ['Tokenization is widely regarded as a solved problem due to the high accuracy that rulebased tokenizers achieve.' ,
... 'But rule-based tokenizers are hard to maintain and their rules language specific.' ,
... 'We evaluated our method on three languages and obtained error rates of 0.27% (English), 0.35% (Dutch) and 0.76% (Italian) for our best models.'
... ]
>>> tokenizer = ReppTokenizer('/home/alvas/repp/') # doctest: +SKIP
>>> for sent in sents:                             # doctest: +SKIP
...     tokenizer.tokenize(sent)                   # doctest: +SKIP
... 
(u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.')
(u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.')
(u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.')
>>> for sent in tokenizer.tokenize_sents(sents): 
...     print sent                               
... 
(u'Tokenization', u'is', u'widely', u'regarded', u'as', u'a', u'solved', u'problem', u'due', u'to', u'the', u'high', u'accuracy', u'that', u'rulebased', u'tokenizers', u'achieve', u'.')
(u'But', u'rule-based', u'tokenizers', u'are', u'hard', u'to', u'maintain', u'and', u'their', u'rules', u'language', u'specific', u'.')
(u'We', u'evaluated', u'our', u'method', u'on', u'three', u'languages', u'and', u'obtained', u'error', u'rates', u'of', u'0.27', u'%', u'(', u'English', u')', u',', u'0.35', u'%', u'(', u'Dutch', u')', u'and', u'0.76', u'%', u'(', u'Italian', u')', u'for', u'our', u'best', u'models', u'.')
>>> for sent in tokenizer.tokenize_sents(sents, keep_token_positions=True): 
...     print sent
... 
[(u'Tokenization', 0, 12), (u'is', 13, 15), (u'widely', 16, 22), (u'regarded', 23, 31), (u'as', 32, 34), (u'a', 35, 36), (u'solved', 37, 43), (u'problem', 44, 51), (u'due', 52, 55), (u'to', 56, 58), (u'the', 59, 62), (u'high', 63, 67), (u'accuracy', 68, 76), (u'that', 77, 81), (u'rulebased', 82, 91), (u'tokenizers', 92, 102), (u'achieve', 103, 110), (u'.', 110, 111)]
[(u'But', 0, 3), (u'rule-based', 4, 14), (u'tokenizers', 15, 25), (u'are', 26, 29), (u'hard', 30, 34), (u'to', 35, 37), (u'maintain', 38, 46), (u'and', 47, 50), (u'their', 51, 56), (u'rules', 57, 62), (u'language', 63, 71), (u'specific', 72, 80), (u'.', 80, 81)]
[(u'We', 0, 2), (u'evaluated', 3, 12), (u'our', 13, 16), (u'method', 17, 23), (u'on', 24, 26), (u'three', 27, 32), (u'languages', 33, 42), (u'and', 43, 46), (u'obtained', 47, 55), (u'error', 56, 61), (u'rates', 62, 67), (u'of', 68, 70), (u'0.27', 71, 75), (u'%', 75, 76), (u'(', 77, 78), (u'English', 78, 85), (u')', 85, 86), (u',', 86, 87), (u'0.35', 88, 92), (u'%', 92, 93), (u'(', 94, 95), (u'Dutch', 95, 100), (u')', 100, 101), (u'and', 102, 105), (u'0.76', 106, 110), (u'%', 110, 111), (u'(', 112, 113), (u'Italian', 113, 120), (u')', 120, 121), (u'for', 122, 125), (u'our', 126, 129), (u'best', 130, 134), (u'models', 135, 141), (u'.', 141, 142)]
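
Each (token, start, end) triple above indexes back into the original sentence, so sent[start:end] reproduces the token. The sketch below shows the same idea with a plain regular expression from the standard library. The pattern is a simplifying assumption and does not reproduce REPP's rules, and token_spans is a hypothetical helper name.

```python
import re

def token_spans(text):
    """Yield (token, start, end) for runs of word characters or single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-space, non-word character.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        yield match.group(), match.start(), match.end()

sent = "But rule-based tokenizers are hard to maintain."
spans = list(token_spans(sent))
print(spans[:3])
```

Unlike the REPP output, this pattern splits "rule-based" into three tokens, which is one reason the choice of tokenizer matters when offsets feed later processing.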

TL;DR

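Advantages of the different tokenizers

word_tokenize() calls sent_tokenize() implicitly
ToktokTokenizer() is the fastest
MosesTokenizer() is able to detokenize text
ReppTokenizer() is able to provide token offsets

Q: Is there a fast tokenizer in NLTK that can detokenize, give me offsets, and also do sentence tokenization?

A: I don't think so; try gensim or spacy.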

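【Comments】:

Is it possible to return only sentence tokens, rather than word tokens, using toktok? Would it also be faster than sent_tokenize()? I can test the latter. Thanks.
By the way, MosesTokenizer has been moved to github.com/alvations/sacremoses/tree/master/sacremoses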

【Answer 2】:

Unnecessary list creation is evil

Your code is implicitly creating a lot of potentially very long list instances that don't need to be there, for example:

words = [word.lower() for word in words]

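Using the [...] list comprehension syntax creates a list of length n for the n tokens found in the input, but all you want to do is get the frequency of each token, not actually store them: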

f[word] += 1

Therefore, you should use a generator instead:

words = (word.lower() for word in words)

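Likewise, nltk.tokenize.sent_tokenize and nltk.tokenize.word_tokenize both seem to produce lists as output, which is again unnecessary; try a lower-level function such as nltk.tokenize.api.StringTokenizer.span_tokenize, which merely produces an iterator yielding token offsets for your input, i.e. pairs of indices into the input string delimiting each token.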
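
To make the difference concrete, the sketch below counts the same tokens once through a list comprehension and once through a generator expression. The word list is synthetic, and sys.getsizeof only reports the size of the container object itself, so treat the numbers as an illustration rather than a benchmark.

```python
import sys
from collections import Counter

words = ["The", "Cat", "the", "DOG"] * 25000  # synthetic input: 100,000 tokens

lowered_list = [word.lower() for word in words]  # materializes 100,000 references
lowered_gen = (word.lower() for word in words)   # produces items one at a time

print(sys.getsizeof(lowered_list), sys.getsizeof(lowered_gen))

counts_from_list = Counter(lowered_list)
counts_from_gen = Counter(lowered_gen)  # consumes the generator exactly once
```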

A better solution

Here is an example that does not use intermediate lists:

import nltk
from collections import Counter

def freq(string):
    '''
    @param string: The string to get token counts for. Note that this should already have been normalized if you wish it to be so.
    @return: A new Counter instance representing the frequency of each token found in the input string.
    '''
    # span_tokenize yields (begin, end) index pairs instead of building a list of tokens
    spans = nltk.tokenize.WhitespaceTokenizer().span_tokenize(string)
    # Yield the relevant slice of the input string representing each individual token in the sequence
    tokens = (string[begin : end] for (begin, end) in spans)
    return Counter(tokens)

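Disclaimer: I have not profiled this, so it is possible that, for example, the NLTK developers have made word_tokenize blazingly fast but neglected span_tokenize; always profile your application to make sure.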
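
One way to follow that advice is the standard library's cProfile. A minimal sketch, assuming a whitespace-splitting freq as a stand-in for whichever variant you want to measure:

```python
import cProfile
import io
import pstats
from collections import Counter

def freq(string):
    """Stand-in implementation: count lowercased whitespace-separated tokens."""
    return Counter(token.lower() for token in string.split())

# Synthetic input: 9 words repeated 20,000 times.
text = "The quick brown fox jumps over the lazy dog " * 20000

profiler = cProfile.Profile()
profiler.enable()
result = freq(text)
profiler.disable()

# Collect the five most expensive entries by cumulative time into a string.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```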

TL;DR

Don't use lists when generators suffice: every time you create a list only to throw it away after using it once, God kills a kitten.

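【Comments】:

Actually, calling word.lower() inside the list comprehension is not good advice. The user should have lowercased the input string at the start, i.e. sent_tokenize(string.lower()). Then there is no need to call str.lower() multiple times. That said, str.lower() should be fast enough, since the lists aren't really huge =)
Regardless, creating a list of any size is unnecessary overhead; my answer has nothing to do with the performance of str.lower().
I understand, but in practice keeping the tokenized sentences is rather important for other NLP uses later on: to compute TF-IDF you need the document and sentence structure, and the same goes for running sentence-level classifiers, topic models, etc.
What you wrote has nothing to do with the OP's requirements, which seem to have been set by someone other than the OP: "I am supposed to optimise the above code further to get a faster preprocessing time, but I am unsure how to do so. The return value should obviously be exactly the same as above, so I am expected to use nltk, although that is not explicitly required." Nothing about e.g. "computing TF-IDF" is to be found there, or am I mistaken?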

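【Answer 3】:

In addition to the tokenizers mentioned above, wordpunct_tokenize also did the job for me. It works particularly well for text similarity tasks. I replaced jieba.lcut(s) with this function and got better speed with the same accuracy.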

>>> from nltk.tokenize import wordpunct_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks.'''
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

Link to the documentation.
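
For readers without NLTK at hand, the behaviour shown above can be approximated with the standard library. The sketch assumes the pattern \w+|[^\w\s]+, which is the one NLTK's WordPunctTokenizer is understood to be built on; for this input it yields the same tokens as listed above.

```python
import re

s = '''Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks.'''

# Alternating runs of word characters and runs of non-word, non-space characters.
tokens = re.findall(r"\w+|[^\w\s]+", s)
print(tokens)
```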

【Comments】: