【Question title】: NLTK Performance
【Posted】: 2012-01-28 05:47:42
【Question description】:

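OK, I've recently become very interested in natural language processing; so far, though, I've done most of my work in C. I've heard of NLTK. I don't know Python, but it seems quite easy to learn, and it looks like a really powerful and interesting language. In particular, the NLTK module seems very well suited to what I need to do.

However, when I took the sample code for NLTK and pasted it into a file called test.py, I noticed that it takes a very, very long time to run!

I'm calling it from the shell like so: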

time python ./test.py

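On a 2.4 GHz machine with 4 GB of RAM, it takes 19.187 seconds!

Now, maybe this is perfectly normal, but I was under the impression that NLTK was extremely fast. I may be mistaken, but is there anything obvious that I'm clearly doing wrong here?

【Question comments】:

Tags: python performance nlp nltk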


【Solution 1】:

@Jacob is right that training time and tagging time are being conflated. I simplified the sample code a little; here is the timing breakdown:

Importing nltk takes 0.33 secs
Training time: 11.54 secs
Tagging time: 0.0 secs
Sorting time: 0.0 secs

Total time: 11.88 secs

System:

CPU: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
Memory: 3.7GB

Code:

import pprint
import time

# time.perf_counter() is a wall-clock timer; time.clock() was removed in Python 3.8.
startstart = time.perf_counter()

start = time.perf_counter()
import nltk
print("Importing nltk takes", round(time.perf_counter() - start, 2), "secs")

# Training: build a unigram tagger from the Brown corpus (the slow step).
start = time.perf_counter()
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())
print("Training time:", round(time.perf_counter() - start, 2), "secs")


text = """Mr Blobby is a fictional character who featured on Noel
Edmonds' Saturday night entertainment show Noel's House Party,
which was often a ratings winner in the 1990s. Mr Blobby also
appeared on the Jamie Rose show of 1997. He was designed as an
outrageously over the top parody of a one-dimensional, mute novelty
character, which ironically made him distinctive, absurd and popular.
He was a large pink humanoid, covered with yellow spots, sporting a
permanent toothy grin and jiggling eyes. He communicated by saying
the word "blobby" in an electronically-altered voice, expressing
his moods through tone of voice and repetition.

There was a Mrs. Blobby, seen briefly in the video, and sold as a
doll.

However Mr Blobby actually started out as part of the 'Gotcha'
feature during the show's second series (originally called 'Gotcha
Oscars' until the threat of legal action from the Academy of Motion
Picture Arts and Sciences[citation needed]), in which celebrities
were caught out in a Candid Camera style prank. Celebrities such as
dancer Wayne Sleep and rugby union player Will Carling would be
enticed to take part in a fictitious children's programme based around
their profession. Mr Blobby would clumsily take part in the activity,
knocking over the set, causing mayhem and saying "blobby blobby
blobby", until finally when the prank was revealed, the Blobby
costume would be opened - revealing Noel inside. This was all the more
surprising for the "victim" as during rehearsals Blobby would be
played by an actor wearing only the arms and legs of the costume and
speaking in a normal manner.[citation needed]"""

# Tagging: tokenize the text and tag it with the already-trained tagger.
start = time.perf_counter()
tokenized = tokenizer.tokenize(text)
tagged = tagger.tag(tokenized)
print("Tagging time:", round(time.perf_counter() - start, 2), "secs")

# Sort the (word, tag) pairs by tag. Words the tagger has never seen get the
# tag None, which Python 3 cannot compare with str, so None sorts as "".
start = time.perf_counter()
tagged.sort(key=lambda pair: pair[1] or "")
print("Sorting time:", round(time.perf_counter() - start, 2), "secs")

#l = list(set(tagged))
#pprint.pprint(l)
print()
print("Total time:", round(time.perf_counter() - startstart, 2), "secs")
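
The repeated start/stop bookkeeping above could also be factored into a small helper. What follows is only a sketch under assumptions: `timed` and `timings` are hypothetical names that do not come from NLTK or from the original sample code, the workload is a toy stand-in so the snippet runs without NLTK, and `time.perf_counter()` reports wall-clock time.

```python
import time
from contextlib import contextmanager

# Hypothetical helper (not part of NLTK): maps a stage label to elapsed seconds.
timings = {}

@contextmanager
def timed(label):
    """Time the enclosed block, store the result in `timings`, and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start
        print("%s time: %.2f secs" % (label, timings[label]))

# Toy stand-in for the expensive training step (assumption: no NLTK needed here).
with timed("Training"):
    model = {word: len(word) for word in ("mr", "blobby", "is", "pink")}

# Toy stand-in for tagging: look each word up in the "trained" model.
with timed("Tagging"):
    tagged = [(word, model.get(word)) for word in ("mr", "blobby", "grin")]
```

Wrapping the real `nltk.UnigramTagger(...)` call and the `tagger.tag(...)` call in separate `with timed(...)` blocks would give the same kind of breakdown as the listing above.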

【Comments】:

  • Nice to have hard numbers, plus the code to reproduce them!
【Solution 2】:

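I believe you are conflating training time with processing time. Training a model, such as a UnigramTagger, can take a lot of time. So a trained model can instead be loaded from a pickle file on disk. Once the model is loaded into memory, processing can be quite fast. See the "Classifier Efficiency" section at the bottom of my post on part of speech tagging with NLTK for an idea of the processing speed of different tagging algorithms.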
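
As a rough illustration of caching a trained model in a pickle file, here is a minimal sketch. The names `load_or_train`, `train_toy_model`, and `tagger.pickle` are hypothetical, and a plain dictionary stands in for a trained tagger so the snippet runs without NLTK or any corpus.

```python
import os
import pickle
import tempfile

def load_or_train(cache_path, train):
    """Return the model pickled at cache_path; train and save it first if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    model = train()
    with open(cache_path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)
    return model

# Counts how often "training" actually runs.
calls = []

def train_toy_model():
    """Stand-in for an expensive training step (assumption: not a real tagger)."""
    calls.append(1)
    return {"the": "AT", "dog": "NN"}

cache_path = os.path.join(tempfile.mkdtemp(), "tagger.pickle")
first = load_or_train(cache_path, train_toy_model)   # trains, then writes the pickle
second = load_or_train(cache_path, train_toy_model)  # reads the pickle, no retraining
print("training runs:", len(calls))
```

With NLTK installed and the Brown corpus downloaded, the `train` argument could be something like `lambda: nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())`. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.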

【Comments】:
