Trigrams and bigrams in nltk

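When processing English text, we often need to extract phrases from it for topic extraction, especially in domain-specific documents such as financial annual reports. The main task is extracting bigrams (sequences of two adjacent words) and trigrams (sequences of three adjacent words). How is that done? That is what this article covers.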

1. nltk.bigrams(tokens) and nltk.trigrams(tokens)

If all you need is to enumerate every bigram or trigram, you can call nltk's bigrams() or trigrams() functions directly, as the following code shows:

    >>> import nltk
    >>> text = 'you are my sunshine, and all of things are so beautiful just for you.'
    >>> tokens = nltk.wordpunct_tokenize(text)
    >>> bigram = nltk.bigrams(tokens)
    >>> bigram
    <generator object bigrams at 0x025C1C10>
    >>> list(bigram)
    [('you', 'are'), ('are', 'my'), ('my', 'sunshine'), ('sunshine', ','), (',', 'and'), ('and', 'all'), ('all', 'of'), ('of', 'things'), ('things', 'are'), ('are', 'so'), ('so', 'beautiful'), ('beautiful', 'just'), ('just', 'for'), ('for', 'you'), ('you', '.')]
    >>> trigram = nltk.trigrams(tokens)
    >>> list(trigram)
    [('you', 'are', 'my'), ('are', 'my', 'sunshine'), ('my', 'sunshine', ','), ('sunshine', ',', 'and'), (',', 'and', 'all'), ('and', 'all', 'of'), ('all', 'of', 'things'), ('of', 'things', 'are'), ('things', 'are', 'so'), ('are', 'so', 'beautiful'), ('so', 'beautiful', 'just'), ('beautiful', 'just', 'for'), ('just', 'for', 'you'), ('for', 'you', '.')]
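
To make clear what the two functions enumerate, here is a minimal pure-Python sketch of the same sliding-window idea. It is an illustration only, not nltk's implementation: sliding_ngrams is a hypothetical helper name, and the regular expression is a rough stand-in for nltk.wordpunct_tokenize so that the example runs without nltk installed.

```python
import re
from itertools import islice


def sliding_ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens.

    Hypothetical helper for illustration; it is not an nltk function.
    """
    tokens = list(tokens)
    # Build n views of the token list, each shifted one position further,
    # and zip them together. zip stops when the shortest view runs out.
    return zip(*(islice(tokens, i, None) for i in range(n)))


text = 'you are my sunshine, and all of things are so beautiful just for you.'
# Assumption: split into runs of word characters or runs of punctuation,
# approximating what nltk.wordpunct_tokenize does.
tokens = re.findall(r"\w+|[^\w\s]+", text)

bigrams = list(sliding_ngrams(tokens, 2))
trigrams = list(sliding_ngrams(tokens, 3))
print(bigrams[:3])
print(trigrams[-1])

# Like the generator in the session above, the iterator can be consumed
# only once: a second pass over the same object yields nothing.
pairs = sliding_ngrams(tokens, 2)
first_pass = list(pairs)
second_pass = list(pairs)
print(len(first_pass), len(second_pass))
```

A sequence of k tokens produces k - 1 bigrams and k - 2 trigrams. Because the result is a one-shot iterator, wrap it in list() if you need to walk over it more than once.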
