Python concordance command in NLTK

【Question title】: Python concordance command in NLTK
【Posted】: 2015-03-17 22:31:22
【Question】:

I have a question about the Python concordance command in NLTK. I started with a simple example:

from nltk.book import *

text1.concordance("monstrous")

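That works well. Now I have my own .txt file and want to run the same command on it. I have a list called "textList" and want to find the word "CNA", so I entered the command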

textList.concordance('CNA')

However, I got the error

AttributeError: 'list' object has no attribute 'concordance'.

Isn't text1 a list in the example? I would like to know what is going on here.

【Comments】:

Tags: python nlp nltk

【Answer 1】:

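.concordance() is a special nltk function, so you can't call it on just any Python object (such as your list).

More specifically: .concordance() is a method of nltk's Text class.

Basically, if you want to use .concordance(), you first have to instantiate a Text object and then call the method on that object.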
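To make the difference concrete, here is a minimal sketch contrasting a plain list with a Text object. It assumes NLTK is installed; the token list is made-up sample data standing in for the asker's textList, and the names text_list and text_nltk are hypothetical.

```python
from nltk.text import Text

# Hypothetical token list standing in for the asker's textList.
text_list = "The CNA report said the cna was fine".split()

# A plain Python list does not provide a concordance() method.
print(hasattr(text_list, "concordance"))

# Wrapping the same tokens in Text makes the method available.
text_nltk = Text(text_list)
print(hasattr(text_nltk, "concordance"))
text_nltk.concordance("CNA")
```

This is also why text1 works in the book example: it is a Text object, not a list, even though it can be indexed and measured with len() like one.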

Text

A Text is typically initialized from a given document or corpus. For example:
import nltk.corpus
from nltk.text import Text
moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

.concordance()

concordance(word, width=79, lines=25)

Prints a concordance for word with the specified context window. Word matching is not case-sensitive.
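To illustrate what a concordance is, here is a rough pure-Python sketch of the idea. It is not NLTK's implementation: simple_concordance is a hypothetical helper, and its window parameter counts tokens on each side, whereas NLTK's width is measured in characters.

```python
def simple_concordance(tokens, word, window=3):
    """Return (left, hit, right) for each case-insensitive match of word."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive comparison
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Made-up sample data for illustration.
tokens = "The CNA report said the cna was fine".split()
for left, hit, right in simple_concordance(tokens, "CNA", window=2):
    print(f"{left:>20} {hit} {right}")
```

Each match is shown with its neighbouring words, which is the same kind of keyword-in-context view that .concordance() prints.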

So I think something like this would work (untested):

import nltk.corpus
from nltk.text import Text
textList = Text(nltk.corpus.gutenberg.words('YOUR FILE NAME HERE.txt'))
textList.concordance('CNA')

【Comments】:

Thanks, I have it working now. Actually, all I needed was textListNLTK = Text(textList) followed by textListNLTK.concordance('CNA').

【Answer 2】:

I got it working with this code:

import sys
from nltk.tokenize import word_tokenize
from nltk.text import Text

def main():
    # Expect the path to a text file as the first argument.
    if len(sys.argv) < 2:
        return
    # read text
    with open(sys.argv[1], "r") as f:
        text = f.read()
    tokens = word_tokenize(text)
    textList = Text(tokens)
    textList.concordance('is')
    print(tokens)


if __name__ == '__main__':
    main()

Based on this site
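One practical caveat with this approach: word_tokenize depends on tokenizer data that has to be downloaded separately, and NLTK raises a LookupError when it is missing. The sketch below assumes NLTK is installed and falls back to plain whitespace splitting in that case; the tokenize helper and the sample sentence are hypothetical, and whitespace splitting is a cruder substitute, not an equivalent tokenizer.

```python
from nltk.text import Text
from nltk.tokenize import word_tokenize


def tokenize(raw):
    """Tokenize with word_tokenize, falling back to whitespace splitting."""
    try:
        return word_tokenize(raw)
    except LookupError:
        # Raised when the tokenizer data has not been downloaded.
        return raw.split()


# Made-up sample text for illustration.
tokens = tokenize("CNA is an acronym. What CNA means depends on context.")
Text(tokens).concordance("CNA")
```

With the fallback, punctuation stays attached to words (for example "acronym."), so matches on words next to punctuation may be missed.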

【Comments】:

【Answer 3】:

In a Jupyter notebook (or a Google Colab notebook), the full process is: MS Word file --> text file --> an NLTK object:

from nltk.tokenize import word_tokenize
from nltk.text import Text

import docx2txt

# Extract the plain text from the Word file.
myTextFile = docx2txt.process("/mypath/myWordFile")
tokens = word_tokenize(myTextFile)
print(tokens)
# Wrap the tokens in an NLTK Text object and search it.
textList = Text(tokens)
textList.concordance('contract')

【Comments】: