Counting word frequency and writing in an output file (answers)

【Question title】: Counting word frequency and writing in an output file
【Posted】: 2016-05-24 21:07:02
【Question】:

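From file_test.txt, I need to count how many times each word occurs in the file using the nltk.FreqDist() function. While counting, I need to check whether the word is in pos_dict.txt and, if it is, multiply its frequency by the number listed for that word in pos_dict.txt.

file_test.txt looks like this: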

  abandon, abandon, calm, clear

For these words, pos_dict.txt looks like this:

"abandon":2,"calm":2,"clear":1,...

My code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import nltk

f_input_pos=open('file_test.txt','r').read()

def features_pos(dat):
    tokens = nltk.word_tokenize(dat)
    fdist=nltk.FreqDist(tokens)

    f_pos_dict=open('pos_dict.txt','r').read()
    f=f_pos_dict.split(',') 

    for part in f:
        b=part.split(':')
        c=b[-1]   #to catch the number
        T2 = eval(str(c).replace("'","")) # convert number from string to int

        for word in fdist:
            if word in f_pos_dict:
               d=fdist[word]
               print(word,'->',d*T2)


features_pos(f_input_pos)

So my output needs to look like this:

abandon->4
calm->2
clear->1

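But my output duplicates every line, and the multiplication is clearly wrong. I'm a bit stuck and can't see where the error is; I am probably using the for loops incorrectly. I would appreciate it if someone could help :)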

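【Question discussion】:

What does your input file look like? Can you post a link to, or a sample of, file_test.txt and pos_dict.txt?
My input file file_test.txt looks exactly as written in the question; pos_dict.txt contains other words as well, but they are not important for understanding the problem.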

Tags: python-3.x nltk

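【Solution 1】:

First, here is a quick way to read the pos_dict.txt file: treat its contents as the string representation of a dictionary and evaluate it (eval executes whatever the file contains, so only use it on files you trust):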

alvas@ubi:~$ echo '"abandon":2,"calm":2,"clear":1' > pos_dict.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> with io.open('pos_dict.txt', 'r') as fin:
...     pos_dict = eval("{" + fin.read() + "}")
... 
>>>
>>> pos_dict['abandon']
2
>>> pos_dict['clear']
1
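As a hedged alternative to eval, the same "wrap the text in braces" trick can be sketched in Python 3 with ast.literal_eval, which only accepts literals and so cannot run arbitrary code. The helper name parse_pos_dict is an assumption made for illustration, and the sample string stands in for the contents of pos_dict.txt.

```python
import ast


def parse_pos_dict(text):
    """Parse text such as '"abandon":2,"calm":2' into a dict without eval().

    parse_pos_dict is a hypothetical helper, not part of the original answer.
    """
    # Wrapping the text in braces turns it into a dict literal.
    # ast.literal_eval raises an error on anything that is not a plain literal.
    return ast.literal_eval("{" + text.strip() + "}")


# Stand-in for the contents of pos_dict.txt (assumed sample data).
pos_dict = parse_pos_dict('"abandon":2,"calm":2,"clear":1\n')
print(pos_dict["abandon"])  # 2
print(pos_dict["clear"])    # 1
```

With a real file, the argument would be the result of reading pos_dict.txt, exactly as in the session above.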

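Next, to read your file_test.txt, we read the file, strip the leading and trailing whitespace, and then split the words on ", " (a comma followed by a space).

Then, using a collections.Counter object, we can easily get the token counts (see also Difference between Python's collections.Counter and nltk.probability.FreqDist):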

alvas@ubi:~$ echo 'abandon, abandon, calm, clear' > file_test.txt
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
...     tokens = fin.read().strip().split(', ')
... 
>>> Counter(tokens)
Counter({u'abandon': 2, u'clear': 1, u'calm': 1})
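Splitting on ", " only works when every comma is followed by exactly one space. A slightly more tolerant Python 3 sketch, assuming the file may contain irregular spacing, splits on the comma alone and strips each piece. The helper name count_words and the sample string are illustrative, not from the original answer.

```python
from collections import Counter


def count_words(text):
    """Count comma-separated words, ignoring surrounding whitespace.

    count_words is a hypothetical helper used only for this sketch.
    """
    # Split on commas, then strip spaces and newlines around each piece.
    tokens = [piece.strip() for piece in text.split(",")]
    # Skip empty pieces, e.g. those produced by a trailing comma.
    return Counter(token for token in tokens if token)


# Stand-in for the contents of file_test.txt, with uneven spacing.
word_counts = count_words("abandon, abandon,calm ,  clear\n")
print(word_counts["abandon"])  # 2
print(word_counts["calm"])     # 1
```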

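To take the token counts from file_test.txt and multiply them by the values from pos_dict.txt, we iterate over the Counter object with the .items() method (just as we would access the key-value pairs of a dictionary):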

>>> import io
>>> from collections import Counter
>>> with io.open('file_test.txt', 'r') as fin:
...     tokens = fin.read().strip().split(', ')
... 
>>> 
>>> word_counts = Counter(tokens)
>>> with io.open('pos_dict.txt', 'r') as fin:
...     pos_dict = eval("{" + fin.read() + "}")
... 
>>>
>>> token_times_posdict = {word:freq*pos_dict[word] for word, freq in word_counts.items()}
>>> token_times_posdict
{u'abandon': 4, u'clear': 1, u'calm': 2}
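The comprehension above raises a KeyError as soon as file_test.txt contains a word that is missing from pos_dict.txt, whereas the question only asks to multiply words that are in the dictionary. A minimal Python 3 sketch of that filter follows; the function name weighted_counts and the extra token "unknown" are assumptions for illustration.

```python
from collections import Counter


def weighted_counts(tokens, pos_dict):
    """Multiply each token count by its pos_dict value, skipping unknown words.

    weighted_counts is a hypothetical helper, not part of the original answer.
    """
    counts = Counter(tokens)
    return {
        word: freq * pos_dict[word]
        for word, freq in counts.items()
        if word in pos_dict  # exact key lookup, not a substring test
    }


tokens = ["abandon", "abandon", "calm", "clear", "unknown"]
pos_dict = {"abandon": 2, "calm": 2, "clear": 1}
print(weighted_counts(tokens, pos_dict))
# {'abandon': 4, 'calm': 2, 'clear': 1}  (insertion order, Python 3.7+)
```

Testing membership against the parsed dictionary also avoids a pitfall in the question's code, where "word in f_pos_dict" checks for a substring of the raw file text.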

Then print the result:

>>> for word, value in token_times_posdict.items():
...     print "{} -> {}".format(word, value)
... 
abandon -> 4
clear -> 1
calm -> 2
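The session above uses the Python 2 print statement and stops at printing, while the question is tagged python-3.x and its title asks about writing to an output file. A minimal Python 3 sketch of that last step might look like the following; the function name write_weighted_counts and the file name output.txt are assumptions, since the thread never names an output file.

```python
import os
import tempfile


def write_weighted_counts(weighted, path):
    """Write one 'word -> value' line per entry, sorted alphabetically by word.

    write_weighted_counts is a hypothetical helper used only for this sketch.
    """
    with open(path, "w", encoding="utf-8") as fout:
        for word, value in sorted(weighted.items()):
            fout.write("{} -> {}\n".format(word, value))


# Hypothetical output location; replace with whatever path you need.
out_path = os.path.join(tempfile.gettempdir(), "output.txt")
write_weighted_counts({"abandon": 4, "clear": 1, "calm": 2}, out_path)

# Read the file back to show what was written.
with open(out_path, encoding="utf-8") as fin:
    print(fin.read(), end="")
```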

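【Discussion】:

I think I have solved your "homework", but please make sure you understand the code rather than just copying and pasting it.
By the way, dict.items() does not sort the keys by their values. Take a look at Counter's methods and you will find one that does what you are looking for (hint: from collections import Counter; dir(Counter) =)
Thanks for your effort, this really is a good solution. I understand the code. Since I am new to language processing and nltk, I did not know about reading a string as a dictionary, but now it is clear. Thanks
I'm glad the answer helped. Have fun with Python and NLTK =)
Maybe this helps you understand Python's containers a bit: github.com/usaarhat/pywarmups/blob/master/session2.md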
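The comment does not name the Counter method it hints at; one plausible candidate is Counter.most_common(), which returns (key, count) pairs sorted by count in descending order. A small Python 3 sketch under that assumption, with the weighted values from the answer wrapped in a Counter:

```python
from collections import Counter

# Weighted values from the answer, wrapped in a Counter so that
# most_common() is available (a plain dict does not have it).
weighted = Counter({"abandon": 4, "clear": 1, "calm": 2})

# most_common() yields pairs from the highest value to the lowest.
for word, value in weighted.most_common():
    print("{} -> {}".format(word, value))
# abandon -> 4
# calm -> 2
# clear -> 1
```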