Count frequency of specific words in several articles in a text file: answers

【Question title】: Count frequency of specific words in several articles in a text file
【Posted】: 2017-03-29 11:08:40
【Question】:

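I would like to count the occurrences of a list of words in every article contained in a single text file. Each article can be identified because they all start with a common tag, "<p>Advertisement".

This is a sample of the text file: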

"[<p>Advertisement ,   By   TIM ARANGO  ,     SABRINA TAVERNISE   and     CEYLAN YEGINSU    JUNE 28, 2016 
 ,Credit Ilhas News Agency, via Agence France-Presse — Getty Images,ISTANBUL ......]
[<p>Advertisement ,   By  MILAN SCHREUER  and     ALISSA J. RUBIN    OCT. 5, 2016 
 ,  BRUSSELS — A man wounded two police officers with a knife in Brussels around noon 
on Wednesday in what the authorities called “a potential terrorist attack.” ,  
The two ......]"

What I would like to do is count the frequency of each word listed in a CSV file (20 words) and write the output like this:

  id, attack, war, terrorism, people, killed, said 
  article_1, 45, 5, 4, 6, 2,1
  article_2, 10, 3, 2, 1, 0,0

The words in the CSV file are stored like this:

attack
people
killed
attacks
state
islamic

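As suggested, I first tried to split the whole text file on the tag <p> before starting to count the words. Then I tokenized the resulting list.

This is what I have so far: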

opener = open("News_words_most_common.csv")
words = opener.read()
my_pattern = ('\w+')
x = re.findall(my_pattern, words)

file_open = open("Training_News_6.csv")
files = file_open.read()
r = files.lower()
stops = set(stopwords.words("english"))
words = r.split("<p>")
token= word_tokenize(words)
string = str(words)
token= word_tokenize(string)
print(token)

This is the output:

['[', "'", "''", '|', '[', "'", ',', "'advertisement", 
',', 'by', 'milan', 'schreuer'.....']', '|', "''", '\\n', "'", ']']

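The next step would be to loop over the split articles (now turned into lists of tokenized words) and count the frequency of the words from the first file. If you have any advice on how to iterate and count, please let me know!

I am using Python 3.5 on Anaconda.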

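【Question comments】:

Related: stackoverflow.com/a/14921469/4063051
Yes, it is related. I know how to use the Counter module; I already used it to create the word list. The main point is to count the frequency of the words from that list in each article contained in my single text file.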

Tags: python python-3.x counter word-frequency

【Solution 1】:

You can try using pandas and sklearn:

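import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# One vocabulary word per line.
with open('vocabulary.txt') as f:
    vocabulary = [word.strip() for word in f]
# Splitting on the tag leaves whatever precedes the first tag as element 0.
with open('articles.txt') as f:
    corpus = f.read().split('<p>Advertisement')

# With a fixed vocabulary, only those words are counted (one column each).
vectorizer = CountVectorizer(min_df=1, vocabulary=vocabulary)
words_matrix = vectorizer.fit_transform(corpus)
df = pd.DataFrame(data=words_matrix.toarray(),
                  index=['article_%s' % i for i in range(words_matrix.shape[0])],
                  # get_feature_names() was removed in scikit-learn 1.2;
                  # get_feature_names_out() requires scikit-learn 1.0 or later.
                  columns=vectorizer.get_feature_names_out())
df.index.name = 'id'
df.to_csv('articles.csv')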

The resulting file articles.csv:

$ cat articles.csv
id,attack,people,killed,attacks,state,islamic
article_0,0,0,0,0,0,0
article_1,0,0,0,0,0,0
article_2,1,0,0,0,0,0

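【Comments】:

I get an error: NameError: name 'data' is not defined
I corrected the error and it seems to work. Thanks again!
Oh, sorry: while cleaning up the code I renamed a variable in one place but forgot to rename it in another. If you have any questions, feel free to ask.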

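【Solution 2】:

You could read your text file and then split it at '<p>' (if, as you say, that tag marks the start of a new article), which gives you a list of articles. A simple loop with a count will then do the job.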
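
A minimal sketch of the split-and-count approach just described, assuming the articles are separated by the `<p>Advertisement` tag from the question. The helper name `count_keywords` and the inline sample text are illustrative only and do not come from the original post.

```python
import re
from collections import Counter


def count_keywords(text, keywords, tag="<p>Advertisement"):
    """Return one {keyword: count} dict per article found in text."""
    # Everything before the first tag is not an article, so drop it.
    articles = text.split(tag)[1:]
    rows = []
    for article in articles:
        # Lowercase, tokenize on runs of letters, then count every token.
        counts = Counter(re.findall(r"[a-z]+", article.lower()))
        # A Counter returns 0 for missing keys, so every keyword gets a value.
        rows.append({word: counts[word] for word in keywords})
    return rows


# Hypothetical sample data standing in for the real files.
sample = (
    "[<p>Advertisement , ISTANBUL: the attack killed people , people said]"
    "[<p>Advertisement , BRUSSELS: a potential terrorist attack]"
)
keywords = ["attack", "people", "killed"]

rows = count_keywords(sample, keywords)
for i, row in enumerate(rows, start=1):
    print("article_%d" % i, [row[w] for w in keywords])
```

Note that this matches whole tokens only, so "attacks" is counted separately from "attack", which fits a word list that contains both forms.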

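I would also suggest taking a look at the nltk module. I am not sure what your end goal is, but nltk makes it easy to do this kind of thing and more. For example, instead of just looking at how many times a word occurs in each article, you can compute frequencies, or even scale them by the inverse document frequency (known as tf-idf).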
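
To make the tf-idf idea concrete, here is a plain-Python sketch; it deliberately does not use nltk's own API. It assumes one common definition, tf-idf = (count / article length) × log(N / number of articles containing the word). Libraries apply different smoothing, so their exact values will differ. The function `tf_idf` and the toy articles are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_articles, word):
    """Return the tf-idf score of word for each tokenized article."""
    n_articles = len(tokenized_articles)
    # Document frequency: number of articles that contain the word at all.
    df = sum(1 for tokens in tokenized_articles if word in tokens)
    if df == 0:
        return [0.0] * n_articles
    idf = math.log(n_articles / df)
    scores = []
    for tokens in tokenized_articles:
        # Term frequency: share of the article's tokens equal to the word.
        tf = Counter(tokens)[word] / len(tokens) if tokens else 0.0
        scores.append(tf * idf)
    return scores


# Toy data (assumption): two already-tokenized articles.
articles = [
    ["attack", "in", "brussels", "attack"],
    ["people", "said"],
]
print(tf_idf(articles, "attack"))   # non-zero only for the first article
print(tf_idf(articles, "missing"))  # word absent from every article
```

A word that appears in every article gets an idf of log(1) = 0 under this definition, which is the intended effect: words common to all articles carry no distinguishing weight.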

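【Comments】:

I edited my question following your suggestion. Yes, I already used the nltk tf function in the first part of the task, but I did not use tf-idf for the problem above (splitting the text into separate articles). However, I do not know whether I am using split correctly.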

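【Solution 3】:

Maybe I did not understand the task well...

If you are doing text classification, it may be convenient to use a standard scikit vectorizer, for example Bag of Words, which takes text and returns an array of word counts. You can use it directly in a classifier, or write it out to CSV if you really need CSV. It is already included in scikit and Anaconda.

Another way is to split manually. You can load the data, split it into words, count them, exclude the stop words (which ones are they?) and write the result to an output file. Something like: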

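    import re
    from collections import Counter

    # Placeholder stop-word set (an assumption); nltk's stopwords.words("english") also works.
    stopwords = {'the', 'a', 'an', 'and', 'of', 'in', 'to'}

    with open('file.txt', 'r') as f:
        txt = f.read()
    words = re.findall('[a-z]+', txt.lower())  # lowercase so stop words match regardless of case
    cnt = Counter(w for w in words if w not in stopwords)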

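【Comments】:

First of all, thank you for your help. The task is actually already defined. I already have the most common words (across the whole document file) in another CSV file. What I need to do now is count the frequency of these words (20 words) in each article. In my case, all the articles are stored in a single CSV file.
The final output should be Article_1, 3, 4, 45, 32, etc. The numbers are the frequencies of each word (from the CSV file) in each article.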
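
For the output format described in this comment, a minimal sketch using the standard `csv` module could look like the following. The function `write_counts` and the sample counts are hypothetical; the per-article counts are assumed to have been computed already.

```python
import csv
import io


def write_counts(rows, keywords, out):
    """Write one CSV line per article: an id followed by one count per keyword."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id"] + list(keywords))
    for i, row in enumerate(rows, start=1):
        # Keywords missing from an article's dict are written as 0.
        writer.writerow(["Article_%d" % i] + [row.get(w, 0) for w in keywords])


# Hypothetical per-article counts; real ones would come from the counting step.
keywords = ["attack", "war"]
rows = [{"attack": 45, "war": 5}, {"attack": 10}]

buffer = io.StringIO()  # swap for open("articles.csv", "w", newline="") to write a file
write_counts(rows, keywords, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; passing a real file object instead produces the same rows on disk.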

【Solution 4】:

How about this:

import re
from collections import Counter
csv_data = [["'", "\\n", ","], ['fox'],
            ['the', 'fox', 'jumped'],
            ['over', 'the', 'fence'],
            ['fox'], ['fence']]
key_words = ['over', 'fox']
words_list = []

for i in csv_data:
    for j in i:
        line_of_words = ",".join(re.findall("[a-zA-Z]+", j))
        words_list.append(line_of_words)
word_count = Counter(words_list)

# Keep only the counts for the key words.
match_dict = {}
for aword, word_freq in word_count.items():
    if aword in key_words:
        match_dict[aword] = word_freq

Result:

print('Article words: ', words_list)
print('Article Word Count: ', word_count)
print('Matches: ', match_dict)

Article words:  ['', 'n', '', 'fox', 'the', 'fox', 'jumped', 'over', 'the', 'fence', 'fox', 'fence']
Article Word Count:  Counter({'fox': 3, '': 2, 'the': 2, 'fence': 2, 'n': 1, 'over': 1, 'jumped': 1})
Matches:  {'over': 1, 'fox': 3}

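【Comments】:

Thanks for your suggestion. The main problem is that the words should be counted within each nested list, so that I can compute the frequencies for the first article, the second article, and so on (separately). Your code counts the frequencies across all articles at once.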
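Since the articles are separated by "<p>", you can first loop over the nested CSV data, adding every element to a list until you hit a "<p>", at which point you start a new list and begin adding elements to that one, and so on. Then you can run the method above on each list.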
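
The grouping step suggested in this last comment might be sketched as follows. It assumes the marker is the literal token `<p>` and that the nested CSV rows have already been flattened into a single token stream; `split_on_marker` is a hypothetical helper name.

```python
from collections import Counter


def split_on_marker(tokens, marker="<p>"):
    """Group a flat token stream into one list of tokens per article."""
    articles = []
    current = None  # stays None until the first marker is seen
    for token in tokens:
        if token == marker:
            # A marker starts a new article list.
            current = []
            articles.append(current)
        elif current is not None:
            current.append(token)
    return articles


# Hypothetical token stream; the real one would come from the CSV data.
tokens = ["<p>", "the", "fox", "jumped", "<p>", "over", "the", "fence", "fox"]
key_words = ["over", "fox"]

for i, article in enumerate(split_on_marker(tokens), start=1):
    counts = Counter(article)
    print("article_%d" % i, {w: counts[w] for w in key_words})
```

Tokens that appear before the first marker are ignored here, on the assumption that they do not belong to any article.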