How to count words and their associated groups? Answers

【Question title】: How to count words and their associated groups?
【Posted】: 2019-09-17 16:52:14
【Question description】:

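I want to count how many times a particular topic occurs in a very long list of words. Currently, I have a dictionary of dictionaries in which the outer keys are topics and the inner keys are the keywords for that topic.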

I am trying to count keyword occurrences efficiently while keeping a running total of occurrences for each keyword's topic.

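Ultimately, I want to save the output for multiple texts. Below is an example of my current implementation. The problems are that it is very slow and that it does not store the keyword counts in the output DataFrame. Is there an alternative that solves these problems?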

import pandas as pd

topics = {
    "mathematics": {
        "analysis": 0,
        "algebra": 0,
        "logic": 0
    },
    "philosophy": {
        "ethics": 0,
        "metaphysics": 0,
        "epistemology": 0
    }
}

texts = {
    "text_a": [
        "the", "major", "areas", "of", "study", "in", "mathematics", "are",
        "analysis", "algebra", "and", "logic", "in", "philosophy", "they",
        "are", "ethics", "metaphysics", "and", "epistemology"
    ],
    "text_b": [
        "logic", "is", "studied", "both", "in", "mathematics", "and",
        "philosophy"
    ]
}

topics_by_text = pd.DataFrame()
for title, text in texts.items():
    topic_count = {}
    for topic, sub_dict in topics.items():
        curr_topic_counter = 0
        for keyword, count in sub_dict.items():
            keyword_occurrences = text.count(keyword)
            topics[topic][keyword] = keyword_occurrences
            curr_topic_counter += keyword_occurrences
        topic_count[topic] = curr_topic_counter
    topics_by_text[title] = pd.Series(topic_count)


print(topics_by_text)

【Comments】:

Tags: python performance loops dictionary counter

【Solution 1】:

I'm not sure about the speed, but the following code stores the keyword counts in a tidy MultiIndexed form.

# Returns a dictionary of keyword counts, plus their total under 'count'
def CountFrequency(my_list, keyword):
    # Tally every word in the text
    freq = {}
    for item in my_list:
        if item in freq:
            freq[item] += 1
        else:
            freq[item] = 1

    # Keep only this topic's keywords; keywords absent from the text count as 0
    dict_ = {}
    for your_key in keyword:
        dict_[your_key] = freq.get(your_key, 0)

    # Topic total across all of its keywords
    dict_['count'] = sum(dict_.values())
    return dict_

# Calculates counts for every (topic, text) pair
output = {}
for key, value in texts.items():
    for topic, keywords in topics.items():
        output.setdefault(topic, {})[key] = CountFrequency(value, keywords)

# To DataFrame: rows are texts, columns are (topic, keyword) pairs
dict_of_df = {k: pd.DataFrame(v) for k, v in output.items()}
df = pd.concat(dict_of_df, axis=0)
print(df.T)
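
On the speed concern from the question: calling `text.count(keyword)` rescans the whole word list once per keyword. One possible alternative, offered only as a sketch, is to tally each text once with `collections.Counter` and then look the keywords up in that tally. The names `count_keywords`, `keyword_counts` and `topic_totals` are hypothetical and do not appear in the original post, and the approach has not been benchmarked against the code above.

```python
from collections import Counter

import pandas as pd

# Same data as in the question
topics = {
    "mathematics": {"analysis": 0, "algebra": 0, "logic": 0},
    "philosophy": {"ethics": 0, "metaphysics": 0, "epistemology": 0},
}

texts = {
    "text_a": [
        "the", "major", "areas", "of", "study", "in", "mathematics", "are",
        "analysis", "algebra", "and", "logic", "in", "philosophy", "they",
        "are", "ethics", "metaphysics", "and", "epistemology",
    ],
    "text_b": [
        "logic", "is", "studied", "both", "in", "mathematics", "and",
        "philosophy",
    ],
}


def count_keywords(texts, topics):
    """Hypothetical helper: one row per (topic, keyword), one column per text."""
    # Flatten the nested dictionary into (topic, keyword) pairs
    pairs = [(topic, kw) for topic, kws in topics.items() for kw in kws]
    index = pd.MultiIndex.from_tuples(pairs, names=["topic", "keyword"])

    data = {}
    for title, words in texts.items():
        word_counts = Counter(words)  # single pass over the text
        # A Counter returns 0 for words it has never seen
        data[title] = [word_counts[kw] for _, kw in pairs]
    return pd.DataFrame(data, index=index)


keyword_counts = count_keywords(texts, topics)

# Per-topic totals: sum the keyword rows that share a topic
topic_totals = keyword_counts.groupby(level="topic").sum()

print(keyword_counts)
print(topic_totals)
```

Here `keyword_counts` keeps the per-keyword counts that the question's DataFrame was missing, and `topic_totals` has the same shape as the question's `topics_by_text`.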

【Discussion】: