Dynamically count occurrences of multiple words within lists: answers

【Question title】: Dynamically count occurrences of multiple words within lists
【Posted】: 2021-02-09 21:48:15
【Question】:

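I am trying to count the occurrences of multiple keywords within each phrase of a dataframe. This looks similar to other questions but is not quite the same.

Here we have a df and a list of lists containing keywords/topics: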

df=pd.DataFrame({'phrases':['very expensive meal near city center','very good meal and waiters','nice restaurant near center and public transport']})

topics=[['expensive','city'],['good','waiters'],['center','transport']]

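For each phrase, we want to count how many of its words match each individual topic. So the first phrase should score 2 for the first topic, 0 for the second topic, 1 for the third topic, and so on.

I tried this, but it does not work: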

from collections import Counter
topnum=0
for t in topics:
  counts=[]
  topnum+=1
  results = Counter()
  for line in df['phrases']:
    for c in line.split(' '):
      results[c] = t.count(c)
    counts.append(sum(results.values()))
  df['topic_'+str(topnum)] = counts

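I am not sure what I am doing wrong. Ideally I would end up with a count of matching words for each topic/phrase combination, but instead the counts seem to repeat: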

phrases                                            topic_1  topic_2     topic_3
very expensive meal near city centre              2             0           0
very good meal and waiters                        2             2           0
nice restaurant near center and public transport  2             2           2

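Many thanks to anyone who can help. Best wishes

【Comments on the question】:

Always provide a complete minimal reproducible example with code, data, errors, current output, and expected output, as formatted text. If relevant, only plot images are okay. Please see How to ask a good question. Provide data with How to provide a reproducible copy of your DataFrame using df.head(15).to_clipboard(sep=','), then edit your question and paste the clipboard into a code block.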

Tags: python pandas nlp token

【Solution 1】:

Here is a solution that defines a helper function named find_count and applies it to the dataframe through a lambda, once per topic.

import pandas as pd
df=pd.DataFrame({'phrases':['very expensive meal near city center','very good meal and waiters','nice restaurant near center and public transport']})
topics=[['expensive','city'],['good','waiters'],['center','transport']]

def find_count(row, topics_index):
    count = 0
    word_list = row['phrases'].split()
    for word in word_list:
        if word in topics[topics_index]:
            count+=1
    return count

df['Topic 1'] = df.apply(lambda row:find_count(row,0), axis=1)
df['Topic 2'] = df.apply(lambda row:find_count(row,1), axis=1)
df['Topic 3'] = df.apply(lambda row:find_count(row,2), axis=1)

print(df)

#Output
                                            phrases  Topic 1  Topic 2  Topic 3
0              very expensive meal near city center        2        0        1
1                        very good meal and waiters        0        2        0
2  nice restaurant near center and public transport        0        0        2
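
As a side note on the attempt in the question: judging from the output shown there, the Counter appears to be created once per topic rather than once per phrase, so matches found in earlier phrases stay in it and are summed again for every later phrase. A minimal sketch of that same loop with the count computed from scratch for each phrase, assuming the df and topics defined above (the use of enumerate is an illustrative choice, not part of the original code):

```python
import pandas as pd

df = pd.DataFrame({'phrases': ['very expensive meal near city center',
                               'very good meal and waiters',
                               'nice restaurant near center and public transport']})
topics = [['expensive', 'city'], ['good', 'waiters'], ['center', 'transport']]

for topnum, t in enumerate(topics, start=1):
    counts = []
    for line in df['phrases']:
        # Start from zero for every phrase so nothing carries over
        # from the phrases processed before it.
        counts.append(sum(t.count(c) for c in line.split(' ')))
    df['topic_' + str(topnum)] = counts

print(df)
```

With the sample data this should give the same numbers as the find_count version, only under the column names used in the question.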

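【Comments】:

Sorry, when I run the code I don't actually get the same output as you for the first phrase, topic 3; I don't get 1, which is the correct value..
Make sure you copy and paste my code exactly. The output I printed is the output of this code. Did you change it in some way?
Ah yes, sorry, it is working now, I must have done something wrong! I think I was trying to run it in a loop, because I actually have thousands of phrases and 60 topics..
All good now, I added:
    topic_n=0
    for each in topics:
        topic_n+=1
        df['topic'+str(topic_n)]=df.apply(lambda row:find_count(row,topic_n -1),axis=1)
    df.head()
Good. You had me worried for a second. Haha.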
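
For the larger case mentioned in the comments (thousands of phrases and about 60 topics), a variant that avoids one row-wise df.apply per topic may be worth trying. This is only a sketch and not part of the answer above: it assumes the same df and topics, splits each phrase once, and turns each topic into a set so membership tests do not scan a list. The column names topic_1, topic_2, ... are an assumption.

```python
import pandas as pd

df = pd.DataFrame({'phrases': ['very expensive meal near city center',
                               'very good meal and waiters',
                               'nice restaurant near center and public transport']})
topics = [['expensive', 'city'], ['good', 'waiters'], ['center', 'transport']]

# Split every phrase into words once, instead of once per topic.
tokens = df['phrases'].str.split()

for n, topic in enumerate(topics, start=1):
    topic_set = set(topic)  # set lookup instead of scanning a list
    # For each phrase, count the words that belong to this topic.
    df['topic_' + str(n)] = [sum(word in topic_set for word in words)
                             for words in tokens]

print(df.head())
```

Like find_count, this counts every occurrence of a matching word, so a phrase that repeats a keyword scores it more than once. Whether it is noticeably faster depends on the data, so it is worth timing on a sample first.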