Python: How to add a list of tokens to a new column of a dataframe

【Question title】: Python: How to add list of tokens to new column of dataframe
【Posted】: 2019-02-12 11:44:32
【Question description】:

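I have a large dataframe with over 50 rows. Each row has a 'tokens' column that contains a large number of text tokens. I used a for loop and a frequency distribution to find the top 10 tokens in each row of the 'tokens' column.

I am trying to add a new column called 'top10' to my dataframe so that, for each row, the top 10 tokens are stored in the 'top10' column.

This is the code I currently use to find the top 10 tokens of each row.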

import nltk

for i in range(len(df)):
    tokens = df.iloc[i]['tokens']
    # frequency distribution of the tokens in this row
    word_frequency = nltk.FreqDist(tokens)
    print(" ", word_frequency.most_common(10))

Example of my dataframe:

id location about age tokens
1    usa     ...  20   ['jim','hi','hello'......]
...
... 
40    uk     ...  50   ['bobby','hi','hey'......]

Expected output:

id location about age tokens                           top10
1    usa     ...  20   ['jim','hi','hello'......]   ['hi', 'paddy'....]
...
... 
40    uk     ...  50   ['bobby','hi','hey'......]   ['john', 'python'..]

The top10 column should list the words in descending order of frequency.

Any help is appreciated, thanks!

【Question comments】:

Tags: python pandas token

【Solution 1】:

Here is a simple way to add a new column to the DF:

df['top10'] = word_frequency.most_common(10)

【Comments】:

Make a list out of most_common(10) and then try it.
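One way to read the suggestion above is to collect one list of tokens per row and assign the collected lists as the new column. The following is a minimal sketch under stated assumptions: the sample data is hypothetical, and `collections.Counter` stands in for `nltk.FreqDist` (which subclasses `Counter` and offers the same `most_common()` method) so that the example runs without NLTK.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data standing in for the asker's dataframe.
df = pd.DataFrame({
    "id": [1, 2],
    "tokens": [
        ["hi", "jim", "hi", "hello", "hi", "hello"],
        ["hey", "bobby", "hey"],
    ],
})

top_tokens = []  # one list of tokens per row, in row order
for tokens in df["tokens"]:
    # Counter stands in for nltk.FreqDist here; both provide most_common().
    counts = Counter(tokens)
    # most_common() yields (token, count) pairs; keep only the tokens.
    top_tokens.append([token for token, _ in counts.most_common(10)])

# The outer list has len(df) entries, so each inner list lands in one cell.
df["top10"] = top_tokens
print(df)
```

Assigning a single `most_common(10)` result directly would instead ask pandas to spread its entries over the rows, which only works when the list length happens to equal the number of rows.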

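【Solution 2】:

pandas apply with the keyword arguments result_type='reduce' (do not expand the returned list into columns) and axis=1 (apply the function to each row; the default, axis=0, applies it to each column) is a better fit, since you are already iterating over rows. When a list is assigned directly to a column, pandas interprets it as a series of values, one per row, so it does not fit into a single cell.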

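import pandas as pd
import nltk

# Sample dataframe: 10 rows, each holding the same list of tokens
df = pd.DataFrame({x: {'tokens': ['hello', 'python', 'is', 'is', 'is', 'dog', 'god', 'cat', 'act', 'fraud', 'hola', 'the', 'a', 'the', 'on', 'no', 'of', 'foo', 'foo']} for x in range(0, 10)}).T


def most_common_words_list(x):
    # most_common(2) returns (word, count) tuples; keep only the words
    word_count_tups = nltk.FreqDist(x['tokens']).most_common(2)
    return [word for word, count in word_count_tups]

# axis=1 passes each row to the function; result_type='reduce' returns a Series instead of expanding the lists
df['top2'] = df.apply(most_common_words_list, result_type='reduce', axis=1)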
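Since only the tokens column is needed, a possible variation is to call `apply` on that column alone, which avoids the `axis` and `result_type` arguments. This is a sketch rather than the answerer's method: the helper name `top_tokens` and the sample data are hypothetical, and `collections.Counter` is used in place of `nltk.FreqDist` so the example runs without NLTK.

```python
from collections import Counter

import pandas as pd


def top_tokens(tokens, n=10):
    """Return the n most frequent tokens, most frequent first (hypothetical helper)."""
    # Counter.most_common() yields (token, count) pairs; keep only the tokens.
    return [token for token, _ in Counter(tokens).most_common(n)]


# Hypothetical sample data with one list of tokens per row.
df = pd.DataFrame({
    "tokens": [
        ["is", "a", "is", "the", "a", "is"],
        ["foo", "bar", "foo"],
    ],
})

# Series.apply calls the function once per cell and returns a Series of lists.
df["top2"] = df["tokens"].apply(lambda tokens: top_tokens(tokens, n=2))
print(df)
```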

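【Comments】:

Did you make a typo in the brackets of FreqDist(x['tokens'])? It works here.
The apply call replaces the entire for loop; apply iterates over every row for you.
You need exactly this line: df['top10'] = df.apply(lambda x: nltk.FreqDist(x['tokens']).most_common(10), result_type='reduce', axis=1)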
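Note that the one-liner in the last comment stores (token, count) pairs, because `most_common()` returns tuples, whereas the expected output in the question lists only the tokens. The short sketch below shows the difference on hypothetical data, again using `collections.Counter` as a stand-in for `nltk.FreqDist`.

```python
from collections import Counter

# Hypothetical tokens for a single row.
tokens = ["hi", "jim", "hi", "hello", "hi", "hello"]

# most_common(2) returns a list of (token, count) tuples, highest count first.
pairs = Counter(tokens).most_common(2)

# Drop the counts to get the plain list of tokens.
words = [token for token, _ in pairs]

print(pairs)
print(words)
```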