A list and a dataframe mapping to get a column value in python pandas: answers

【Question title】: A list and a dataframe mapping to get a column value in python pandas
【Posted】: 2021-05-03 09:52:28
【Question description】:

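I have a dataframe with words as the index and the corresponding sentiment scores in another column. I also have a second dataframe with a column that holds a list of words (a token list) in each row, so every row contains a different list. I want to find the average sentiment score for each list. This has to be done for a large number of rows, so efficiency matters. One approach I came up with is the following: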

import pandas as pd
a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a

'''
df
                      tokens
0                  [a, b, c]
1  [hi, this, is, a, sample]
'''

def find_score(tokenlist, ref_df):
    # ref_df contains two cols, 'tokens' and 'sentiment_score'
    temp_df = pd.DataFrame()
    temp_df['tokens'] = tokenlist
    # the inner merge keeps only the tokens that appear in ref_df
    return temp_df.merge(ref_df, on='tokens', how='inner')['sentiment_score'].mean()
    # this should return score

# Series.apply passes each list to find_score; extra arguments go in the args tuple
df['score'] = df['tokens'].apply(find_score, args=(ref_df,))
# each input for find_score will be a list

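Is there a more efficient way to do this without creating a dataframe for each list?

【Question comments】:

Tags: python pandas list dataframe multiprocessing

【Solution 1】:

You can build a mapping dictionary from the reference dataframe ref_df, then use .map() on the token list in each row of the dataframe df, like this: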

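import numpy as np

ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
# average the scores of the tokens found in ref_dict; tokens missing from it are skipped
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))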

Demo

Test data construction

a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a

ref_df = pd.DataFrame({'tokens': ['a', 'b', 'c', 'd', 'hi', 'this', 'is', 'sample', 'example'], 
'sentiment_score': [1, 2, 3, 4, 11, 12, 13, 14, 15]})

print(df)

                      tokens
0                  [a, b, c]
1  [hi, this, is, a, sample]


print(ref_df)

    tokens  sentiment_score
0        a                1
1        b                2
2        c                3
3        d                4
4       hi               11
5     this               12
6       is               13
7   sample               14
8  example               15

Running the new code

ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
df['score'] = df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict.keys()]))

Output

print(df)

                      tokens  score
0                  [a, b, c]    2.0
1  [hi, this, is, a, sample]   10.2
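
When none of a row's tokens appear in ref_dict, the lambda above calls np.mean on an empty list, which returns nan and also emits a RuntimeWarning. A minimal sketch of a helper that makes that case explicit is shown here; mean_score is a hypothetical name and the extra ['unknown'] row is added only for illustration, neither comes from the answer.

```python
import numpy as np
import pandas as pd

# Same test data as the demo, plus one row whose token is not in ref_df
df = pd.DataFrame({'tokens': [['a', 'b', 'c'],
                              ['hi', 'this', 'is', 'a', 'sample'],
                              ['unknown']]})
ref_df = pd.DataFrame({'tokens': ['a', 'b', 'c', 'd', 'hi', 'this', 'is', 'sample', 'example'],
                       'sentiment_score': [1, 2, 3, 4, 11, 12, 13, 14, 15]})
ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))


def mean_score(tokens, lookup):
    """Average the scores of the tokens found in lookup; NaN if none are found."""
    scores = [lookup[t] for t in tokens if t in lookup]
    return sum(scores) / len(scores) if scores else np.nan


df['score'] = df['tokens'].map(lambda toks: mean_score(toks, ref_dict))
print(df)
```

Returning NaN keeps the column numeric, so rows without any known token can be dropped or filled later with the usual pandas tools.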

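【Comments】:

ref_df is fairly large. That shouldn't be a problem, right? I'll try both approaches and try to add the time each one takes here.
@Jihjohn dict(zip()) is relatively fast, so it shouldn't be a problem. I tweaked the code slightly to handle the case where none of the words are found in ref_df. If that can happen in your use case, you can use the latest, slightly enhanced code above.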
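
For the timing comparison mentioned in the comments, one option is the standard-library timeit module. The sketch below runs the dictionary lookup and the per-row merge from the question on synthetic data. The vocabulary size, row count, and the function names with_dict and with_merge are assumptions made for illustration, and no particular timings are claimed.

```python
import random
import timeit

import numpy as np
import pandas as pd

# Synthetic data: sizes are arbitrary and only meant to make the gap measurable
rand = random.Random(0)
vocab = [f'w{i}' for i in range(1000)]
ref_df = pd.DataFrame({'tokens': vocab,
                       'sentiment_score': [rand.random() for _ in vocab]})
df = pd.DataFrame({'tokens': [rand.choices(vocab, k=5) for _ in range(200)]})


def with_dict(df, ref_df):
    # Dictionary lookup, as in the answer above
    ref_dict = dict(zip(ref_df['tokens'], ref_df['sentiment_score']))
    return df['tokens'].map(lambda x: np.mean([ref_dict[y] for y in x if y in ref_dict]))


def with_merge(df, ref_df):
    # One temporary dataframe and one merge per row, as in the question
    def find_score(tokenlist):
        temp_df = pd.DataFrame({'tokens': tokenlist})
        return temp_df.merge(ref_df, on='tokens', how='inner')['sentiment_score'].mean()
    return df['tokens'].apply(find_score)


t_dict = timeit.timeit(lambda: with_dict(df, ref_df), number=3)
t_merge = timeit.timeit(lambda: with_merge(df, ref_df), number=3)
print(f'dict lookup: {t_dict:.4f}s, per-row merge: {t_merge:.4f}s')
```

Both functions should return the same scores, so the comparison only measures how the lookup is performed.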

【Solution 2】:

Let's try explode, merge, and agg:

import pandas as pd

a = [['a', 'b', 'c'], ['hi', 'this', 'is', 'a', 'sample']]
df = pd.DataFrame()
df['tokens'] = a

ref_df = pd.DataFrame({'sentiment_score': {'a': 1, 'b': 2,
                                           'c': 3, 'hi': 4,
                                           'this': 5, 'is': 6,
                                           'sample': 7}})

# Explode Tokens into rows (Preserve original index)
new_df = df.explode('tokens').reset_index()
# Merge sentiment_scores
new_df = new_df.merge(ref_df, left_on='tokens',
                      right_index=True,
                      how='inner')
# Group By Original Index and agg back to lists and take mean
new_df = new_df.groupby('index') \
    .agg({'tokens': list, 'sentiment_score': 'mean'}) \
    .reset_index(drop=True)
print(new_df)

Output:

                      tokens  sentiment_score
0                  [a, b, c]              2.0
1  [a, hi, this, is, sample]              4.6

After explode:

   index  tokens
0      0       a
1      0       b
2      0       c
3      1      hi
4      1    this
5      1      is
6      1       a
7      1  sample

After merge:

   index  tokens  sentiment_score
0      0       a                1
1      1       a                1
2      0       b                2
3      0       c                3
4      1      hi                4
5      1    this                5
6      1      is                6
7      1  sample                7

(As a single chained statement)

new_df = df.explode('tokens') \
    .reset_index() \
    .merge(ref_df, left_on='tokens',
           right_index=True,
           how='inner') \
    .groupby('index') \
    .agg({'tokens': list, 'sentiment_score': 'mean'}) \
    .reset_index(drop=True)

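If the order of the tokens in the lists matters, you can compute the scores and merge them back into the original df instead of aggregating the lists: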

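# Select the numeric column before mean(): the string 'tokens' column cannot be averaged
# reset_index(drop=True) assumes df has a default RangeIndex and every row matched a token
mean_scores = df.explode('tokens') \
    .reset_index() \
    .merge(ref_df, left_on='tokens',
           right_index=True,
           how='inner') \
    .groupby('index')[['sentiment_score']].mean() \
    .reset_index(drop=True)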

new_df = df.merge(mean_scores,
                  left_index=True,
                  right_index=True)
print(new_df)

Output:

                      tokens  sentiment_score
0                  [a, b, c]              2.0
1  [hi, this, is, a, sample]              4.6
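
A related variant skips the merge altogether: map the exploded tokens through the score Series, then average per original row by grouping on the index level. This is a sketch that combines ideas from both answers rather than code taken from either; the names exploded and scores are assumptions. Tokens missing from ref_df map to NaN, which mean() skips, and a row with no matching token ends up as NaN.

```python
import pandas as pd

df = pd.DataFrame({'tokens': [['a', 'b', 'c'],
                              ['hi', 'this', 'is', 'a', 'sample']]})
ref_df = pd.DataFrame({'sentiment_score': {'a': 1, 'b': 2, 'c': 3, 'hi': 4,
                                           'this': 5, 'is': 6, 'sample': 7}})

# One token per row; the original row label is repeated in the index
exploded = df['tokens'].explode()
# Look each token up in the score Series (indexed by word)
scores = exploded.map(ref_df['sentiment_score'])
# Average per original row; assignment aligns on the index
df['sentiment_score'] = scores.groupby(level=0).mean()
print(df)
```

Because the token lists in df are never rebuilt, their original order is left untouched.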

【Comments】: