【Question title】: Distribute a row over other rows sharing the same key
【Posted】: 2020-01-07 14:19:06
【Question】:

I have a dataframe that looks like this:

+------+------------+-------+--------------+
| name |    date    | value | replacement  |
+------+------------+-------+--------------+
| A    | 20/11/2016 |    10 | NaN          |
| C    | 20/11/2016 |    8  | [A,B]        |
| B    | 20/11/2016 |    12 | NaN          |
| E    | 25/12/2016 |    16 | NaN          |
| F    | 25/12/2016 |    18 | NaN          |
| D    | 25/12/2016 |    11 | [E,F]        |
+------+------------+-------+--------------+
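
For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is a minimal sketch: it assumes the 'replacement' column holds Python lists of names (and NaN elsewhere) and that dates are kept as plain strings, neither of which is stated explicitly in the post.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above.
# Assumption: 'replacement' cells are either NaN or a Python list of names.
df = pd.DataFrame({
    "name": ["A", "C", "B", "E", "F", "D"],
    "date": ["20/11/2016"] * 3 + ["25/12/2016"] * 3,
    "value": [10, 8, 12, 16, 18, 11],
    "replacement": [np.nan, ["A", "B"], np.nan, np.nan, np.nan, ["E", "F"]],
})
print(df)
```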

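What I want to do:
For every row that has a list of names in the 'replacement' column, I want its 'value' to be split equally among the rows whose name appears in that list and that share the same date.
For the example above, the output would look like this: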

+------+------------+-------+------------------+
| name |    date    | value | additional value |
+------+------------+-------+------------------+
| A    | 20/11/2016 |    10 |                4 |
| B    | 20/11/2016 |    12 |                4 |
| E    | 25/12/2016 |    16 |              5.5 |
| F    | 25/12/2016 |    18 |              5.5 |
+------+------------+-------+------------------+

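I managed to find a way to perform the distribution directly, without creating a new column, by splitting those rows and grouping by name + date, but 1/ it is far too slow and 2/ I really do need that additional column and cannot find a way to create it.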

【Comments】:

  • Is the length of the list always equal to the number of records preceding that row?

Tags: python pandas dataframe data-processing


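【Solution 1】:

The idea is to create a new column holding the length of each list in replacement with Series.str.len, then use DataFrame.explode (pandas 0.25+) to turn the lists into scalars. Divide the value column by new to get each recipient's share, then merge the exploded replacement names back against the name column (together with date) to attach the shares to the original rows: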

df1 = df.assign(new=df['replacement'].str.len()).explode('replacement')
df1['new'] = df1['value'].div(df1['new'])

df1 = df1[['name','date','value']].merge(df1[['replacement','date','new']],
                                    left_on=['name','date'],
                                    right_on=['replacement','date'])
df1['replacement'] = df1.pop('new')
print (df1)
  name        date  value  replacement
0    A  20/11/2016     10          4.0
1    B  20/11/2016     12          4.0
2    A  25/12/2016     16          5.5
3    B  25/12/2016     18          5.5
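
The same steps can be wrapped into a self-contained function. This is only a sketch under a few assumptions: the helper name distribute, the intermediate columns n and share, and the output column name 'additional value' are chosen here for illustration, and the sample frame is rebuilt from the question with lists in 'replacement'.

```python
import numpy as np
import pandas as pd

def distribute(df):
    """Hypothetical helper: give each listed name an equal share of the donor row's value."""
    # Count the recipients per row, then turn each list into one row per recipient.
    exploded = df.assign(n=df["replacement"].str.len()).explode("replacement")
    # Equal share that every recipient receives from its donor row.
    exploded["share"] = exploded["value"] / exploded["n"]
    # Keep only real donor -> recipient pairs and rename so they can be matched on name.
    shares = (exploded.dropna(subset=["replacement"])
                      [["replacement", "date", "share"]]
                      .rename(columns={"replacement": "name",
                                       "share": "additional value"}))
    # Inner merge: only rows that actually receive a share are kept.
    return df[["name", "date", "value"]].merge(shares, on=["name", "date"])

# Sample data from the question (assumed to hold lists in 'replacement').
df = pd.DataFrame({
    "name": ["A", "C", "B", "E", "F", "D"],
    "date": ["20/11/2016"] * 3 + ["25/12/2016"] * 3,
    "value": [10, 8, 12, 16, 18, 11],
    "replacement": [np.nan, ["A", "B"], np.nan, np.nan, np.nan, ["E", "F"]],
})
result = distribute(df)
print(result)
```

On this sample, A and B each receive 4.0 (8 split two ways) and E and F each receive 5.5 (11 split two ways). Renaming the exploded column to name before merging avoids the left_on/right_on pair and the later pop, at the cost of one extra rename.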

A similar solution drops the unwanted columns instead of selecting the wanted ones:

df1 = df.assign(new=df['replacement'].str.len()).explode('replacement')
df1['new'] = df1['value'].div(df1['new'])

# columns= replaces the positional axis argument, which pandas 2.0 no longer accepts
df1 = df1.drop(columns=['replacement','new']).merge(df1.drop(columns=['name','value']),
                                        left_on=['name','date'],
                                        right_on=['replacement','date'])
df1['replacement'] = df1.pop('new')
print (df1)
  name        date  value  replacement
0    A  20/11/2016     10          4.0
1    B  20/11/2016     12          4.0
2    A  25/12/2016     16          5.5
3    B  25/12/2016     18          5.5


    【Solution 2】:

    Here is another way, using explode (requires pandas 0.25+) and groupby:

    # Rows whose 'replacement' cell holds a list
    m = df[[isinstance(i, list) for i in df.replacement]]

    # Explode the lists and group by date
    g = m.explode('replacement').groupby('date')
    # Drop the list rows and assign each date's per-recipient share (aligned on the date index)
    final = df.drop(m.index).set_index('date').assign(
          replacement=(g['value'].mean()/g.size())).reset_index()
    print (final)

             date name  value  replacement
    0  20/11/2016    A   10.0          4.0
    1  20/11/2016    B   12.0          4.0
    2  25/12/2016    A   16.0          5.5
    3  25/12/2016    B   18.0          5.5
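
The groupby version above divides the mean value per date by the number of exploded rows, which matches the expected result when each date has exactly one list row and every remaining row on that date is named in it. If that does not hold, one possible variant is to compute each share per list row, sum the shares per (name, date), and left-merge them back. The sketch below is an assumption about the desired behaviour, not something the question specifies: the extra row G, the column name 'additional value', and filling missing shares with 0 are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question plus one hypothetical extra donor row (G),
# added only to show a name receiving shares from two list rows on the same date.
df = pd.DataFrame({
    "name": ["A", "C", "B", "G", "E", "F", "D"],
    "date": ["20/11/2016"] * 4 + ["25/12/2016"] * 3,
    "value": [10, 8, 12, 6, 16, 18, 11],
    "replacement": [np.nan, ["A", "B"], np.nan, ["A"],
                    np.nan, np.nan, ["E", "F"]],
})

# Donor rows are those whose 'replacement' cell is a list.
donors = df[df["replacement"].apply(lambda x: isinstance(x, list))]

# Share per recipient for each donor row, then summed per (name, date).
shares = (
    donors.assign(share=donors["value"] / donors["replacement"].str.len())
          .explode("replacement")
          .groupby(["replacement", "date"], as_index=False)["share"].sum()
          .rename(columns={"replacement": "name", "share": "additional value"})
)

# Left merge keeps every non-donor row; rows that receive nothing get 0 (assumed).
final = (
    df.drop(donors.index)[["name", "date", "value"]]
      .merge(shares, on=["name", "date"], how="left")
      .fillna({"additional value": 0})
)
print(final)
```

Here A receives 4.0 from C plus 6.0 from G, while B, E and F receive 4.0, 5.5 and 5.5 respectively.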
    
