[Question title]: Get value from previous column for each group in groupby
[Posted]: 2022-01-19 23:11:10
[Question]:

This is my df -

Site        Product      Period  Inflow  Outflow  Production  Opening Inventory  New Opening Inventory  Closing Inventory  Production Needed
California  Apples            1       0     3226        4300               1213                   1213                  0                  0
California  Apples            2       0     3279        3876                  0                      0                  0                  0
California  Apples            3       0     4390        4530                  0                      0                  0                  0
California  Apples            4       0     4281        3870                  0                      0                  0                  0
California  Apples            5       0     4421        4393                  0                      0                  0                  0
California  Oranges           1       0      505         400                  0                      0                  0                  0
California  Oranges           2       0      278         505                  0                      0                  0                  0
California  Oranges           3       0      167         278                  0                      0                  0                  0
California  Oranges           4       0      124         167                  0                      0                  0                  0
California  Oranges           5       0      106         124                  0                      0                  0                  0
Montreal    Maple Syrup       1       0      445         465                293                    293                  0                  0
Montreal    Maple Syrup       2       0       82         398                  0                      0                  0                  0
Montreal    Maple Syrup       3       0      745         346                  0                      0                  0                  0
Montreal    Maple Syrup       4       0      241         363                  0                      0                  0                  0
Montreal    Maple Syrup       5       0      189         254                  0                      0                  0                  0

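As shown, there are three groups when grouping by Site and Product. For each of the three groups, I want to do the following (for periods 2 to 5) -

  • Set New Opening Inventory to the Closing Inventory of the previous period
  • Calculate the Closing Inventory of that period using the formula Closing Inventory = Production + Inflow + New Opening Inventory - Outflow

I am trying to solve this with a combination of groupby and a for loop.

This is what I have so far -

If df were a single group, I could simply do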

# calculate closing inventory of period 1
df['Closing Inventory'] = np.where(df['Period']==1, <formula>, 0)

for i in range(1, len(df)):
    df.loc[i, 'New Opening Inventory'] = df.loc[i-1, 'Closing Inventory']
    df.loc[i, 'Closing Inventory'] = df.loc[i, 'Production'] + df.loc[i, 'Inflow'] + df.loc[i, 'New Opening Inventory'] - df.loc[i, 'Outflow']

When I try to nest this for loop inside a loop over the groups

# calculate closing inventory of period 1 for all groups
df['Closing Inventory'] = np.where(df['Period']==1, <formula>, 0)

g = df.groupby(['Site', 'Product'])

alist = []

for k in g.groups.keys():
    temp = g.get_group(k).reset_index(drop=True)
    for i in range(1, len(temp)):
        temp.loc[i, 'New Opening Inventory'] = temp.loc[i-1, 'Closing Inventory']
        temp.loc[i, 'Closing Inventory'] = temp.loc[i, 'Production'] + temp.loc[i, 'Inflow'] + temp.loc[i, 'New Opening Inventory'] - temp.loc[i, 'Outflow']
    alist.append(temp)

df2 = pd.concat(alist, ignore_index=True)
df2

This solution works, but using nested loops seems very inefficient. Is there a better way to do this?

[Question comments]:

    Tags: python pandas dataframe pandas-groupby


    [Solution 1]:

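    Your New Opening Inventory is always the previous Closing Inventory.

    So I can rewrite the formula

    Closing Inventory = Production + Inflow + New Opening Inventory - Outflow

    as

    Closing Inventory = Production + Inflow + Previous Closing Inventory - Outflow

    For the first row, you have no Closing Inventory yet. From row 2 onward, you calculate the Closing Inventory and carry it forward to the next row.

    Before getting the Closing Inventory, first calculate "Production" + "Inflow" - "Outflow" with a list comprehension. A list comprehension typically runs faster than a row-by-row for loop.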

    df['Closing Inventory'] = [x + y - z if p > 1 else 0 for p, x, y, z in zip(df['Period'], df['Production'], df['Inflow'], df['Outflow'])]
    
    # df[['Site', 'Product', 'Closing Inventory']]
    #         Site  Product Closing Inventory
    # 0 California  Apples                  0
    # 1 California  Apples                597
    # 2 California  Apples                140
    # 3 California  Apples               -411
    # 4 California  Apples                -28
    # 5 California  Oranges                 0
    # 6 California  Oranges               227
    # 7 California  Oranges               111
    # ...
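
As a side note, the same per-row net change can probably be computed without a Python-level loop at all. The following is a minimal sketch, assuming the column names from the question and a small sample copied from its table; `net` is a name introduced here for illustration.

```python
import pandas as pd

# Small sample copied from the question's table (two groups, three periods each).
df = pd.DataFrame({
    "Site": ["California"] * 6,
    "Product": ["Apples"] * 3 + ["Oranges"] * 3,
    "Period": [1, 2, 3, 1, 2, 3],
    "Inflow": [0, 0, 0, 0, 0, 0],
    "Outflow": [3226, 3279, 4390, 505, 278, 167],
    "Production": [4300, 3876, 4530, 400, 505, 278],
})

# Net change per row; rows where Period == 1 are forced to 0,
# mirroring the `if p > 1 else 0` branch of the list comprehension.
net = (df["Production"] + df["Inflow"] - df["Outflow"]).where(df["Period"] > 1, 0)
df["Closing Inventory"] = net

print(df[["Site", "Product", "Closing Inventory"]])
```

Whether this is faster than the list comprehension on a given dataset is something to measure rather than assume; on very small frames the difference may be negligible.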
    

    Then the rest of the formula is adding the previously calculated Closing Inventory, which means you can cumsum this result.

    For row 1, Previous Closing (0) + calculated part (597) = 597
    For row 2, Previous Closing (597) + calculated part (140) = 737
    ...
    
    df['Closing Inventory'] = df.groupby(['Site', 'Product'])['Closing Inventory'].cumsum()
    
    # df[['Site', 'Product', 'Closing Inventory']]
    #         Site  Product Closing Inventory
    # 0 California  Apples                  0
    # 1 California  Apples                597
    # 2 California  Apples                737
    # 3 California  Apples                326
    # 4 California  Apples                298
    # 5 California  Oranges                 0
    # 6 California  Oranges               227
    # 7 California  Oranges               338
    # ...
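
To see why a grouped cumulative sum reproduces the row-by-row recurrence, it may help to unroll it. The symbols below are introduced here for illustration only: within one (Site, Product) group, C_t is the Closing Inventory of period t, and P_t, I_t, O_t are Production, Inflow and Outflow. The sketch assumes C_1 = 0, as in the output above.

```latex
\begin{aligned}
C_t &= P_t + I_t + C_{t-1} - O_t, \qquad t \ge 2, \\
C_t &= C_1 + \sum_{s=2}^{t} \left( P_s + I_s - O_s \right), \qquad C_1 = 0 .
\end{aligned}
```

For Apples, period 3, this gives 0 + 597 + 140 = 737, which matches the cumsum output.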
    

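    Similarly, New Opening Inventory is always the previous Closing Inventory, except when Period is 1. So first shift Closing Inventory, then keep the existing New Opening Inventory where Period is 1.

    I use combine_first to pick the value from either New Opening Inventory or the shifted Closing Inventory.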

    df['New Opening Inventory'] = (df['New Opening Inventory'].replace(0, np.nan)
                                   .combine_first(
                                       df.groupby(['Site', 'Product'])['Closing Inventory']
                                       .shift()
                                       .fillna(0)
                                   ).astype(int))
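
An alternative that keys on Period explicitly, instead of treating 0 as "missing", could look like the sketch below. It assumes the column names from the question and uses the Closing Inventory values already computed above; `prev_closing` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Site": ["California"] * 6,
    "Product": ["Apples"] * 3 + ["Oranges"] * 3,
    "Period": [1, 2, 3, 1, 2, 3],
    "New Opening Inventory": [1213, 0, 0, 0, 0, 0],
    "Closing Inventory": [0, 597, 737, 0, 227, 338],
})

# Previous row's closing value within each (Site, Product) group.
# The first row of every group has no predecessor, so shift() yields NaN there.
prev_closing = df.groupby(["Site", "Product"])["Closing Inventory"].shift()

# Keep the given opening value in period 1; otherwise carry the previous closing forward.
df["New Opening Inventory"] = np.where(
    df["Period"] == 1,
    df["New Opening Inventory"],
    prev_closing.fillna(0),
).astype(int)

print(df)
```

Unlike the combine_first version, this overwrites any non-zero New Opening Inventory found in periods 2 and later, so the two only agree when those cells start out as 0, as they do in the question's data.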
    

    Result

              Site  Product Period  New Opening Inventory Closing Inventory
    0   California  Apples       1                   1213                 0
    1   California  Apples       2                      0               597
    2   California  Apples       3                    597               737
    3   California  Apples       4                    737               326
    4   California  Apples       5                    326               298
    5   California  Oranges      1                      0                 0
    6   California  Oranges      2                      0               227
    7   California  Oranges      3                    227               338
    ...
    

    With the sample data on my laptop,

    Original solution: 8.44 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    This solution: 2.95 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    I think this solution runs faster because it relies on a list comprehension and vectorized functions.
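
Before relying on the faster version, it seems worth checking that it gives the same Closing Inventory as the nested loop. The sketch below does that on two groups copied from the question's table. The function names `closing_by_loop` and `closing_by_cumsum` are hypothetical, and both assume that the Closing Inventory of period 1 is 0, as in the result shown above.

```python
import pandas as pd

df = pd.DataFrame({
    "Site": ["California"] * 5 + ["Montreal"] * 5,
    "Product": ["Apples"] * 5 + ["Maple Syrup"] * 5,
    "Period": [1, 2, 3, 4, 5] * 2,
    "Inflow": [0] * 10,
    "Outflow": [3226, 3279, 4390, 4281, 4421, 445, 82, 745, 241, 189],
    "Production": [4300, 3876, 4530, 3870, 4393, 465, 398, 346, 363, 254],
})


def closing_by_loop(df):
    """Reference: row-by-row recurrence per group (period-1 closing assumed 0)."""
    parts = []
    for _, grp in df.groupby(["Site", "Product"], sort=False):
        grp = grp.reset_index(drop=True)
        grp["Closing Inventory"] = 0
        for i in range(1, len(grp)):
            grp.loc[i, "Closing Inventory"] = (
                grp.loc[i, "Production"]
                + grp.loc[i, "Inflow"]
                + grp.loc[i - 1, "Closing Inventory"]
                - grp.loc[i, "Outflow"]
            )
        parts.append(grp)
    return pd.concat(parts, ignore_index=True)["Closing Inventory"]


def closing_by_cumsum(df):
    """Net change per row (0 in period 1), accumulated within each group."""
    net = (df["Production"] + df["Inflow"] - df["Outflow"]).where(df["Period"] > 1, 0)
    return net.groupby([df["Site"], df["Product"]]).cumsum()


loop_result = closing_by_loop(df).tolist()
cumsum_result = closing_by_cumsum(df).tolist()
print(loop_result == cumsum_result)
```

The comparison relies on the rows already being ordered by group and then by Period, as in the question; with unsorted data, sorting first would be needed for the loop and the cumsum to line up.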

    [Comments]:

    • I think I got the same result as you, but let me know if the logic looks strange.
    • Wow, thanks for the explanation and the efficient solution!