Dataframe optimize groupby & calculated fields: answers

【Question Title】: Dataframe optimize groupby & calculated fields
【Posted】: 2021-11-22 20:52:20
【Question Description】:

I have a dataframe with the following structure:

import pandas as pd
import numpy as np

names = ['PersonA', 'PersonB', 'PersonC', 'PersonD','PersonE','PersonF']
team = ['Team1','Team2']
dates = pd.date_range(start = '2020-05-28', end = '2021-11-22')

df = pd.DataFrame({'runtime': np.repeat(dates, len(names)*len(team))})
df['name'] = len(dates)*len(team)*names
df['team'] = len(dates)*len(names)*team
df['A'] = 40 + 20*np.random.random(len(df))
df['B'] = .1 * np.random.random(len(df))
df['C'] = 1 +.5 * np.random.random(len(df))

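I would like to create a dataframe that shows the mean of each value column computed over runtime periods such as the previous week, month, year, and all time, laid out like this: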

name | team | A_w | B_w | C_w| A_m | B_m | C_m | A_y | B_y | C_y | A_at | B_at | C_at

I have successfully added a calculated column for the mean using the lambda approach described here: How do I create a new column from the output of pandas groupby().sum()?

For example: df = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_at=lambda gdf: gdf['A'].mean()))

My output now has one additional column:

   runtime     name   team          A         B         C       A_at
0    2020-05-28  PersonA  Team1  55.608186  0.027767  1.311662  49.957820
1    2020-05-28  PersonB  Team2  43.481041  0.038685  1.144240  50.057015
2    2020-05-28  PersonC  Team1  47.277667  0.012190  1.047263  50.151846
3    2020-05-28  PersonD  Team2  41.995354  0.040623  1.087151  50.412061
4    2020-05-28  PersonE  Team1  49.824062  0.036805  1.416110  50.073381
...         ...      ...    ...        ...       ...       ...        ...
6523 2021-11-22  PersonB  Team2  46.799963  0.069523  1.322076  50.057015
6524 2021-11-22  PersonC  Team1  48.851620  0.007291  1.473467  50.151846
6525 2021-11-22  PersonD  Team2  49.711142  0.051443  1.044063  50.412061
6526 2021-11-22  PersonE  Team1  57.074027  0.095908  1.464404  50.073381
6527 2021-11-22  PersonF  Team2  41.372381  0.059240  1.132346  50.094965

[6528 rows x 7 columns]
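
The output keeps all 6528 rows because `assign` inside `apply` returns every row of each group with the mean repeated on it. As a minimal sketch, assuming the goal is one row per (name, team), the same mean can be broadcast with `transform` or collapsed with a plain aggregation. The `demo` frame below is a small stand-in made up for illustration, not the data from the question.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (assumed shape: one row per runtime and person).
rng = np.random.default_rng(0)
names = ['PersonA', 'PersonB', 'PersonC']
teams = ['Team1', 'Team2', 'Team1']
dates = pd.date_range('2021-11-01', '2021-11-10')
demo = pd.DataFrame({
    'runtime': np.repeat(dates, len(names)),
    'name': names * len(dates),
    'team': teams * len(dates),
})
demo['A'] = 40 + 20 * rng.random(len(demo))

# Row-preserving: the group mean is repeated on every row of the group,
# which is what the apply/assign pattern produces.
broadcast = demo.assign(
    A_at=demo.groupby(['name', 'team'])['A'].transform('mean')
)

# Collapsing: one row per (name, team), with no runtime column left over.
summary = (
    demo.groupby(['name', 'team'], as_index=False)['A']
        .mean()
        .rename(columns={'A': 'A_at'})
)
print(summary)
```

Here `broadcast` has as many rows as `demo`, while `summary` has one row per group and only the `name`, `team`, and `A_at` columns.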

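But this is where it gets messy...

I don't need the runtime column, and I'm not sure how to clean this up so that only the 'name' and 'team' columns are listed alongside the means. In addition, the way I have been generating the source dataframe(s) is by using a for loop to recreate the entire dataframe for each time period: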

for pt in runtimes[:d]:
  <insert dataframe creation for d# of runtimes>
  if d==7:
    dfw = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
  if d==30:
    dfm = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))

Then I tried to concatenate the outputs like this:

dfs = pd.concat([dfw, dfm])

When d ...

Any tips on how to make this more efficient would be greatly appreciated.

【Question Discussion】:

Tags: python-3.x pandas dataframe

【Solution 1】:

Update... I have been able to work out decent output by doing the following:

dfs = pd.DataFrame(columns=['name','team'])
for pt in runtimes[:d]:
    if d == 7:
      df = <insert dataframe creation for d# of runtimes>
      dfw = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
      ...
      dfw = dfw[['name', 'A_w','B_w','C_w','team']]
      dfs = pd.merge(dfs, dfw, how='inner', on=['name', 'team'])
    if d == 30:
      df = <insert dataframe creation for d# of runtimes>
      dfm = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))
      ...
      dfm = dfm[['name', 'A_m','B_m','C_m','team']]
      dfs = pd.merge(dfs, dfm, how='inner', on=['name', 'team'])

This gives me the desired output.

【Discussion】:
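
For comparison, here is a sketch of one way to build the same summary without recreating the dataframe for each period: filter the existing frame by date, aggregate each window, and join the results side by side. The helper name `window_means` and the window definitions (7, 30, and 365 trailing days counted back from the most recent runtime) are assumptions for illustration; the thread does not say exactly how the periods are bounded.

```python
import numpy as np
import pandas as pd

def window_means(df, windows, keys=('name', 'team'), values=('A', 'B', 'C')):
    """Return one row per key combination with mean columns for each window.

    `windows` maps a column suffix to a number of trailing days,
    or to None for all time (hypothetical helper, not from the thread).
    """
    keys, values = list(keys), list(values)
    latest = df['runtime'].max()
    parts = []
    for suffix, days in windows.items():
        # Select the window by filtering rather than rebuilding the frame.
        if days is None:
            subset = df
        else:
            subset = df[df['runtime'] > latest - pd.Timedelta(days=days)]
        # Aggregating drops 'runtime' and leaves (name, team) as the index.
        parts.append(subset.groupby(keys)[values].mean().add_suffix(f'_{suffix}'))
    # Every part shares the (name, team) index, so align them column-wise.
    return pd.concat(parts, axis=1).reset_index()

# Same construction as the frame in the question.
names = ['PersonA', 'PersonB', 'PersonC', 'PersonD', 'PersonE', 'PersonF']
team = ['Team1', 'Team2']
dates = pd.date_range(start='2020-05-28', end='2021-11-22')
df = pd.DataFrame({'runtime': np.repeat(dates, len(names) * len(team))})
df['name'] = len(dates) * len(team) * names
df['team'] = len(dates) * len(names) * team
df['A'] = 40 + 20 * np.random.random(len(df))
df['B'] = .1 * np.random.random(len(df))
df['C'] = 1 + .5 * np.random.random(len(df))

result = window_means(df, {'w': 7, 'm': 30, 'y': 365, 'at': None})
print(result)
```

With the default `axis=0`, `pd.concat` stacks frames as extra rows; `axis=1` aligns them on the shared (name, team) index instead, which yields the `name | team | A_w | ... | C_at` layout from the question.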