【Posted】: 2021-11-22 20:52:20
【Question】:
I have a dataframe with the following structure:
import pandas as pd
import numpy as np
names = ['PersonA', 'PersonB', 'PersonC', 'PersonD','PersonE','PersonF']
team = ['Team1','Team2']
dates = pd.date_range(start = '2020-05-28', end = '2021-11-22')
df = pd.DataFrame({'runtime': np.repeat(dates, len(names)*len(team))})
df['name'] = len(dates)*len(team)*names
df['team'] = len(dates)*len(names)*team
df['A'] = 40 + 20*np.random.random(len(df))
df['B'] = .1 * np.random.random(len(df))
df['C'] = 1 +.5 * np.random.random(len(df))
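I want to create a dataframe that shows the averages over the runtimes for several periods (the previous week, month, year, and all time), laid out like this:
name | team | A_w | B_w | C_w | A_m | B_m | C_m | A_y | B_y | C_y | A_at | B_at | C_at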
I have successfully added a computed column for the average using the lambda approach described here: How do I create a new column from the output of pandas groupby().sum()?
For example:
df = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_at=lambda gdf: gdf['A'].mean()))
My output gives me one extra column:
runtime name team A B C A_at
0 2020-05-28 PersonA Team1 55.608186 0.027767 1.311662 49.957820
1 2020-05-28 PersonB Team2 43.481041 0.038685 1.144240 50.057015
2 2020-05-28 PersonC Team1 47.277667 0.012190 1.047263 50.151846
3 2020-05-28 PersonD Team2 41.995354 0.040623 1.087151 50.412061
4 2020-05-28 PersonE Team1 49.824062 0.036805 1.416110 50.073381
... ... ... ... ... ... ... ...
6523 2021-11-22 PersonB Team2 46.799963 0.069523 1.322076 50.057015
6524 2021-11-22 PersonC Team1 48.851620 0.007291 1.473467 50.151846
6525 2021-11-22 PersonD Team2 49.711142 0.051443 1.044063 50.412061
6526 2021-11-22 PersonE Team1 57.074027 0.095908 1.464404 50.073381
6527 2021-11-22 PersonF Team2 41.372381 0.059240 1.132346 50.094965
[6528 rows x 7 columns]
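For comparison, the same per-row column can be sketched without apply and a lambda, using groupby().transform('mean'). The small dataframe below is a made-up stand-in for the real data, not the data from this question:

```python
import pandas as pd

# Made-up stand-in for the question's dataframe (illustrative values only).
df = pd.DataFrame({
    'name': ['PersonA', 'PersonA', 'PersonB', 'PersonB'],
    'team': ['Team1', 'Team1', 'Team2', 'Team2'],
    'A': [40.0, 50.0, 44.0, 48.0],
})

# transform('mean') returns a Series aligned to df's original index,
# so each row receives the mean of its own (name, team) group.
df['A_at'] = df.groupby(['name', 'team'])['A'].transform('mean')
print(df)
```

This keeps the original row order and index, whereas the apply version may return the rows regrouped, depending on the pandas version.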
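But this is where it gets messy...
I don't need the runtime column, and I'm not sure how to clean this up so that it only lists the 'name' and 'team' columns. In addition, the way I've been generating the source dataframe(s) is by using a for loop to recreate the whole dataframe for each time period: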
for pt in runtimes[:d]:
    <insert dataframe creation for d# of runtimes>
if d == 7:
    dfw = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
if d == 30:
    dfm = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))
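On the first point (listing only 'name' and 'team'), here is a minimal sketch, assuming one row per (name, team) pair is what is wanted. Aggregating instead of assigning collapses each group to a single row, so runtime drops out without a separate cleanup step. The data is a made-up stand-in:

```python
import pandas as pd

# Made-up stand-in for the question's dataframe (illustrative values only).
df = pd.DataFrame({
    'runtime': pd.to_datetime(['2021-11-20', '2021-11-21',
                               '2021-11-20', '2021-11-21']),
    'name': ['PersonA', 'PersonA', 'PersonB', 'PersonB'],
    'team': ['Team1', 'Team1', 'Team2', 'Team2'],
    'A': [40.0, 50.0, 44.0, 48.0],
    'B': [0.02, 0.04, 0.06, 0.08],
    'C': [1.0, 1.5, 1.25, 1.25],
})

# Named aggregation: one output row per (name, team) group.
# as_index=False keeps the group keys as ordinary columns.
summary = df.groupby(['name', 'team'], as_index=False).agg(
    A_at=('A', 'mean'),
    B_at=('B', 'mean'),
    C_at=('C', 'mean'),
)
print(summary)
```

Only the group keys and the aggregated columns survive, so the result has the columns name, team, A_at, B_at, C_at.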
Then I tried concatenating the outputs (dfw and dfm) like this:
dfs = pd.concat([dfw, dfm])
When d
Any tips on how to make this more efficient would be greatly appreciated.
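One possible way to avoid rebuilding the dataframe for every period is to build it once, filter it by a date cutoff for each window, and join the per-window means side by side. This is only a sketch under stated assumptions: the window lengths (7, 30, and 365 days), the `windows` dict, and the seeded random generator are illustrative choices, and each window is measured back from the latest runtime in the data:

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question (seeded only for repeatability).
rng = np.random.default_rng(0)
names = ['PersonA', 'PersonB', 'PersonC', 'PersonD', 'PersonE', 'PersonF']
team = ['Team1', 'Team2']
dates = pd.date_range(start='2020-05-28', end='2021-11-22')
df = pd.DataFrame({'runtime': dates.repeat(len(names) * len(team))})
df['name'] = len(dates) * len(team) * names
df['team'] = len(dates) * len(names) * team
df['A'] = 40 + 20 * rng.random(len(df))
df['B'] = 0.1 * rng.random(len(df))
df['C'] = 1 + 0.5 * rng.random(len(df))

# Assumed window lengths in days; None means "all time".
windows = {'w': 7, 'm': 30, 'y': 365, 'at': None}
latest = df['runtime'].max()

parts = []
for suffix, days in windows.items():
    if days is None:
        subset = df
    else:
        # Keep only the rows from the last `days` days, latest date included.
        subset = df[df['runtime'] > latest - pd.Timedelta(days=days)]
    means = subset.groupby(['name', 'team'])[['A', 'B', 'C']].mean()
    parts.append(means.add_suffix(f'_{suffix}'))

# axis=1 aligns the parts on the shared (name, team) index.
result = pd.concat(parts, axis=1).reset_index()
print(result)
```

Concatenating with axis=1 places the windows next to each other, while the default axis=0 in pd.concat([dfw, dfm]) stacks them as extra rows. A group with no rows inside a window would get NaN in that window's columns.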
【Comments】:
Tags: python-3.x pandas dataframe