【Posted】: 2020-10-19 14:00:05
【Question】:
Hi, here is my (toy) data:
import pandas as pd

data = {'p1': [100., 101., 102., 100., 100.],
        'p2': [100., 99., 98., 100., 100.],
        'p3': [1000., 1000., 100., 1000., 1000.]
}
# Wide frame (one column per type) reshaped to long format: date, type, price
df = (pd.DataFrame(data, index=pd.bdate_range(start='20100101', periods=5))
    .stack()
    .reset_index()
    .rename(columns={'level_0': 'date', 'level_1': 'type', 0: 'price'})
    .sort_values('date')
)
# Return of each type from one date to the next
df['perf'] = df.groupby('type')['price'].pct_change(1)
df.sort_values('type')
It looks like this:
         date type   price      perf
0  2010-01-01   p1   100.0       NaN
3  2010-01-04   p1   101.0  0.010000
6  2010-01-05   p1   102.0  0.009901
9  2010-01-06   p1   100.0 -0.019608
12 2010-01-07   p1   100.0  0.000000
1  2010-01-01   p2   100.0       NaN
4  2010-01-04   p2    99.0 -0.010000
7  2010-01-05   p2    98.0 -0.010101
10 2010-01-06   p2   100.0  0.020408
13 2010-01-07   p2   100.0  0.000000
2  2010-01-01   p3  1000.0       NaN
5  2010-01-04   p3  1000.0  0.000000
8  2010-01-05   p3   100.0 -0.900000  -> outlier
11 2010-01-06   p3  1000.0  9.000000  -> outlier
14 2010-01-07   p3  1000.0  0.000000
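I want to replace these two values with the mean or median of the perf column, computed without those data points. Concretely, this is what I compute (with help from a previous question):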
# perf for each type
df['perf'] = df.groupby('type')['price'].pct_change(1)
# Outliers & replace value with median by date
outliers = df['perf'].abs() >= 0.5
df.loc[outliers, 'perf'] = (df[~outliers]
    .groupby('date')['perf']   # select perf so the non-numeric 'type' column is not aggregated
    .median()
    .loc[df.loc[outliers, 'date']]
    .values
)
# New price with the same initial value of the prices but with perf corrected
df['price2'] = (df.groupby('type')['price'].transform(lambda x: x.iloc[0])
    .mul(df.groupby('type')['perf'].transform(lambda x: (1 + x).cumprod()), fill_value=1)
)
df.sort_values('type')
But in the end it is not very "nice". Is there a way to improve my code by wrapping it in a function?
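For illustration, here is a minimal sketch of how the steps above might be wrapped in a single function. The name `replace_outlier_perf` and its `threshold` parameter are hypothetical and not part of the original code. The sketch assumes a long-format frame with `date`, `type` and `price` columns, and it builds the toy data with `melt` instead of `stack`, which is a swapped-in but equivalent reshaping.

```python
import pandas as pd


def replace_outlier_perf(df, threshold=0.5):
    """Hypothetical helper: replace outlier returns by the per-date median.

    Expects columns 'date', 'type' and 'price'; returns a sorted copy
    with 'perf' (corrected returns) and 'price2' (rebuilt prices).
    """
    out = df.sort_values(['type', 'date']).copy()
    # Return of each type from one date to the next
    out['perf'] = out.groupby('type')['price'].pct_change(1)
    # A return is flagged when its absolute value reaches the threshold
    outliers = out['perf'].abs() >= threshold
    # Median return per date, computed from the non-flagged rows only
    median_by_date = out.loc[~outliers].groupby('date')['perf'].median()
    out.loc[outliers, 'perf'] = out.loc[outliers, 'date'].map(median_by_date)
    # Rebuild prices from the first price of each type and the corrected
    # returns; a missing return (e.g. the first date) counts as no change
    growth = (1 + out['perf'].fillna(0)).groupby(out['type']).cumprod()
    out['price2'] = out.groupby('type')['price'].transform('first') * growth
    return out


# Same toy data as in the question, reshaped to long format
data = {'p1': [100., 101., 102., 100., 100.],
        'p2': [100., 99., 98., 100., 100.],
        'p3': [1000., 1000., 100., 1000., 1000.]}
wide = pd.DataFrame(data, index=pd.bdate_range(start='20100101', periods=5))
df = (wide.rename_axis('date')
          .reset_index()
          .melt(id_vars='date', var_name='type', value_name='price'))

result = replace_outlier_perf(df)
print(result)
```

Using `map` on the dates of the flagged rows avoids the positional `.values` assignment, so the replacement does not depend on row order. If every type is flagged on a given date, no median exists for that date and the return stays missing, which this sketch then treats as no change when rebuilding the price.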
【Comments】:
Tags: python pandas dataframe group-by