【Question title】: How do I replace outliers with groupby?
【Posted】: 2020-10-19 14:00:05
【Question】:

Hi, here is my (toy) data:

import pandas as pd

data = {'p1': [100., 101, 102, 100, 100],
        'p2': [100., 99., 98., 100., 100],
        'p3': [1000., 1000., 100., 1000., 1000]
        }
# Reshape from wide (one column per type) to long (one row per date/type pair)
df = (pd.DataFrame(data, index=pd.bdate_range(start='20100101', periods=5))
      .stack()
      .reset_index()
      .rename(columns={'level_0': 'date', 'level_1': 'type', 0: 'price'})
      .sort_values('date')
      )
# Per-type return; the result stays aligned with df's index
df['perf'] = df.groupby('type')['price'].pct_change()
df.sort_values('type')

It looks like this:

0   2010-01-01  p1  100.0   NaN
3   2010-01-04  p1  101.0   0.010000
6   2010-01-05  p1  102.0   0.009901
9   2010-01-06  p1  100.0   -0.019608
12  2010-01-07  p1  100.0   0.000000
1   2010-01-01  p2  100.0   NaN
4   2010-01-04  p2  99.0    -0.010000
7   2010-01-05  p2  98.0    -0.010101
10  2010-01-06  p2  100.0   0.020408
13  2010-01-07  p2  100.0   0.000000
2   2010-01-01  p3  1000.0  NaN
5   2010-01-04  p3  1000.0  0.000000
8   2010-01-05  p3  100.0   -0.900000  -> outlier
11  2010-01-06  p3  1000.0  9.000000   -> outlier
14  2010-01-07  p3  1000.0  0.000000

I want to replace these two values with the mean or median of the perf column computed without them. That is, with some earlier help, I compute:

# perf for each type
df['perf'] = df.groupby('type')['price'].pct_change()

# Flag outliers, then replace them with the per-date median of the non-outliers
outliers = df['perf'].abs() >= 0.5
df.loc[outliers, "perf"] = (df.loc[~outliers]
                            .groupby('date')['perf']
                            .median()
                            .loc[df.loc[outliers, "date"]]
                            .values
                            )

# New price: same initial value per type, compounded with the corrected perf
first_price = df.groupby('type')['price'].transform('first')
growth = (1 + df['perf']).groupby(df['type']).cumprod()
df['price2'] = first_price.mul(growth, fill_value=1)

df.sort_values('type')

But the end result is not very clean. Is there a way to improve my code, for example by wrapping it in a function?

【Comments】:

    Tags: python pandas dataframe group-by


    【Solution 1】:

    How about a direct .loc[] lookup on the DataFrame of per-date means?

    outliers = df.groupby('type')['price'].pct_change().abs() >= 0.5
    # numeric_only=True skips the string column 'type' when averaging
    df_mean = df[~outliers].groupby('date').mean(numeric_only=True)
    
    fill_values = df_mean.loc[df.loc[outliers, "date"], "perf"].values
    df.loc[outliers, "perf"] = fill_values  # assigned positionally to the outlier rows
    df.sort_values('type')
    Out[114]: 
             date type   price      perf
    0  2010-01-01   p1   100.0       NaN
    3  2010-01-04   p1   101.0  0.010000
    6  2010-01-05   p1   102.0  0.009901
    9  2010-01-06   p1   100.0 -0.019608
    12 2010-01-07   p1   100.0  0.000000
    1  2010-01-01   p2   100.0       NaN
    4  2010-01-04   p2    99.0 -0.010000
    7  2010-01-05   p2    98.0 -0.010101
    10 2010-01-06   p2   100.0  0.020408
    13 2010-01-07   p2   100.0  0.000000
    2  2010-01-01   p3  1000.0       NaN
    5  2010-01-04   p3  1000.0  0.000000
    8  2010-01-05   p3   100.0 -0.000100  <- replaced by mean
    11 2010-01-06   p3  1000.0  0.000400  <- replaced by mean
    14 2010-01-07   p3  1000.0  0.000000
    

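    Note that your per-date mean (df_mean) is already indexed by date, and there seems to be no way to avoid creating it. So you can simply look values up through its date index.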

    df_mean  
    Out[115]: 
                price    perf
    date                     
    2010-01-01  400.0     NaN
    2010-01-04  400.0  0.0000
    2010-01-05  100.0 -0.0001
    2010-01-06  100.0  0.0004
    2010-01-07  400.0  0.0000
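
    Since the question asks for a function, here is a minimal sketch of one way to package the per-date replacement. It is not the answer's exact code: instead of the .loc[] lookup it hides the outliers with mask() and uses groupby(...).transform(), which returns the per-date statistic already aligned with the rows. The function name replace_outliers, its parameters threshold and how, and the output column price2 (taken from the question) are illustrative assumptions.

```python
import pandas as pd


def replace_outliers(df, threshold=0.5, how="mean"):
    """Return a copy of df with outlier 'perf' values replaced by the
    per-date mean (or median) of the non-outlier values.

    Hypothetical helper: expects columns 'date', 'type' and 'price'.
    """
    # pct_change and cumprod follow row order, so sort within each type first
    out = df.sort_values(["type", "date"]).copy()

    # Return of each type relative to its previous date
    out["perf"] = out.groupby("type")["price"].pct_change()

    # NaN compares as False, so the first row of each type is never flagged
    outliers = out["perf"].abs() >= threshold

    # Hide the outliers, then compute the per-date statistic of what is left;
    # transform() gives one value per row, aligned with out's index
    clean = out["perf"].mask(outliers)
    per_date = clean.groupby(out["date"]).transform(how)
    out.loc[outliers, "perf"] = per_date[outliers]

    # Rebuild a price path from the corrected returns, starting at the
    # first price of each type (the leading NaN return counts as factor 1)
    first_price = out.groupby("type")["price"].transform("first")
    growth = (1 + out["perf"]).fillna(1.0).groupby(out["type"]).cumprod()
    out["price2"] = first_price * growth

    return out.sort_index()


# Toy data from the question
data = {'p1': [100., 101, 102, 100, 100],
        'p2': [100., 99., 98., 100., 100],
        'p3': [1000., 1000., 100., 1000., 1000]}
df = (pd.DataFrame(data, index=pd.bdate_range(start='20100101', periods=5))
      .stack()
      .reset_index()
      .rename(columns={'level_0': 'date', 'level_1': 'type', 0: 'price'}))

fixed = replace_outliers(df, threshold=0.5, how="mean")
print(fixed.sort_values("type"))
```

    Passing how="median" switches to the per-date median, as in the question. Because the replacement is assigned only where the outlier mask is True, the leading NaN of each type is left as it is.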
    

    【Comments】:

      【Solution 2】:

      This should work.

      # Filter for outliers
      outliers = df['perf'].abs() >= 0.5
      
      # Create DataFrame for the mean of each date
      dt_mean = df.groupby('date')['perf'].mean().to_frame().copy()
      
      # Reset index
      dt_mean.reset_index(inplace=True) 
      
      # Set outliers equal to merger of outliers and mean DataFrame
      df.loc[outliers,'perf'] = list(pd.merge(df.loc[outliers, ['date', 'type', 'price']],dt_mean, on='date')['perf'])
      
          date        type  price   perf
      0   2010-01-01  p1  100.0   NaN
      1   2010-01-01  p2  100.0   NaN
      2   2010-01-01  p3  1000.0  NaN
      3   2010-01-04  p1  101.0   0.010000
      4   2010-01-04  p2  99.0    -0.010000
      5   2010-01-04  p3  1000.0  0.000000
      6   2010-01-05  p1  102.0   0.009901
      7   2010-01-05  p2  98.0    -0.010101
      8   2010-01-05  p3  100.0   -0.300067
      9   2010-01-06  p1  100.0   -0.019608
      10  2010-01-06  p2  100.0   0.020408
      11  2010-01-06  p3  1000.0  3.000267
      12  2010-01-07  p1  100.0   0.000000
      13  2010-01-07  p2  100.0   0.000000
      14  2010-01-07  p3  1000.0  0.000000
      

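      【Comments】:

      • This is not what I need. I want to replace the outliers with the mean of each date computed without them, which is why I use df[~outliers].groupby('date').mean(). I want to replace these values in my data (the groupby in my code matters).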
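
      The difference the comment points at can be checked on a single date. The sketch below takes the three perf values for 2010-01-05 from the question's table and compares the mean over all of them with the mean over the non-outliers only; the variable names are illustrative.

```python
import pandas as pd

# perf on 2010-01-05 for p1, p2, p3, copied from the question's table
perf = pd.Series([0.009901, -0.010101, -0.9], index=["p1", "p2", "p3"])
is_outlier = perf.abs() >= 0.5

mean_with_outlier = perf.mean()                  # what Solution 2 fills in
mean_without_outlier = perf[~is_outlier].mean()  # what the question asks for

print(mean_with_outlier, mean_without_outlier)
```

      The first value is roughly -0.300067 (the number shown in Solution 2's output) and the second roughly -0.0001 (the number shown in Solution 1's output), so the outlier itself dominates the mean unless it is excluded first.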