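There isn't one way to remove outliers, there are dozens.
Given the linear relationship between x and y, I'd say the best approach is to plot the data first and then make a reasoned decision about how to remove the outliers.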
Here the relationship is clearly linear. I used a robust linear regression, scipy.stats.siegelslopes, to get a robust fitted line.
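I plotted two outlier removal methods: ±10% of the fitted slope (left) and ±10 times the median difference from the fitted line (right). By comparison, the (valid) method proposed by @MichaelSzczesny is equivalent to the right-hand method with a threshold of about 15 (I used 6).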
# df is the DataFrame from the question, with columns 'x' and 'y'
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import siegelslopes
f, (ax1, ax2) = plt.subplots(ncols=2)
xs = np.arange(0, 50)
slope, intercept = siegelslopes(df['y'], df['x'])
ax1.plot(xs, slope*xs+intercept, ls='--')
ax2.plot(xs, slope*xs+intercept, ls='--')
### variation of slope
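# keep points whose ratio to the fitted value is within a factor of 10**0.1 (about 26%) either way
# .abs() makes the test two-sided, so points far below the line are flagged too
df1 = df[np.log10(df['y']/(df['x']*slope+intercept)).abs().lt(0.1)]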
df1.plot.scatter('x', 'y', c='k', ax=ax1)
# plot ± 10%
ax1.plot(xs, slope*1.1*xs+intercept, c='grey', ls=':')
ax1.plot(xs, slope*0.9*xs+intercept, c='grey', ls=':')
# plot outliers
df.drop(df1.index).plot.scatter('x', 'y', c='r', ax=ax1)
ax1.set_ylim(ymin=0)
ax1.set_xlim(xmin=0)
### keep points within ± 10 * median absolute distance from the fitted line
d = abs(df['y']-(df['x']*slope+intercept))
thresh = d.median()*10
df1 = df[d.lt(thresh)]
df1.plot.scatter('x', 'y', c='k', ax=ax2)
# plot ± threshold
ax2.plot(xs, slope*xs+intercept+thresh, c='grey', ls=':')
ax2.plot(xs, slope*xs+intercept-thresh, c='grey', ls=':')
# plot outliers
df.drop(df1.index).plot.scatter('x', 'y', c='r', ax=ax2)
ax2.set_ylim(ymin=0)
ax2.set_xlim(xmin=0)
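
The snippet above relies on the question's df, so it won't run on its own. Below is a minimal self-contained sketch of the second method (distance from the fitted line compared to a multiple of the median distance), without the plotting. The function name drop_outliers, its factor parameter and the synthetic demo data are my own placeholders for illustration, not part of the question or of scipy.

```python
import numpy as np
import pandas as pd
from scipy.stats import siegelslopes


def drop_outliers(df, factor=10):
    """Keep rows closer to a Siegel fit than factor * median distance.

    Hypothetical helper: the name and the `factor` default are assumptions.
    """
    # Robust line fit; siegelslopes takes y first, then x
    slope, intercept = siegelslopes(df['y'], df['x'])
    # Absolute vertical distance of each point from the fitted line
    dist = (df['y'] - (df['x'] * slope + intercept)).abs()
    thresh = dist.median() * factor
    return df[dist.lt(thresh)]


# Synthetic stand-in for the question's data: y ≈ 2x + 1 plus small noise
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)
# Inject three obvious outliers at known positions
y[[5, 20, 40]] += [30, -25, 40]
demo = pd.DataFrame({'x': x, 'y': y})

clean = drop_outliers(demo)
# Index labels of the rows that were dropped
removed = demo.index.difference(clean.index)
print(list(removed))
```

The factor is the knob to tune after looking at the plot: a smaller value removes more points, a larger one keeps more.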