How to find where two columns have a larger difference / are outliers (Python): answers

【Question title】: How to find where two columns have bigger difference/ are outliers python
【Posted】: 2021-11-23 05:39:28
【Question】:

I have these two arrays (two random example arrays I created):

x = [5,12,24,44,22,32,22]
y = [8,14,26,47,44,35,23]

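The two columns are correlated, and x[4] and y[4] are the outliers in this data.

How would I loop through the dataframe and return the column or column number that contains the outlier?

Edit: Apologies. Here is the dataframe: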

df = pd.DataFrame({'x':x, 'y':y})

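【Comments】:

In this case your "outliers" are just the maximum value in each list. Is that what you mean? Maybe it is more complex than that. Also, if you really need a dataframe, you will have to load these lists into one.
You mention a DataFrame but you are showing lists; please provide the real input. What is the criterion for flagging a column as an outlier? Finally, can you provide the expected output?
@BrutusForcus - Neither of them is the maximum. 22 looks close to the mean of that column, which makes this an unusual outlier.
@BrutusForcus At x[4] we can see that it is abnormally different from y[4]. So that is the column I want as output.
How do you quantify "abnormally"?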

Tags: python arrays pandas numpy outliers

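【Solution 1】:

There is not one way to remove outliers, but dozens.

I would say that, given the linear relationship between x and y, it is best to plot the data first and then decide rationally how to remove the outliers.

The relationship here is clearly linear. I used a robust linear regression with scipy.stats.siegelslopes to get a robust fit line.

I plotted two outlier-removal methods: ±10% of the fitted slope (left panel) and ±10 times the median difference from the fitted line (right panel). In comparison, the (valid) method proposed by @MichaelSzczesny is equivalent to the method on the right with a threshold of about 15 (I used 6).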

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import siegelslopes

f, (ax1, ax2) = plt.subplots(ncols=2)

xs = np.arange(0, 50)
slope, intercept = siegelslopes(df['y'], df['x'])
ax1.plot(xs, slope*xs+intercept, ls='--')
ax2.plot(xs, slope*xs+intercept, ls='--')

### variation of slope

# keep points whose log10(y / fitted y) is below 0.1 (y less than about 26% above the fit)
df1 = df[np.log10(df['y']/(df['x']*slope+intercept)).lt(0.1)]
df1.plot.scatter('x', 'y', c='k', ax=ax1)

# plot ± 10%
ax1.plot(xs, slope*1.1*xs+intercept, c='grey', ls=':')
ax1.plot(xs, slope*0.9*xs+intercept, c='grey', ls=':')

# plot outliers
df.drop(df1.index).plot.scatter('x', 'y', c='r', ax=ax1)

# bottom/left replace the ymin/xmin keywords, which newer matplotlib no longer accepts
ax1.set_ylim(bottom=0)
ax1.set_xlim(left=0)

### keep points with intercept variation ± 10 * median x-y difference

d = abs(df['y']-(df['x']*slope+intercept))
thresh = d.median()*10
df1 = df[d.lt(thresh)]
df1.plot.scatter('x', 'y', c='k', ax=ax2)

# plot ± threshold
ax2.plot(xs, slope*xs+intercept+thresh, c='grey', ls=':')
ax2.plot(xs, slope*xs+intercept-thresh, c='grey', ls=':')

# plot outliers
df.drop(df1.index).plot.scatter('x', 'y', c='r', ax=ax2)

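# bottom/left replace the ymin/xmin keywords, which newer matplotlib no longer accepts
ax2.set_ylim(bottom=0)
ax2.set_xlim(left=0)

plt.show()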
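
The plot shows which points fall outside the bands, but the question asks for the position of the outlier. A minimal sketch of the right-hand method without plotting, assuming the same sample data: `find_outlier_rows` and its `factor` parameter are hypothetical names introduced here for illustration, not part of the original answer.

```python
import pandas as pd
from scipy.stats import siegelslopes

# Sample data from the question
df = pd.DataFrame({'x': [5, 12, 24, 44, 22, 32, 22],
                   'y': [8, 14, 26, 47, 44, 35, 23]})


def find_outlier_rows(df, factor=10):
    """Return index labels of rows far from a robust linear fit.

    Hypothetical helper: "far" means an absolute residual of at least
    `factor` times the median absolute residual.
    """
    # Robust fit of y against x (Siegel repeated medians)
    slope, intercept = siegelslopes(df['y'], df['x'])
    # Absolute vertical distance of each point from the fitted line
    resid = (df['y'] - (df['x'] * slope + intercept)).abs()
    thresh = resid.median() * factor
    # Complement of the points kept with resid.lt(thresh)
    return df.index[resid.ge(thresh)].tolist()


outlier_rows = find_outlier_rows(df)
print(outlier_rows)
print(df.loc[outlier_rows])
```

With the sample data this should flag only row 4 (x = 22, y = 44). The threshold scales with the median residual, so if more than half of the points lie exactly on the fitted line the threshold becomes zero and every row is flagged; a fixed minimum threshold may be needed in that case.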

【Comments】:

【Solution 2】:

Maybe this is too simplistic, but it seems to fulfil the brief:

x = [5,12,24,44,22,32,22]
y = [8,14,26,47,44,35,23]
d = [abs(_x - _y) for _x, _y in zip(x, y)]
i = d.index(max(d))
print(x[i], y[i])
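
The snippet above works on plain lists. For the DataFrame in the question, the same idea might be sketched as follows, assuming the columns are named 'x' and 'y' as in the question's edit; this is an illustrative variant, not part of the original answer.

```python
import pandas as pd

x = [5, 12, 24, 44, 22, 32, 22]
y = [8, 14, 26, 47, 44, 35, 23]
df = pd.DataFrame({'x': x, 'y': y})

# Absolute difference between the two columns, row by row
diff = (df['x'] - df['y']).abs()

# Index label of the row with the largest difference
i = diff.idxmax()
print(i, df.loc[i, 'x'], df.loc[i, 'y'])
```

Like the list version, this always returns exactly one row, even when no row is unusual, so it suits cases where a single outlier is known to exist.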

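【Comments】:

What does the underscore do? When I use it, it says the column cannot be found.
Try copying and pasting the code. The leading underscore has no special meaning, although my personal preference is to use it for temporary variables (outside of a class context, where it is conventionally used to mark "private" members).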