How can I search for anomalies in each column in a Dataframe?

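【Question title】: How can I search for anomalies in each column in a Dataframe?
【Posted】: 2020-12-30 00:38:33
【Question description】:

I have a dataframe, and my goal is to find the anomalies in each column separately. So I am looking for univariate anomalies.

Let's assume this is my dataframe (with numpy imported as np and pandas as pd):

df = pd.DataFrame(np.random.rand(100, 6), columns=['A', 'B', 'C', 'D', 'E', 'F'])

I have two questions:

1. Which algorithms are suitable for this goal? For example, Isolation Forest?
2. How can I run an algorithm (e.g. Isolation Forest) on all columns at once instead of doing it column by column? Can I use a for loop?

Thanks for your help!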

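【Question comments】:

Does this answer your question? Apply function on each column in a pandas dataframe
Not really. How can I use df.apply(function, axis=0) for anomaly detection?
You have to define a function that detects anomalies in a pd.Series (i.e. one column), then use df.apply to run it on each column.
That is beyond my knowledge. How do I define a function that detects anomalies?
One simple thing you can do is find the values that are more than 1.5 or 2 standard deviations away from the mean. This is usually called outlier detection.

Tags: python dataframe for-loop anomaly-detection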
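
The suggestion in these comments can be sketched roughly as follows. This is only a minimal illustration: the function name flag_outliers, the default threshold of 2 standard deviations, and the planted value 10.0 are assumptions made for the example, not part of the original thread.

```python
import numpy as np
import pandas as pd


def flag_outliers(col: pd.Series, n_std: float = 2.0) -> pd.Series:
    """Return True where a value lies more than n_std standard
    deviations away from the mean of its own column."""
    z = (col - col.mean()) / col.std()
    return z.abs() > n_std


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 6)), columns=list("ABCDEF"))
df.loc[0, "A"] = 10.0  # planted outlier (assumption) so something gets flagged

# axis=0 hands each column to flag_outliers as a pd.Series,
# so no explicit for loop over the columns is needed
mask = df.apply(flag_outliers, axis=0)
print(mask.sum())  # number of flagged values per column
```

The result has the same shape as df, so df[mask] shows only the flagged values and leaves NaN everywhere else.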

【Solution 1】:

Q2: For example:

import pandas as pd

df = pd.DataFrame({"bytes": [1, 2, 3, 4, 5], "flow": [1, 2, 3, 4, 5], "userid": [1, 2, 3, 4, 5]})

def get_anomaly(row):
    # your algorithm goes here; this placeholder rule labels each row
    if row["bytes"] < 3 and row["flow"] < 3:
        return -1
    elif row["bytes"] > 3 and row["flow"] > 3:
        return 1
    else:
        return 0

# axis=1 passes one row at a time to get_anomaly
df['is_anomaly'] = df.apply(get_anomaly, axis=1)

>>> df
   bytes  flow  userid  is_anomaly
0      1     1       1          -1
1      2     2       2          -1
2      3     3       3           0
3      4     4       4           1
4      5     5       5           1

Now we can talk about Q1.

Level 0: linear relationships or other rules of thumb

Box plot: low outliers < Q1 - 1.5ΔQ <= normal data <= Q3 + 1.5ΔQ < high outliers, where ΔQ = Q3 - Q1 is the interquartile range.
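
As a rough sketch, the box-plot rule can be applied to every column with df.apply. The function name iqr_outliers, the parameter k, and the sample data are assumptions made for illustration.

```python
import pandas as pd


def iqr_outliers(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for one column."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)


# Made-up data: column A contains one obvious outlier, column B none
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 100.0],
                   "B": [5.0, 6.0, 7.0, 8.0, 9.0]})

flags = df.apply(iqr_outliers)  # default axis=0: one column at a time
print(flags)
```

Here only the value 100.0 in column A falls outside the fences, since Q1 = 2 and Q3 = 4 give the range [-1, 7].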

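Scott's rule: bin width Δb = 3.5σ / n^(1/3), where σ is the standard deviation and n is the number of samples. Split the data into bins of this width and compute statistics on the resulting distribution.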
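
One possible reading of this, sketched with NumPy: compute the bin width, build a histogram, and look at the sparsely populated bins. The 1% sparsity cut-off and the normally distributed sample are assumptions for illustration only; the answer does not say which statistics to compute.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # made-up sample data

# Scott's rule: bin width = 3.5 * sigma / n^(1/3)
width = 3.5 * x.std() / len(x) ** (1 / 3)
n_bins = int(np.ceil((x.max() - x.min()) / width))
counts, edges = np.histogram(x, bins=n_bins)

# Assumption: bins holding fewer than 1% of the points are "sparse"
sparse = counts < 0.01 * len(x)

# Map every value to its bin (the maximum is clipped into the last bin)
bin_idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
suspicious = x[sparse[bin_idx]]
print(f"bin width {width:.3f}, {n_bins} bins, {suspicious.size} values in sparse bins")
```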

Other summary statistics: mean, standard deviation, and so on.

Level 1: statistical algorithms

Some good algorithms:
CMP
https://www.sciencedirect.com/science/article/abs/pii/S1389128616301633

Beehive
https://nds2.ccs.neu.edu/papers/Beehive.pdf

CBLOF
https://www.goldiges.de/publications/Anomaly_Detection_Algorithms_for_RapidMiner.pdf

There are also AR, MA, and ARMA models, which I don't know much about.

Level 2: unsupervised learning

K-means and so on (there are actually quite a lot of these).
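
Isolation Forest, which the question mentions, also belongs to this group. Below is a hedged sketch of running it on each column separately via df.apply. It assumes scikit-learn is installed; the function name isolation_forest_flags and the contamination value of 0.05 are illustrative choices, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def isolation_forest_flags(col: pd.Series) -> pd.Series:
    """Fit a separate Isolation Forest on one column and
    return True for the points it labels as outliers."""
    # contamination=0.05 is an assumed share of outliers
    model = IsolationForest(contamination=0.05, random_state=0)
    # scikit-learn expects a 2-D array, hence the reshape to (n, 1)
    labels = model.fit_predict(col.to_numpy().reshape(-1, 1))
    return pd.Series(labels == -1, index=col.index)  # -1 means outlier


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 6)), columns=list("ABCDEF"))

flags = df.apply(isolation_forest_flags)  # one model per column
print(flags.sum())  # flagged values per column
```

A plain for loop over df.columns would work just as well; df.apply simply hides the loop.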

Level 3: supervised learning

From the Elasticsearch documentation:

EWMA
s_2 = α*x_2 + (1-α)*s_1

Holt-Linear
s_2 = α*x_2 + (1-α)*(s_1 + t_1)
t_2 = β*(s_2 - s_1) + (1-β)*t_1

Holt-Winters
s_i = α(x_i - p_{i-k}) + (1-α)(s_{i-1} + t_{i-1})
t_i = β(s_i - s_{i-1}) + (1-β)t_{i-1}
p_i = γ(x_i - s_i) + (1-γ)p_{i-k}
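
To make the first two recursions concrete, here is a small sketch. The EWMA part uses pandas' Series.ewm with adjust=False, which applies the same recursion starting from the first observation. The Holt-Linear part is written out by hand, and its initialisation (level = first value, trend = 0) is an assumption. Comparing each point with the previous smoothed value is one simple way to spot a spike, not the only one.

```python
import numpy as np
import pandas as pd


def holt_linear(x, alpha=0.5, beta=0.5):
    """Level s and trend t from the Holt-Linear recursions."""
    s = np.empty(len(x))
    t = np.empty(len(x))
    s[0], t[0] = x[0], 0.0  # assumed initialisation
    for i in range(1, len(x)):
        s[i] = alpha * x[i] + (1 - alpha) * (s[i - 1] + t[i - 1])
        t[i] = beta * (s[i] - s[i - 1]) + (1 - beta) * t[i - 1]
    return s, t


x = pd.Series([1.0, 2.0, 3.0, 4.0, 50.0, 6.0, 7.0])  # made-up data, spike at index 4

# EWMA: s_i = alpha*x_i + (1-alpha)*s_{i-1}, with s_0 = x_0
ewma = x.ewm(alpha=0.5, adjust=False).mean()

# Distance of each point from the previous smoothed value
residual = (x - ewma.shift(1)).abs()
suspect = residual.idxmax()  # index of the largest deviation

level, trend = holt_linear(x.to_numpy())
print(suspect, level[:3], trend[:3])
```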

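From machine learning:
CNN, RNN, LSTM, PrefixSpan, AutoML, Bayes, and so on. (There are a few scenarios where you can use these.)

There is too much left unlisted: too many algorithms to use, too many that could fit, and too many details to write down. A UEBA mindset is important when analyzing anomalies.

【Comments】:

Thanks for your suggestions. I'll look into those. However, question 2 is still unanswered.
@Minfetli Updated.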