Can you tell which column caused the anomaly when detecting anomalies? (Answers)

【Question title】: Can it be learned which column caused the anomaly in detecting anomaly?
【Posted】: 2021-04-22 19:18:09
【Question description】:

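I am trying to use semi-supervised machine learning to find anomalies in test data. Suppose we have the data below. It is unlabeled, and it is the training data for anomaly detection. All values here are normal (it contains no outliers).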

column1   column2   column3   column4   column5   column6   column7   column8   
10        15        35        20        41        78        32        45
74        41        45        41        42        32        31        41
15        10        12        11        12        13        14        12

And the test data:

column1   column2   column3   column4   column5   column6   column7   column8   
1800      15        35        20        41        78        32        45
74        41        45        41        42        32        31        41
15        10        12        11        12        13        14        12

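The model might say that the first row is anomalous. Now consider this in a dataset with many columns. Is there any way to get a printout like the following?

The abnormal condition is in the first line. And its value in the first row of the column named column1 is the cause of the abnormal condition.

【Question discussion】: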

If you build a univariate (per-column) anomaly detector, this falls out naturally.
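A minimal sketch of what such a per-column detector could look like, using the sample data from the question. The function name `find_anomalous_cells` and the z-score threshold of 3 are assumptions made for illustration and are not part of the original thread. With only three training rows the per-column statistics are very rough, so treat this as a demonstration of the idea rather than a tuned detector.

```python
import numpy as np
import pandas as pd

cols = [f"column{i}" for i in range(1, 9)]
train_df = pd.DataFrame(
    [[10, 15, 35, 20, 41, 78, 32, 45],
     [74, 41, 45, 41, 42, 32, 31, 41],
     [15, 10, 12, 11, 12, 13, 14, 12]],
    columns=cols,
)
test_df = pd.DataFrame(
    [[1800, 15, 35, 20, 41, 78, 32, 45],
     [74, 41, 45, 41, 42, 32, 31, 41],
     [15, 10, 12, 11, 12, 13, 14, 12]],
    columns=cols,
)


def find_anomalous_cells(train, test, threshold=3.0):
    """Return (row_label, column_name) pairs whose per-column z-score exceeds threshold.

    Hypothetical helper: each column is scored against the mean and sample
    standard deviation of the same column in the training data.
    """
    mean = train.mean()   # one mean per column
    std = train.std()     # one sample std per column (a constant column gives 0)
    z = (test - mean) / std
    mask = (z.abs() > threshold).to_numpy()
    row_pos, col_pos = np.where(mask)
    return [(test.index[r], test.columns[c]) for r, c in zip(row_pos, col_pos)]


anomalies = find_anomalous_cells(train_df, test_df)
for row, col in anomalies:
    # row + 1 assumes the default 0-based integer index
    print(f"Row {row + 1}: the value {test_df.at[row, col]} in {col} looks anomalous.")
```

Because every cell is scored against its own column, the column name comes for free with each flagged value, which is the point of the comment above.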

Tags: python pandas dataframe anomaly-detection

【Solution 1】:

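I don't know whether you need a model to do this, or whether you already have an algorithm to train. In any case, if all values are numeric and you assume a normal distribution, you can use the 3-sigma rule: everything within mean +/- 3*sigma (standard deviation) should cover about 99.7% of your data. So if a number falls outside that range, it may be an anomaly. The other thing I don't understand is your distinction between "line" and "row"; I assume they are the same.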
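As a quick sanity check of the 99.7% figure, the coverage of the +/- 3 sigma interval can be computed from the standard normal CDF. This sketch uses only Python's standard library; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1)
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass between -3 sigma and +3 sigma
coverage = std_normal.cdf(3) - std_normal.cdf(-3)
print(f"{coverage:.4f}")  # prints 0.9973
```

The figure only holds if the data are actually close to normally distributed; for skewed or heavy-tailed data the same limits will flag more or fewer points.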

This is what I came up with (I'm new to this, so there may be better ways):

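# Global mean and standard deviation over every value in the training data
mean = train_df.stack().mean()
std = train_df.stack().std()

# 3-sigma limits: values outside [inflim, suplim] are treated as anomalies
inflim = mean - 3*std
suplim = mean + 3*std

columns = test_df.columns.tolist()

# Turn an integer into an ordinal string: 1 -> "1st", 2 -> "2nd", 11 -> "11th"
ordinal = lambda n: "%d%s" % (n,"tsnrhtdd"[(n//10%10!=1)*(n%10<4)*n%10::4])

for column in columns:
    # Index labels of the rows whose value in this column is out of range
    list1 = test_df[(test_df[column] < inflim) |
                    (test_df[column] > suplim)].index.tolist()
    for item in list1:
        # item + 1 assumes the default 0-based integer index
        print("The abnormal condition is in the {} line, ".format(ordinal(item + 1)) +
              "and its value in the {} row ".format(ordinal(item + 1)) +
              "of the column named {} ".format(column) +
              "is the cause of the abnormal condition.")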

【Discussion】: