For loop implementation with groupby: answers

【Question title】: For loop implementation with groupby
【Posted】: 2021-03-03 08:13:27
【Question】:

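I am doing anomaly/outlier detection on a dataset of entrance sensor data from a shopping mall that has several entrances. I have been able to test some anomaly-detection methods on a single entrance in isolation, but I am struggling to apply them to all entrances.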

Here is a snippet of the data:

import pandas as pd
import numpy as np
df = pd.DataFrame({"mall": ["Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1"],
                     "entrance": ["West", "West","West","West","West", "West", "East", "East", "East", "East", "East", "East"],
                     "in": [132, 140, 163, 142, 133, 150, 240, 250, 233, 234, 2000, 222]})

One method I am using is the Z-score method, which flags outliers based on how many standard deviations an observation lies from the mean.

##Z-SCORE
out=[]
def Zscore_outlier(df):
    m = np.mean(df)
    sd = np.std(df)
    for i in df: 
        z = (i-m)/sd
        if np.abs(z) > 3: #(1=68.3%, 2=95.4%, 3=99.73%, 4=99.99%)
            out.append(i)
    print("Outliers:",out)
Zscore_outlier(df['in'])

#find rows of outliers
print(df[df['in'].isin(out)])

#count outliers
len(out)

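I want to run it on every entrance in one go, getting the Z-score output for all entrances in a single loop. I moved the function outside the for loop and call it inside the loop instead, and I use groupby on the entrance column. The output only gives me the last entrance of the loop, so entrance "East" appears twice. Here is my code: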

def Zscore_outlier(df):
    out=[]
    m = np.mean(df)
    sd = np.std(df)
    for i in df: 
        z = (i-m)/sd
        if np.abs(z) > 3: #(1=68.3%, 2=95.4%, 3=99.73%, 4=99.99%)
            out.append(i)
    print("Outliers:",out)

by_label = df.groupby('entrance')
    
for name, group in by_label:
    Zscore_outlier(df['in'])

    #find rows of outliers
    print(df[df['in'].isin(out)])

    #count outliers
    len(out)

OUTPUT:
Outliers: [2000]
     mall entrance    in
10  Mall1     East  2000
Outliers: [2000]
     mall entrance    in
10  Mall1     East  2000

【Comments】:

Try using: pandas.pydata.org/pandas-docs/stable/reference/api/… to compute the mean and standard deviation of each group. That lets you avoid the loop entirely.
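
The link in the comment is truncated, so which function the commenter meant is unknown. As a minimal sketch of the idea, assuming `GroupBy.agg` is acceptable, the per-entrance mean and standard deviation can be computed without a loop (the `stats` name is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "entrance": ["West"] * 6 + ["East"] * 6,
    "in": [132, 140, 163, 142, 133, 150, 240, 250, 233, 234, 2000, 222],
})

# One row per entrance with its mean and standard deviation.
# Note: pandas' "std" uses ddof=1 (sample std); np.std defaults to ddof=0.
stats = df.groupby("entrance")["in"].agg(["mean", "std"])
print(stats)
```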

Tags: python pandas numpy for-loop

【Solution 1】:

import pandas as pd
df = pd.DataFrame({"mall": ["Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1"],
                     "entrance": ["West", "West","West","West","West", "West", "East", "East", "East", "East", "East", "East"],
                     "in": [132, 140, 163, 142, 133, 150, 240, 250, 233, 234, 2000, 222]})

# solution itself: per-entrance mean and std, broadcast back to every row
df['group_mean'] = df.groupby('entrance')['in'].transform('mean')
# 'std' is the sample standard deviation (ddof=1), unlike np.std's default ddof=0
df['group_std'] = df.groupby('entrance')['in'].transform('std')
df['z'] = (df['in'] - df['group_mean']) / df['group_std']
# I've taken 2, as with 3 it is not an outlier
df['outlier'] = df['z'].abs() > 2

# rows flagged as outliers
df[df['outlier']]
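
If the explicit loop from the question is preferred, here is a minimal sketch of one way to make it work per entrance. The original loop called the function on the full `df['in']` rather than on `group['in']`, and `out` was local to the function, so every iteration printed the same result. The names `zscore_outliers`, `threshold`, and `results` are hypothetical, introduced only for illustration; the threshold of 2 follows the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "mall": ["Mall1"] * 12,
    "entrance": ["West"] * 6 + ["East"] * 6,
    "in": [132, 140, 163, 142, 133, 150, 240, 250, 233, 234, 2000, 222],
})

def zscore_outliers(values, threshold=3.0):
    """Return the entries of `values` whose |z-score| exceeds `threshold`."""
    z = (values - values.mean()) / values.std(ddof=0)  # ddof=0 matches np.std
    return values[z.abs() > threshold]

results = {}
for name, group in df.groupby("entrance"):
    # Pass the group's own column, not the full df["in"].
    outliers = zscore_outliers(group["in"], threshold=2.0)
    results[name] = outliers
    print(name, "outliers:", outliers.tolist())
    # The returned Series keeps the original index, so it can select the rows.
    print(df.loc[outliers.index])
```

With only six readings per entrance, |z| computed with the population standard deviation cannot exceed sqrt(6 - 1) ≈ 2.24, so a threshold of 3 can never flag anything at this sample size. That is why 2 is used here.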

【Comments】: