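【Posted】: 2021-03-03 08:13:27
【Question】:
I am doing anomaly/outlier detection on a dataset of sensor data from shopping mall entrances, and each mall has several entrances. I have been able to test some anomaly detection methods on a single entrance, but I am struggling to apply them to all entrances.
Here is a snippet of the data: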
import pandas as pd
import numpy as np

df = pd.DataFrame({"mall": ["Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1", "Mall1"],
                   "entrance": ["West", "West", "West", "West", "West", "West", "East", "East", "East", "East", "East", "East"],
                   "in": [132, 140, 163, 142, 133, 150, 240, 250, 233, 234, 2000, 222]})
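One method I am using is the Z-score method, which flags an observation as an outlier based on how many standard deviations it lies from the mean.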
## Z-SCORE
out = []

def Zscore_outlier(df):
    m = np.mean(df)
    sd = np.std(df)
    for i in df:
        z = (i - m) / sd
        if np.abs(z) > 3:  # (1=68.3%, 2=95.4%, 3=99.73%, 4=99.99%)
            out.append(i)
    print("Outliers:", out)

Zscore_outlier(df['in'])

# find rows of outliers
print(df[df['in'].isin(out)])

# count outliers
len(out)
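For comparison, the same check on the whole column can be written without the explicit loop or the global `out` list. This is only a sketch: it rebuilds the sample data from the question and passes `ddof=0` so that the standard deviation matches what `np.std` computes by default. The names `z` and `outlier_rows` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "mall": ["Mall1"] * 12,
    "entrance": ["West"] * 6 + ["East"] * 6,
    "in": [132, 140, 163, 142, 133, 150, 240, 250, 233, 234, 2000, 222],
})

# Z-score of every row relative to the whole column.
# ddof=0 gives the population standard deviation, the same default as np.std.
z = (df["in"] - df["in"].mean()) / df["in"].std(ddof=0)

# Keep the rows whose absolute z-score exceeds 3.
outlier_rows = df[z.abs() > 3]

print(outlier_rows)
print("Number of outliers:", len(outlier_rows))
```

The boolean mask selects the outlier rows directly, so no separate `isin` lookup is needed.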
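I want to run it on every entrance in one pass, collecting the Z-score output for all entrances in a single loop. I defined the function outside the for loop and call it inside the loop, grouping on the entrance column with groupby. The output only shows the last entrance in the loop, so entrance "East" appears twice. Here is my code: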
def Zscore_outlier(df):
    out = []
    m = np.mean(df)
    sd = np.std(df)
    for i in df:
        z = (i - m) / sd
        if np.abs(z) > 3:  # (1=68.3%, 2=95.4%, 3=99.73%, 4=99.99%)
            out.append(i)
    print("Outliers:", out)

by_label = df.groupby('entrance')

for name, group in by_label:
    Zscore_outlier(df['in'])
    # find rows of outliers
    print(df[df['in'].isin(out)])
    # count outliers
    len(out)
OUTPUT:
Outliers: [2000]
     mall entrance    in
10  Mall1     East  2000
Outliers: [2000]
     mall entrance    in
10  Mall1     East  2000
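The repeated output is consistent with two details of the loop: each iteration passes `df['in']` (the whole column) instead of `group['in']`, and the `out` list created inside the function is local, so `df['in'].isin(out)` presumably reads a global `out` left over from the earlier snippet. Below is a minimal sketch of a loop that handles each entrance separately. The function name `zscore_outliers`, its `threshold` parameter, and the `results` dictionary are illustrative and not part of the original code. One caveat: with only six readings per entrance and the population standard deviation, an absolute z-score cannot exceed sqrt(6 - 1) ≈ 2.24, so a threshold of 3 can never fire on this sample. The sketch uses 2 purely to show the mechanics.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mall": ["Mall1"] * 12,
    "entrance": ["West"] * 6 + ["East"] * 6,
    "in": [132, 140, 163, 142, 133, 150, 240, 250, 233, 234, 2000, 222],
})

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    arr = np.asarray(values, dtype=float)
    z = (arr - arr.mean()) / arr.std()  # population std (ddof=0)
    return arr[np.abs(z) > threshold].tolist()

results = {}
for name, group in df.groupby("entrance"):
    # Use only this entrance's readings, not the whole column.
    out = zscore_outliers(group["in"], threshold=2.0)  # 2.0 is illustrative
    results[name] = group[group["in"].isin(out)]
    print(name, "outliers:", out)
    print(results[name])
```

Returning the list from the function, rather than appending to a shared variable, keeps each entrance's result independent of the others.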
【Comments】:
-
Try using pandas.pydata.org/pandas-docs/stable/reference/api/… to compute the mean and standard deviation for each group. That lets you drop the loop entirely.
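The link in the comment is truncated, so the exact API it points to is unknown. One possible reading, assuming it means `groupby(...).transform`, is sketched below: the per-entrance mean and standard deviation are broadcast back onto every row, and the z-score is then computed for the whole frame at once. The column name `z` and the threshold of 2 are assumptions made for illustration. With six rows per entrance, a threshold of 3 is unreachable when using the population standard deviation.

```python
import pandas as pd

df = pd.DataFrame({
    "mall": ["Mall1"] * 12,
    "entrance": ["West"] * 6 + ["East"] * 6,
    "in": [132, 140, 163, 142, 133, 150, 240, 250, 233, 234, 2000, 222],
})

grouped = df.groupby("entrance")["in"]

# transform returns a Series aligned with df, one value per original row.
group_mean = grouped.transform("mean")
group_std = grouped.transform(lambda s: s.std(ddof=0))  # population std, like np.std

# Per-entrance z-score for every row, stored in a hypothetical column "z".
df_z = df.assign(z=(df["in"] - group_mean) / group_std)

threshold = 2.0  # illustrative; tune for real data volumes
outliers = df_z[df_z["z"].abs() > threshold]

print(outliers)
print(outliers.groupby("entrance").size())  # outlier count per entrance
```

Because the result keeps the `entrance` column, the flagged rows and their counts can be read per entrance without any explicit loop.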
Tags: python pandas numpy for-loop