Is there a function that can remove outliers? Answers

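【Question title】: Is there a function that can remove outliers?
【Posted】: 2019-12-01 08:06:54
【Question】:

I found a function that detects outliers in a column, but I don't know how to remove them.

Is there a function that excludes or removes outliers from a column?

Here is the function that detects the outliers; what I need is a function that removes them.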

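import numpy as np
import pandas as pd

def detect_outlier(data_1, threshold=3):
    # Collect results in a local list so repeated calls do not accumulate values
    outliers = []
    mean_1 = np.mean(data_1)
    std_1 = np.std(data_1)

    for y in data_1:
        # z-score: distance from the mean in units of standard deviation
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > threshold:
            outliers.append(y)
    return outliers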

And here the outliers are printed:

#printing the outlier 
outlier_datapoints = detect_outlier(df['Pre_TOTAL_PURCHASE_ADJ'])
print(outlier_datapoints)

【Comments】:

Tags: python pandas outliers

【Answer 1】:

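import pandas as pd

def outlier():
    # The path is elided in the original answer
    df1 = pd.read_csv("......\\train.csv")
    # return_type='both' yields the axes and a dict of matplotlib line artists
    _, bp = df1.boxplot(return_type='both')
    # Each "flier" artist holds the points drawn beyond the whiskers of one column
    outliers = [flier.get_ydata() for flier in bp["fliers"]]
    out_liers = [i.tolist() for i in outliers]
    return out_liers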

【Comments】:

【Answer 2】:

A simple solution is to use scipy.stats.zscore

from scipy.stats import zscore
# calculates z-score values
df["zscore"] = zscore(df["Pre_TOTAL_PURCHASE_ADJ"]) 

# creates `is_outlier` column with either True or False values, 
# so that you could filter your dataframe accordingly
df["is_outlier"] = df["zscore"].apply(lambda x: x <= -1.96 or x >= 1.96)

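【Comments】:

Thanks for the quick reply. I got an error in this code: df["is_outlier"] = df["zscore"].apply(x: x <= -1.96 or x >= 1.96)
@swe2010 Glad I could help. I also needed to compute z-score values for my thesis research, haha. There may be better ways, but this works fine for me.
@swe2010 By the way, don't forget to accept the correct answer so that this post doesn't show up as "unanswered".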

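【Answer 3】:

I take "remove the outliers" to mean "remove from the df data frame the rows that contain an outlier in the 'Pre_TOTAL_PURCHASE_ADJ' column." If that is incorrect, perhaps you could revise the question to make your meaning clear.

Sample data would also help, rather than forcing potential answerers to make up their own.

Avoiding iteration over the rows of a data frame is usually more efficient. For row selection, so-called Boolean array indexing is a fast way to achieve your goal. Since you already have a predicate (a function returning a truth value) that identifies the rows you want to exclude, you can use it to build another data frame containing only the outliers or (by negating the predicate) only the non-outliers.

Since @political_scientist has already given a practical solution, using scipy.stats.zscore to generate the predicate values in a new is_outlier column, I will leave this answer as simple, general advice for working in numpy and pandas. Given that answer, the rows you want are given by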

df[~df['is_outlier']]

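It might be easier to understand, though, if the negation (~) were built into the selector column when it is generated rather than applied in the indexing, with the column renamed 'is_not_outlier'.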
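
To make the Boolean-indexing idea concrete, here is a minimal sketch under a few assumptions: the sample values are made up for illustration (the question provides none), the z-score is computed with pandas alone using the population standard deviation (ddof=0, which is also the default of scipy.stats.zscore), and the threshold of 3 is taken from the question rather than the 1.96 used in the other answer.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame(
    {"Pre_TOTAL_PURCHASE_ADJ": [10, 12, 11, 13, 12] * 4 + [500]}
)

col = df["Pre_TOTAL_PURCHASE_ADJ"]

# Population z-score (ddof=0), computed for the whole column at once
z = (col - col.mean()) / col.std(ddof=0)

# Boolean Series: True where the value lies more than 3 standard deviations out
is_outlier = z.abs() > 3

outliers_df = df[is_outlier]    # only the rows flagged as outliers
cleaned_df = df[~is_outlier]    # the same data frame with those rows dropped

print(outliers_df)
print(len(cleaned_df))
```

Because the mask is a Boolean Series aligned with df, no explicit loop over rows is needed, and the original df is left unchanged.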

【Comments】:

【Answer 4】:

Here are 2 methods for one-dimensional datasets.

Part 1: using upper and lower limits of 3 standard deviations

import numpy as np

# Function to detect outliers in one-dimensional datasets.
def find_anomalies(data):
    # Use a local list so repeated calls do not accumulate results
    anomalies = []

    # Set upper and lower limits to 3 standard deviations from the mean
    data_std = np.std(data)
    data_mean = np.mean(data)
    anomaly_cut_off = data_std * 3

    lower_limit = data_mean - anomaly_cut_off
    upper_limit = data_mean + anomaly_cut_off

    # Collect the values that fall outside the limits
    for outlier in data:
        if outlier > upper_limit or outlier < lower_limit:
            anomalies.append(outlier)
    return anomalies
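
Since the question asks for removal rather than detection, the same limits can be turned into a filter. The following is only a sketch: remove_anomalies, its n_std parameter, and the sample values are hypothetical names and data introduced here, not part of the original answer.

```python
import numpy as np

def remove_anomalies(data, n_std=3):
    """Return the values that lie within n_std standard deviations of the mean."""
    data = np.asarray(data, dtype=float)
    cut_off = np.std(data) * n_std
    mean = np.mean(data)
    # Boolean mask: True for values inside [mean - cut_off, mean + cut_off]
    mask = (data >= mean - cut_off) & (data <= mean + cut_off)
    return data[mask]

# Made-up sample: 20 ordinary values and one extreme value
sample = np.array([10, 12, 11, 13, 12] * 4 + [500])
cleaned = remove_anomalies(sample)
print(cleaned.size)
```

Note that with very small samples a single extreme value inflates the standard deviation enough that it may never exceed the 3-sigma limit, so this rule works best on reasonably large datasets.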

Part 2: using the IQR (interquartile range)

q1, q3= np.percentile(data,[25,75]) # get percentiles
iqr = q3 - q1 # the IQR value
lower_bound = q1 - (1.5 * iqr) # lower bound
upper_bound = q3 + (1.5 * iqr) # upper bound

np.sum(data > upper_bound) # how many datapoints are above the upper bound?
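
The bounds above can likewise be used to drop the outliers instead of counting them. This is a sketch under stated assumptions: remove_outliers_iqr and its k parameter are hypothetical names introduced for illustration, and the sample array is made up.

```python
import numpy as np

def remove_outliers_iqr(data, k=1.5):
    """Return the values within [q1 - k*IQR, q3 + k*IQR]."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Keep only the values between the two bounds (inclusive)
    return data[(data >= lower_bound) & (data <= upper_bound)]

# Made-up sample with one extreme value
sample = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 500])
cleaned = remove_outliers_iqr(sample)
print(cleaned)
```

Unlike the standard-deviation rule, the quartiles are barely affected by a few extreme values, so this variant also behaves sensibly on small samples.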

【Comments】: