【Question title】: Pandas: remove outliers to replace the NaN with the mean
【Posted】: 2021-06-22 07:10:41
【Question】:

I have a DataFrame h3_15min with 3454 rows:

     Timestamp              KE          EH   LA     PR          AG
0    2013-02-27 00:00:00    1.000000    2.0  0.03   201.289993  4.36
1    2013-02-27 00:15:00    0.990000    2.0  0.03   210.070007  4.38
2    2013-02-27 00:30:00    0.950000    2.0  0.02   207.779999  4.35
3    2013-02-27 00:45:00    0.990000    2.0  0.03   151.960007  4.34
4    2013-02-27 01:00:00    341.209991  2.0  0.04   0.000000    4.41
... ... ... ... ... ... ...
3449 2013-04-03 22:15:00    NaN         2.0  0.03   0.000000    NaN
3450 2013-04-03 22:30:00    NaN         NaN  0.07   0.000000    NaN
3451 2013-04-03 22:45:00    NaN         NaN  NaN    0.000000    NaN
3452 2013-04-03 23:00:00    NaN         NaN  NaN    0.000000    NaN
3453 2013-04-03 23:15:00    NaN         NaN  NaN    0.000000    NaN

Its summary statistics (describe() output) are:

        KE          EH          LA          PR          AG
count   3439.000000 3450.000000 3451.000000 3454.000000 3416.000000
mean    7.361526    60.447796   20.266174   17.185938   506.416779
std     48.624306   286.459686  168.753860  59.658848   623.306396
min     0.000000    2.000000    0.000000    0.000000    4.010000
25%     0.970000    2.000000    0.020000    0.000000    170.047501
50%     0.990000    2.000000    0.040000    0.000000    245.834991
75%     0.990000    2.000000    0.080000    0.000000    526.140015
max     652.210022  2199.290039 2214.550049 278.029999  3543.469971

I want to remove the outliers so that I can compute the mean and replace the NaN values with it.

I tried the following code, taken from [this][1] post:

h3_15min[(np.abs(stats.zscore(h3_15min.loc[:, h3_15min.columns != "Timestamp" ])) < 3)]

This leaves h3_15min with 3272 rows:

        Timestamp           KE          EH  LA      PR          AG
3       2013-02-27 00:45:00 0.990000    2.0 0.03    151.960007  4.34
4       2013-02-27 01:00:00 341.209991  2.0 0.04    0.000000    4.41
5       2013-02-27 01:15:00 1.000000    2.0 0.02    0.000000    4.29
6       2013-02-27 01:30:00 0.990000    2.0 0.04    0.000000    4.19
7       2013-02-27 01:45:00 0.990000    2.0 0.01    0.000000    4.15
... ... ... ... ... ... ...
3449    2013-04-03 22:15:00 NaN        2.0  0.03    0.000000    NaN
3450    2013-04-03 22:30:00 NaN        NaN  0.07    0.000000    NaN
3451    2013-04-03 22:45:00 NaN        NaN  NaN     0.000000    NaN
3452    2013-04-03 23:00:00 NaN        NaN  NaN     0.000000    NaN
3453    2013-04-03 23:15:00 NaN        NaN  NaN     0.000000    NaN

It seems that it did not remove the largest outliers but only dropped some random rows.
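
A likely explanation for the seemingly random rows, offered as a sketch rather than a confirmed diagnosis: `scipy.stats.zscore` defaults to `nan_policy='propagate'`, so every column that contains a NaN gets NaN z-scores throughout, and `NaN < 3` is always `False`. In the data above only PR has no missing values (count 3454), so the filter would effectively act on PR alone. That is consistent with the output: rows 0 to 2 have PR z-scores of roughly 3.1 to 3.2, while row 4 with KE = 341.209991 survives. The snippet below reproduces the effect with plain NumPy on a made-up toy frame; the column names `a` and `b` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (made up): column "a" has one NaN and one large value,
# column "b" is complete.
toy = pd.DataFrame({
    "a": [1.0, 1.0, 1.0, 500.0, np.nan],
    "b": [0.0, 0.0, 0.0, 0.0, 10.0],
})

values = toy.to_numpy()

# Plain z-score with NaN propagation, mirroring what scipy.stats.zscore
# computes with its defaults (ddof=0, nan_policy="propagate").
z_propagate = (values - values.mean(axis=0)) / values.std(axis=0)

# NaN-aware variant: pandas skips NaN when computing mean and std.
z_nan_aware = (toy - toy.mean()) / toy.std(ddof=0)

mask_propagate = np.abs(z_propagate) < 3
print(mask_propagate[:, 0])  # column "a": every comparison is False
print(z_nan_aware["a"])      # finite for the rows that are not missing
```

With the NaN-aware variant, each column is scored against its own non-missing values, so a single NaN no longer disables the whole column.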

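For KE the outliers are > 1, for EH > 2, for LA > 1, and for PR > 300. Any ideas on how to remove the outliers from the DataFrame without having to type the condition manually for every column? My other dataset has 50 columns, so it would be great if this could be done automatically.

[1]: Detect and exclude outliers in Pandas data frame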

【Discussion】:

    Tags: python pandas dataframe nan outliers


    【Solution 1】:
    outliers = (h3_15min.KE > 1) & (h3_15min.EH > 2) & (h3_15min.LA > 1) & (h3_15min.PR > 300)
    no_outliers = h3_15min.loc[~outliers]
    

    That should do the trick.
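
One thing to check in the snippet above: combining the conditions with `&` flags a row only when all four columns exceed their limits at the same time, which may never happen. If a row should count as an outlier when any single column exceeds its limit, `|` is the operator to use. A minimal sketch on made-up data follows, with the limits from the question collected in a hypothetical `limits` Series so that more columns can be added without writing each condition by hand:

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names come from the question.
df = pd.DataFrame({
    "KE": [0.99, 341.21, 0.95, np.nan],
    "EH": [2.0, 2.0, 2199.29, 2.0],
    "LA": [0.03, 0.04, 0.02, 0.03],
    "PR": [201.29, 0.0, 0.0, 0.0],
})

# Per-column upper limits as stated in the question.
limits = pd.Series({"KE": 1, "EH": 2, "LA": 1, "PR": 300})

# True where a value exceeds its column's limit; NaN compares as False.
exceeds = df[limits.index].gt(limits, axis="columns")

# A row is an outlier if ANY column exceeds its limit (the "|" logic).
outliers = exceeds.any(axis=1)
no_outliers = df.loc[~outliers]
print(no_outliers)
```

Note that this drops whole rows. If the goal is to later fill gaps with the mean, replacing only the offending cells with NaN (for example with `df.mask(exceeds)`) may be closer to what the question asks for.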

    【Comments】:

    • Thanks for the suggestion. Unfortunately, when I run the code the outliers are still there.
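
For the automated case asked about in the question, one possible approach, sketched here under assumptions rather than taken from this thread, is to mask outliers column by column with a NaN-aware rule and then fill every NaN with the mean of the remaining values. The sketch uses the 1.5 × IQR fence as the outlier rule. That rule, the factor `k`, and the helper name `mask_outliers_iqr` are assumptions for illustration and may not match the thresholds the asker has in mind (for a column like PR, where both quartiles are 0, every non-zero value would be flagged).

```python
import numpy as np
import pandas as pd

def mask_outliers_iqr(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return the numeric columns with values outside the
    [Q1 - k*IQR, Q3 + k*IQR] fences replaced by NaN (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    inside = (numeric.ge(q1 - k * iqr, axis="columns")
              & numeric.le(q3 + k * iqr, axis="columns"))
    # where() keeps values where the condition holds and puts NaN elsewhere;
    # existing NaN stay NaN because comparisons with NaN are False.
    return numeric.where(inside)

# Made-up data with the same kind of problems as in the question.
df = pd.DataFrame({
    "Timestamp": pd.date_range("2013-02-27", periods=6, freq="15min"),
    "KE": [1.0, 0.99, 0.95, 0.99, 341.21, np.nan],
    "EH": [2.0, 2.0, 2.0, 2.0, 2.0, np.nan],
})

cleaned = mask_outliers_iqr(df)
# Column means are computed without the masked outliers (mean skips NaN).
filled = cleaned.fillna(cleaned.mean())

result = df.copy()
result[filled.columns] = filled
print(result)
```

Because the helper selects numeric columns by dtype, the same call should work unchanged on a frame with 50 columns; the Timestamp column is left untouched.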