pandas: efficient way to filter out a few rows (outliers) from a large DataFrame

【Question title】: pandas: efficient way to filter out a few rows (outliers) from a large DataFrame
【Posted】: 2020-06-22 11:39:50
【Question】:

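I am looking for an efficient way to filter out a few rows (outliers) from a large DataFrame. According to https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#delete, the recommendation is to select the rows that should be kept rather than delete the unwanted ones. Here is a sample DataFrame: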

In [288]: dai                                                                                        
Out[288]: 
                   x   y
frame face lmark        
1     NaN  NaN   NaN NaN
2     NaN  NaN   NaN NaN
3     NaN  NaN   NaN NaN
4     NaN  NaN   NaN NaN
5     NaN  NaN   NaN NaN
...               ..  ..
5146  NaN  NaN   NaN NaN
5147  NaN  NaN   NaN NaN
5148  NaN  NaN   NaN NaN
5149  NaN  NaN   NaN NaN
5150  NaN  NaN   NaN NaN

[312814 rows x 2 columns]

Its index is already sorted:

In [295]: dai.equals(dai.sort_index())                                                               
Out[295]: True

Now I extract the unique, sorted values of the frame index level, except the last one (frame 5150):

In [305]: frames = dai.index.get_level_values('frame').drop_duplicates().sort_values()[:-1]

In [306]: frames                                                                                     
Out[306]: 
Int64Index([   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,
            ...
            5140, 5141, 5142, 5143, 5144, 5145, 5146, 5147, 5148, 5149],
           dtype='int64', name='frame', length=5149)

Then I use .loc to select those rows of the DataFrame:

In [307]: dai.loc[frames]                                                                            
Out[307]: 
                   x   y
frame face lmark        
1     NaN  NaN   NaN NaN
2     NaN  NaN   NaN NaN
3     NaN  NaN   NaN NaN
4     NaN  NaN   NaN NaN
5     NaN  NaN   NaN NaN
...               ..  ..
5145  NaN  NaN   NaN NaN
5146  NaN  NaN   NaN NaN
5147  NaN  NaN   NaN NaN
5148  NaN  NaN   NaN NaN
5149  NaN  NaN   NaN NaN

The result is correct, but it takes longer than expected:

In [308]: timeit dai.loc[frames]                                                                     
7.31 s ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [309]: prun -l 4 dai.loc[frames]                                                                  
         1159551 function calls (1138939 primitive calls) in 7.753 seconds

   Ordered by: internal time
   List reduced from 253 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5148    3.544    0.001    3.544    0.001 base.py:241(_outer_indexer)
    10298    1.963    0.000    1.963    0.000 {method 'searchsorted' of 'numpy.ndarray' objects}
    10298    0.811    0.000    0.900    0.000 base.py:1588(is_monotonic_increasing)
     5149    0.413    0.000    0.413    0.000 {method 'nonzero' of 'numpy.ndarray' objects}

Is there any way to improve the performance?


Tags: pandas performance dataframe indexing

【Solution 1】:

I found that filtering a DataFrame that has the default RangeIndex is much faster than filtering one that has a MultiIndex.
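The answer gives no code, so the following is only a minimal sketch of what it might look like. It assumes a small, made-up stand-in for `dai` (the real data is not shown in full), and the names `flat`, `last_frame`, `kept` and `result` are hypothetical. The idea is to move the index levels into ordinary columns with `reset_index()`, keep the wanted rows with a boolean mask, and restore the MultiIndex afterwards if it is still needed.

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-in for `dai`: a sorted (frame, face, lmark) MultiIndex.
index = pd.MultiIndex.from_product(
    [range(1, 6), [0], range(3)], names=["frame", "face", "lmark"]
)
dai = pd.DataFrame(
    np.arange(len(index) * 2, dtype=float).reshape(-1, 2),
    index=index,
    columns=["x", "y"],
)

# Move the index levels into ordinary columns; `flat` now has a RangeIndex.
flat = dai.reset_index()

# Keep every row except those belonging to the last frame, via a boolean mask.
last_frame = flat["frame"].max()
kept = flat[flat["frame"] != last_frame]

# Restore the original MultiIndex if later code depends on it.
result = kept.set_index(["frame", "face", "lmark"])
```

Whether this beats `dai.loc[frames]` on the real data, and by how much, depends on the data and the pandas version, so it is worth timing both. A mask built directly from `dai.index.get_level_values('frame')` selects the same rows without the `reset_index()` / `set_index()` round trip.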
