Keeping NaNs with pandas dataframe inequalities: answers

【Question title】: Keeping NaNs with pandas dataframe inequalities
【Posted】: 2021-10-05 02:17:08
【Question description】:

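I have a pandas.DataFrame object with roughly 100 columns and 200,000 rows of data. I am trying to convert it to a boolean dataframe where True means the value is greater than or equal to the threshold, False means it is less than the threshold, and NaN values are maintained.

If there are no NaN values, this takes about 60 ms to run: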

df >= threshold

But when I try to handle the NaNs, the following approach works but is slow (20 s).

def func(x):
    if x >= threshold:
        return True
    elif x < threshold:
        return False
    else:
        return x
df.apply(lambda x: x.apply(lambda x: func(x)))

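Is there a faster way?

【Question comments】:

Try replacing your func with the following line: return x >= threshold if x is not None else x. It may be faster. By the way, why are you assigning two lambda x? df.apply(func) would do the trick.
@DeepSpace It took the same amount of time.

Tags: python pandas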

【Solution 1】:

You can do:

new_df = df >= threshold
new_df[df.isnull()] = np.nan

But this is different from what you get with the apply method. Here your mask has float dtype, containing NaN, 0.0 and 1.0. In the apply solution you get object dtype, containing NaN, False and True.
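A minimal sketch of the float-encoded result described above. The sample frame and threshold are hypothetical, and the explicit astype(float) is an addition that is not in the answer: it pins the result to float dtype instead of relying on how a given pandas version upcasts a boolean frame when NaN is assigned into it.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and threshold, for illustration only.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0], "b": [np.nan, 3.0, 0.1]})
threshold = 1.0

# Compare, then cast to float so the frame can hold NaN next to 0.0/1.0.
new_df = (df >= threshold).astype(float)

# Put NaN back wherever the input was missing.
new_df[df.isnull()] = np.nan

print(new_df.dtypes)
print(new_df)
```

Checking new_df.dtypes on your own data is a quick way to confirm which of the two encodings you actually ended up with.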

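Neither should be used as a mask, because you might not get what you want. IEEE says that any NaN comparison (other than !=) must yield False, and the apply method implicitly violates that by returning NaN!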

The best option is to keep track of the NaNs separately; df.isnull() is quite fast when bottleneck is installed.
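One way to read "keep track of the NaNs separately" is to hold two plain boolean frames side by side. The sketch below assumes hypothetical sample data, and the names above, missing, n_above and n_below are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and threshold, for illustration only.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0], "b": [np.nan, 3.0, 0.1]})
threshold = 1.0

above = df >= threshold   # bool frame; NaN entries compare as False
missing = df.isnull()     # bool frame marking where the input was NaN

# Both frames are usable as masks. For example, count the values at or
# above the threshold, and the values known to be below it.
n_above = int(above.sum().sum())
n_below = int((~above & ~missing).sum().sum())

print(n_above, n_below)
```

Because both frames stay bool dtype, they index and combine like ordinary masks, and the missing information is never lost.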

【Comments】:

【Solution 2】:

You can check for NaNs separately, as described in this post: Python - find integer index of rows with NaN in pandas

df.isnull()

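Use bitwise OR to combine the output of isnull with df >= threshold (the parentheses matter, because | binds more tightly than >=):

df.isnull() | (df >= threshold)

You can expect computing and combining the two masks to take closer to 200 ms, but that should be far enough from 20 s to be acceptable.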
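A small sketch of combining the two masks, on hypothetical sample data. It shows the parenthesized bitwise form next to the np.logical_or form suggested in the comments; both are assumed to run on a reasonably recent pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and threshold, for illustration only.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0], "b": [np.nan, 3.0, 0.1]})
threshold = 1.0

# Parentheses force the comparison to run before the bitwise OR.
combined = df.isnull() | (df >= threshold)

# Equivalent element-wise OR through NumPy.
alt = np.logical_or(df.isnull(), df >= threshold)

print(combined)
```

Note that the combined mask is True at the NaN positions rather than NaN, so it flags missing values instead of preserving them.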

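【Comments】:

Do you have any thoughts on how to combine them? That is the route I thought I needed to take as well.
This did not work for me. I tried it in python 2.7.1, pandas 0.17.0 (what I normally use) and got a NotImplementedError, then I tried it in python 3.4.3, pandas 0.17.0 and got: 'bitwise_or' not supported for the input type
Try np.logical_or(df.isnull(), df >= threshold) instead. Here is my timing notebook: nbviewer.ipython.org/gist/ocefpaf/4539348e5ed71f7fe94f
OK, I missed the "NaN values are maintained" part. This is not pretty and it is still slow (but faster than apply): df = df_nans >= threshold; df[df_nans.isnull()] = np.nan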

【Solution 3】:

Another option is to use mask:

df.mask(~df.isna(), df >= threshold)

This applies the condition only to the non-NaN values and leaves the NaN values untouched.
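To make the mask call concrete, here is a minimal sketch on hypothetical sample data. The resulting dtype is not asserted here; it is likely object (holding True, False and NaN), so inspect result.dtypes on your pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and threshold, for illustration only.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0], "b": [np.nan, 3.0, 0.1]})
threshold = 1.0

# Where the input is not NaN, substitute the comparison result;
# where it is NaN, keep the original (NaN) value.
result = df.mask(~df.isna(), df >= threshold)

print(result)
print(result.dtypes)
```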

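【Comments】:

【Solution 4】:

In this situation I use a floating-point indicator array, coded as 0=False, 1=True and NaN=missing. A pandas DataFrame with bool dtype cannot hold missing values, and a DataFrame with object dtype holding a mix of Python bool and float objects is not efficient. That leads us to a DataFrame with np.float64 dtype. numpy.sign(x - threshold) gives -1 for x < threshold, 0 for x == threshold and +1 for x > threshold, with NaN passed through, which might be good enough for your purposes. If you really need the 0/1 encoding, the conversion can be done in place. The timings below are for an array x of length 200K: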

In [45]: %timeit y = (x > 0); y[pd.isnull(x)] = np.nan
100 loops, best of 3: 8.71 ms per loop

In [46]: %timeit y = np.sign(x)
100 loops, best of 3: 1.82 ms per loop

In [47]: %timeit y = np.sign(x); y += 1; y /= 2
100 loops, best of 3: 3.78 ms per loop

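【Comments】:

I should have mentioned that all three methods above give you a DataFrame y of dtype np.float64, and all of them preserve NaN. The second method gives a -1/1 encoding for False/True, the others a 0/1 encoding. y = (1 + np.sign(x)) / 2 is also competitive.
This may not give you what you want when there is exact equality. If x == threshold, np.sign(x - threshold) is 0, so the final result is 0.5 rather than the 1 that x >= threshold calls for. If equality is possible, you could go with y = (1 + np.sign(eps + x - threshold)) / 2, where eps = np.finfo(np.float64).eps.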
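The equality caveat is easy to see on a tiny array. The sketch below uses hypothetical sample values where one entry equals the threshold exactly, and contrasts the rescaled sign encoding with a comparison-based 0/1 encoding. The second variant is an alternative added for illustration, not something proposed in the thread.

```python
import numpy as np

# Hypothetical sample data; the second entry equals the threshold exactly.
x = np.array([0.5, 1.0, 3.0, np.nan])
threshold = 1.0

# Sign encoding rescaled to 0/1: NaN propagates, but equality lands on 0.5.
y_sign = (1 + np.sign(x - threshold)) / 2

# Comparison-based alternative (not from the thread): cast the boolean
# result to float, then put NaN back where the input was missing.
y_cmp = (x >= threshold).astype(np.float64)
y_cmp[np.isnan(x)] = np.nan

print(y_sign)
print(y_cmp)
```

One caution about the eps shift: np.finfo(np.float64).eps is the spacing of floats near 1.0, so for values of much larger magnitude eps + x rounds back to x and the shift has no effect. The comparison-based form does not depend on the scale of the data.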