Fastest way of computing a row mode on a dataframe with binary entries: answers

【Question title】: Fastest way of computing a row mode on binary entries dataframe
【Posted】: 2021-01-24 13:24:07
【Question description】:

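I am trying to optimize a piece of code that finds the row mode of a dataframe with boolean entries. The row mode here is not the mode of each column, but the row vector that is repeated most often.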

I have a working approach:

import numpy as np
import pandas as pd
some_binary_entry_dataframe = pd.DataFrame(np.random.rand(10, 300) < 0.5)
pd.util.hash_pandas_object(some_binary_entry_dataframe, index=False).mode()

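However, I find that this gets slow quickly: it takes about 100 ms for a dataframe of shape 20x300, and it has become the bottleneck of my code. How can this be optimized in pandas or numpy?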

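Edit 1: A more detailed example of what I want the code to do. I am trying to filter out the rows that do not match the majority (the mode):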

import numpy as np
import pandas as pd
entries = np.zeros((3, 3))
entries[1:, 0] = 1
# entries = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
__df = pd.DataFrame(entries.astype(bool))
# Hash every row, then keep only the rows whose hash is the most frequent one
row_hashes = pd.util.hash_pandas_object(__df, index=False)
mask = row_hashes.isin(row_hashes.mode())
__df = __df[mask]
# __df.values.astype(int) = [[1, 0, 0], [1, 0, 0]]

After profiling, the CPU time is spent mostly in the call to pd.util.hash_pandas_object, so that is the part I am trying to optimize.

Edit 2: I have replaced the hashing with __df.apply(lambda x : hash(tuple(x)), axis=1) and got a decent speedup.
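For reference, the replacement described in Edit 2 can be wrapped into a small function as sketched below. The function name binary_mode_mask_tuple is made up for illustration, and the rest of the filtering logic is assumed to stay the same as in Edit 1.

```python
import numpy as np
import pandas as pd

def binary_mode_mask_tuple(df):
    # Hypothetical helper name; hashes each row as a tuple of its values,
    # as described in Edit 2, instead of calling pd.util.hash_pandas_object
    row_hashes = df.apply(lambda x: hash(tuple(x)), axis=1)
    # Keep the rows whose hash equals (one of) the most frequent hashes
    return row_hashes.isin(row_hashes.mode())

# Same toy data as in Edit 1
entries = np.zeros((3, 3))
entries[1:, 0] = 1
df = pd.DataFrame(entries.astype(bool))
mask = binary_mode_mask_tuple(df)
filtered = df[mask]
```

On the toy data from Edit 1, this should keep the last two rows and drop the first one.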

【Question comments】:

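I can't get pd.util.hash_pandas_object(some_binary_entry_dataframe, index=False).mode() to work. Could you add a sample dataframe and the expected output?
@Divakar I have added a line that generates some test data. I may add a relevant input/output example later.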

Tags: python-3.x pandas numpy optimization

【Solution 1】:

This more "manual" approach seems to be considerably faster:

from collections import Counter
import numpy as np

def binary_mode_mask_counter(a):
    a = np.asarray(a)
    # Convert every row into a big integer value
    h = np.array([sum(int(v) << i for i, v in enumerate(r)) for r in a], dtype=object)
    # Count frequencies
    c = Counter(h)
    # Get most frequent values
    _, max_count = c.most_common(1)[0]
    ms = [m for m, n in c.items() if n == max_count]
    # Return mask
    return np.isin(h, ms)

Comparison:

import numpy as np
import pandas as pd

# Original method
def binary_mode_mask_pd(a):
    h = pd.util.hash_pandas_object(a, index=False)
    m = h.mode()
    return h.isin(m)

# Benchmark
np.random.seed(0)
a = pd.DataFrame(np.random.rand(20, 300) < 0.5)
%timeit binary_mode_mask_counter(a)
# 2 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit binary_mode_mask_pd(a)
# 81.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
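The same idea of collapsing each row into one compact key can also be sketched without a Python-level loop. The version below is not part of the answer above and has not been benchmarked here. It assumes np.packbits for packing the booleans into bytes and np.unique with axis=0 for counting identical rows, and the function name binary_mode_mask_packbits is made up for illustration.

```python
import numpy as np

def binary_mode_mask_packbits(a):
    # Hypothetical alternative, not from the original answer
    a = np.asarray(a, dtype=bool)
    # Pack each row of booleans into bytes (8 columns per byte)
    packed = np.packbits(a, axis=1)
    # Group identical packed rows and count how often each one occurs
    _, inverse, counts = np.unique(
        packed, axis=0, return_inverse=True, return_counts=True
    )
    # Flatten defensively, since the shape of `inverse` has varied across NumPy versions
    inverse = inverse.ravel()
    # A row is kept if its group has the maximal count (ties are all kept)
    return counts[inverse] == counts.max()

example = np.array([[0, 0, 0], [1, 0, 0], [1, 0, 0]], dtype=bool)
example_mask = binary_mode_mask_packbits(example)
tie_mask = binary_mode_mask_packbits(np.array([[0, 1], [1, 0]], dtype=bool))
```

Like the Counter version, this keeps every row that ties for the highest frequency. Whether it is actually faster for a given shape would need to be timed with %timeit as in the benchmark above.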

【Discussion】: