Pandas: vectorized operations on maximum values per row

【Question title】: Pandas: vectorized operations on maximum values per row
【Posted】: 2016-06-20 08:36:04
【Question】:

I have the following pandas DataFrame df:

index        A    B    C
    1        1    2    3
    2        9    5    4
    3        7    12   8
    ...      ...  ...  ...

I want the maximum value in each row to stay unchanged and every other value to become -1. The output would then look like this:

index        A    B    C
    1       -1   -1    3
    2        9   -1   -1
    3       -1    12  -1
    ...      ...  ...  ...

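Using df.max(axis=1), I get a pandas Series holding the maximum of each row. However, I am not sure how best to use these maxima to build the result I need. I am looking for a fast, vectorized implementation.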

【Question discussion】:

Tags: python pandas max dataframe vectorization

【Solution 1】:

Consider using where:

>>> df.where(df.eq(df.max(1), 0), -1)
       A   B  C
index          
1     -1  -1  3
2      9  -1 -1
3     -1  12 -1

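Here df.eq(df.max(1), 0) is a boolean DataFrame that marks the row maxima; True values (the maxima) are left unchanged, while False values become -1. If you prefer, you can also pass a Series or another DataFrame in place of the scalar.

The operation can also be done in place (by passing inplace=True).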
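
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the example frame from the question and spells out the axis arguments by keyword. The names row_max, is_max and result are only illustrative and are not part of the answer above.

```python
import pandas as pd

# Example frame from the question (index 1..3, columns A, B, C).
df = pd.DataFrame(
    {"A": [1, 9, 7], "B": [2, 5, 12], "C": [3, 4, 8]},
    index=pd.Index([1, 2, 3], name="index"),
)

row_max = df.max(axis=1)         # Series: maximum of each row
is_max = df.eq(row_max, axis=0)  # boolean frame: True where a cell equals its row maximum
result = df.where(is_max, -1)    # keep cells where is_max is True, otherwise use -1

print(result)
```

Without inplace=True, where returns a new DataFrame and df itself is left as it was.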

【Discussion】:

This is the more concise answer.

【Solution 2】:

You can build a boolean mask by comparing each row against its max with eq, and then assign through the inverted mask:

print(df)
       A   B  C
index          
1      1   2  3
2      9   5  4
3      7  12  8

print(df.max(axis=1))
index
1     3
2     9
3    12
dtype: int64

mask = df.eq(df.max(axis=1), axis=0)
print(mask)
           A      B      C
index                     
1      False  False   True
2       True  False  False
3      False   True  False

df[~mask] = -1
print(df)
       A   B  C
index          
1     -1  -1  3
2      9  -1 -1
3     -1  12 -1

All together:

df[~df.eq(df.max(axis=1), axis=0)] = -1
print(df)
       A   B  C
index          
1     -1  -1  3
2      9  -1 -1
3     -1  12 -1
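
If overwriting df is not wanted, the same mask idea can be sketched without mutation using DataFrame.mask, which replaces the cells where the condition is True and returns a new frame. This variant is not part of the answer above, and the names not_max and out are only illustrative.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {"A": [1, 9, 7], "B": [2, 5, 12], "C": [3, 4, 8]},
    index=pd.Index([1, 2, 3], name="index"),
)

# True wherever a cell differs from its row maximum.
not_max = df.ne(df.max(axis=1), axis=0)

# mask() replaces the cells where the condition is True and returns a new
# frame, so df itself is left untouched.
out = df.mask(not_max, -1)

print(out)
```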

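【Discussion】:

【Solution 3】:

Create a new DataFrame of the same size as df with every value set to -1. Then use enumerate over the position of the first maximum in each row, and copy that value across with integer-based scalar access (iat).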

df2 = pd.DataFrame(-np.ones(df.shape), columns=df.columns, index=df.index)

for row, col in enumerate(np.argmax(df.values, axis=1)):
    df2.iat[row, col] = df.iat[row, col]

>>> df2
   0   1  2
0 -1  -1  3
1  9  -1 -1
2 -1  12 -1
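
The Python-level loop can probably be avoided as well. The following is a sketch, not the code that was timed in this answer: it uses NumPy integer-array indexing to copy the first maximum of each row in one step. It assumes every column shares one signed numeric dtype; the names values, rows, cols and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame(
    {"A": [1, 9, 7], "B": [2, 5, 12], "C": [3, 4, 8]},
    index=pd.Index([1, 2, 3], name="index"),
)

values = df.to_numpy()
rows = np.arange(values.shape[0])
cols = values.argmax(axis=1)  # position of the first maximum in each row

# Start from an array of -1 with the same shape and dtype,
# then copy only the row maxima across.
filled = np.full_like(values, -1)
filled[rows, cols] = values[rows, cols]

# Reattach the original labels so index and columns are preserved.
df2 = pd.DataFrame(filled, index=df.index, columns=df.columns)
print(df2)
```

Because the result is built from df.index and df.columns, the labels of the original frame survive, and np.full_like keeps the integer dtype instead of switching to floats.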

Timings

df = pd.DataFrame(np.random.randn(10000, 10000))

%%timeit
df2 = pd.DataFrame(-np.ones(df.shape))
for row, col in enumerate(np.argmax(df.values, axis=1)):
    df2.iat[row, col] = df.iat[row, col]
1 loops, best of 3: 1.19 s per loop

%timeit df.where(df.eq(df.max(1), 0), -1)
1 loops, best of 3: 6.27 s per loop

# Using inplace=True
%timeit df.where(df.eq(df.max(1), 0), -1, inplace=True)
1 loops, best of 3: 5.58 s per loop

%timeit df[~df.eq(df.max(axis=1), axis=0)] = -1
1 loops, best of 3: 5.65 s per loop
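
The numbers above come from IPython's %timeit magics. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The frame here is deliberately smaller (1000 x 1000) so the script finishes quickly, the function names are hypothetical, and no timings are claimed: results will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 10000 x 10000 frame above so the script finishes quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 1000)))


def with_where(frame):
    # Keep every cell equal to its row maximum, replace the rest with -1.
    return frame.where(frame.eq(frame.max(axis=1), axis=0), -1)


def with_iat_loop(frame):
    # Fill a frame with -1, then copy the first maximum of each row with iat.
    out = pd.DataFrame(-np.ones(frame.shape), index=frame.index, columns=frame.columns)
    for row, col in enumerate(np.argmax(frame.to_numpy(), axis=1)):
        out.iat[row, col] = frame.iat[row, col]
    return out


for fn in (with_where, with_iat_loop):
    # Best of three single runs, in seconds.
    seconds = min(timeit.repeat(lambda: fn(df), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds:.4f} s")
```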

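【Discussion】:

I think you lose the index and columns of the DataFrame, but if the OP doesn't need them, your answer wins.
Yes, I never set them to begin with. I have edited the construction of df2 above.
Also note that df.where(df.eq(df.max(1), 0), -1) returns a new DataFrame; the other two methods modify an existing DataFrame. On my machine, passing inplace=True makes where a little faster (though only slightly).
(Also, np.argmax(df.values, axis=1) only gives the index of the first maximum in each row, so any other maxima in that row become -1. This may or may not be what the OP wants, but it is probably worth flagging.)
Yes, that is one of the reasons it is faster. Given the OP's need for speed and the lack of any detail about keeping repeated maxima within a row, I took the liberty of doing it this way.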
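
To make the tie behaviour discussed in these comments concrete, here is a small sketch on a hypothetical frame (not from the question) whose first row holds its maximum twice. The names tied, keep_all and first_only are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a tie in the first row (5 appears twice).
tied = pd.DataFrame({"A": [5, 1], "B": [5, 2], "C": [0, 3]})

# eq-based mask: every cell equal to its row maximum is kept.
keep_all = tied.where(tied.eq(tied.max(axis=1), axis=0), -1)

# argmax-based selection: only the first maximum in each row is kept.
values = tied.to_numpy()
rows = np.arange(values.shape[0])
cols = values.argmax(axis=1)
first_only = np.full_like(values, -1)
first_only[rows, cols] = values[rows, cols]

print(keep_all.to_numpy().tolist())  # [[5, 5, -1], [-1, -1, 3]]
print(first_only.tolist())           # [[5, -1, -1], [-1, -1, 3]]
```

Which of the two is right depends on whether repeated maxima should all survive, which the question does not specify.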