Modin df iterrows is painfully slow. Any alternative to speed it up?

【Question title】: Modin df iterrows is painfully slow. Any alternative to speed it up?
【Posted】: 2021-04-23 05:43:01
【Question】:

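I have a Modin DataFrame with ~120k rows, and I want to coalesce some of its columns. Modin's iterrows takes a very long time, so I tried numpy.where instead. On the equivalent pandas DataFrame, numpy.where finishes in 5-10 minutes, but the same operation on the Modin DataFrame takes about 30 minutes. Is there an alternative that speeds this task up for Modin DataFrames?

[cols_to_be_coalesced] --> this list holds the columns to be coalesced. It contains 10-15 columns.

Code: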

for col in cols_to_be_coalesced:
    # assumes the fallback column for each name is called '<name>_X'
    df[col] = np.where(df[col] != '', df[col], df[f'{col}_X'])

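If df is a pandas DataFrame, this runs in about 10 minutes, but if it is a Modin DataFrame it takes about 30 minutes. Is there an equivalent of numpy.where for Modin DataFrames that would speed this operation up?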

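【Comments】:

Try: np.where(df['COL'].values != '', df['COL'].values, df['COL_X'].values)
@Nk03 I tried your suggestion, but it made no difference; it took 1 hour to complete. With NumPy and pandas it finishes in 4 minutes.
How long is [cols_to_be_coalesced]? If it is large, you should consider vectorizing it.
@Nk03 - The list contains 15-20 columns. Let me explain: I am merging 5 datasets one by one, and the operation above runs after each merge. The total record count after the 5 merges is about 120k. So after each merge, roughly 15 to 20 columns need to be coalesced with the code above. NumPy with pandas takes only 5 minutes, but with Modin it takes 50 minutes. Can you tell me how to vectorize this for Modin?
So does that mean the for loop runs 15-20 iterations? If you use multithreading/multiprocessing, you could speed this for loop up by roughly 15x.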
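
As an illustration of the vectorization raised in the comments, here is a minimal sketch that coalesces all listed columns in one call instead of looping over them. The sample data and the `_X` suffix for fallback columns are hypothetical, and the sketch uses plain pandas so that it runs without Modin; the thread does not confirm whether this form is faster under Modin.

```python
import pandas as pd

# Hypothetical sample data: empty strings mark missing values,
# and each column has a fallback column with an "_X" suffix.
df = pd.DataFrame({
    "a": ["1", "", "3"],
    "a_X": ["x1", "x2", "x3"],
    "b": ["", "5", ""],
    "b_X": ["y1", "y2", "y3"],
})
cols_to_be_coalesced = ["a", "b"]

primary = df[cols_to_be_coalesced]
# Rename the fallback columns so they align with the primary ones.
fallback = df[[f"{c}_X" for c in cols_to_be_coalesced]].copy()
fallback.columns = cols_to_be_coalesced

# Keep non-empty values, otherwise take the aligned fallback value,
# for all listed columns in a single call instead of a Python loop.
coalesced = primary.where(primary != "", fallback)
df[cols_to_be_coalesced] = coalesced
```

DataFrame.where aligns the fallback frame on both index and column labels, which is why the fallback columns are renamed first.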

Tags: python-3.x pandas dataframe modin

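【Solution 1】:

I think your np.where is slow because np.where converts the Modin DataFrame to a NumPy array, and converting a Modin DataFrame to NumPy is slow. Is this version, which uses pandas.Series.where (not a Modin implementation of where, since that has not been added yet), faster for you?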

for col in cols_to_be_coalesced:
    # assumes the fallback column for each name is called '<name>_X'
    df[col] = df[col].where(df[col] != '', df[f'{col}_X'])

In the following example, I found that this approach takes 1.58 seconds, while the original approach takes 70 seconds:

import modin.pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8))).add_prefix("col")
# setting column with np.where takes 70 seconds
df['col1'] = np.where(df['col1'] % 2 == 0, df['col1'], df['col2'])
# setting column with pandas.Series.where takes 1.58 seconds
df['col1'] = df['col1'].where(df['col1'] % 2 == 0, df['col2'])
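
Before swapping one form for the other, it may be worth checking that the two produce the same values. The sketch below does that on made-up data with plain pandas, so it runs without Modin; replacing the import with `import modin.pandas as pd` should behave the same way, but that is an assumption not verified here.

```python
import numpy as np
import pandas as pd

# Made-up data, small enough to inspect by eye.
df = pd.DataFrame({"col1": [0, 1, 2, 3], "col2": [10, 11, 12, 13]})

# Condition: keep col1 where it is even, otherwise fall back to col2.
keep = df["col1"] % 2 == 0

# np.where returns a NumPy array built from the two columns.
via_numpy = np.where(keep, df["col1"], df["col2"])
# Series.where stays inside the DataFrame library and returns a Series.
via_where = df["col1"].where(keep, df["col2"])

# True when both approaches agree element by element.
same = bool((via_numpy == via_where.to_numpy()).all())
```

Note that np.where yields an array while Series.where yields a Series with the original index, so only the values are compared.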

【Comments】: