Numpy equivalent of pandas replace (dictionary mapping): answers

【Question title】: Numpy equivalent of pandas replace (dictionary mapping)
【Posted】: 2021-08-12 09:11:43
【Question】:

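I know that working with numpy arrays can be faster than working with pandas.

I am wondering whether there is an equivalent (and faster) way to perform pandas.replace on a numpy array.

In the example below, I create a DataFrame and a dictionary. The dictionary contains the column names and their corresponding mappings. Is there any function that lets me apply the dictionary to a numpy array to do the mapping, with a faster processing time?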

import pandas as pd
import numpy as np

# Dataframe
d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data=d)

# dictionary I want to map
d_mapping = {'col1' : {1:2 , 2:1} ,  'col2' : {4:1}}

# result using pandas replace
print(df.replace(d_mapping))

# Instead of a pandas dataframe, I want to perform the same operation on a numpy array
df_np =  df.to_records(index=False)

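【Comments】:

Take a look at: stackoverflow.com/questions/16992713/…
Related. My gut feeling is that it will be hard to beat pandas with numpy.
@AnuragDabas Thanks! I did look at that question; it applies the same dictionary to the whole matrix. In my case, I want to use a different dictionary for each column.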

Tags: pandas numpy replace mapping

【Solution 1】:

You can try np.select(). I believe how well it performs depends on the number of unique elements to be replaced.

def replace_values(df, d_mapping):
    def replace_col(col):
        # extract numpy array and column name from pd.Series
        col, name = col.values, col.name
        # generate condlist and choicelist
        # for every key in mapping create a boolean mask
        condlist = [col == x for x in d_mapping[name].keys()]
        choicelist = list(d_mapping[name].values())
        # np.select uses col as the default, so values without a mapping are kept
        return np.select(condlist, choicelist, col)

    return df.apply(replace_col)

Usage:

replace_values(df, d_mapping)
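
Since the question asks about the record array produced by df.to_records(index=False), the same np.select idea can also be sketched on a structured array directly, without going through df.apply. This is a minimal sketch and not part of the original answer: replace_records is a hypothetical helper name, and the array is built by hand here as a stand-in for df_np.

```python
import numpy as np

def replace_records(rec, d_mapping):
    """Hypothetical helper: return a copy of a structured array with
    per-field replacements applied via np.select."""
    out = rec.copy()
    for name, mapping in d_mapping.items():
        if not mapping:
            # np.select rejects an empty condition list, so skip empty mappings
            continue
        col = rec[name]
        # one boolean mask per key to be replaced
        condlist = [col == key for key in mapping]
        choicelist = list(mapping.values())
        # values that match no key fall back to the original column
        out[name] = np.select(condlist, choicelist, default=col)
    return out

# Stand-in for df.to_records(index=False) from the question
df_np = np.array(
    [(1, 4), (2, 5), (3, 6)],
    dtype=[("col1", "i8"), ("col2", "i8")],
)
d_mapping = {"col1": {1: 2, 2: 1}, "col2": {4: 1}}

result = replace_records(df_np, d_mapping)
print(result["col1"], result["col2"])
```

Reading every mask from the untouched input rec, rather than from out, is what keeps the 1 -> 2 and 2 -> 1 swap from overwriting itself.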

I also believe you can speed up the code above by using lists/arrays instead of dicts in the mapping and replacing the keys() and values() calls with index lookups:

d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
...
# ... lookups and are also expensive
m = d_mapping[name]
condlist = [col == x for x in m[0]]
choicelist = m[1]
...
np.isin(col, m[0]),
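
The list comprehension above builds one boolean mask per key, so the work grows with the number of keys. A different technique, not taken from the answer, is a sorted lookup with np.searchsorted, which does a single pass per column. The sketch below assumes a non-empty key list with comparable, unique keys; replace_sorted is a hypothetical name.

```python
import numpy as np

def replace_sorted(col, keys, values):
    """Hypothetical helper: replace keys[i] with values[i] in col using a
    sorted lookup; elements that match no key are left unchanged."""
    keys = np.asarray(keys)
    values = np.asarray(values)
    # sort the keys once so searchsorted can locate candidates
    order = np.argsort(keys)
    keys_sorted, values_sorted = keys[order], values[order]
    # candidate position of each element among the sorted keys
    idx = np.searchsorted(keys_sorted, col)
    # elements larger than every key would index past the end, so clip
    idx = np.clip(idx, 0, len(keys_sorted) - 1)
    # only positions whose candidate key really equals the element are hits
    hit = keys_sorted[idx] == col
    return np.where(hit, values_sorted[idx], col)

col = np.array([1, 2, 3, 2, 7])
replaced = replace_sorted(col, [1, 2], [2, 1])
print(replaced)  # [2 1 3 1 7]
```

Whether this beats np.select for a handful of keys is untested here; it is mainly of interest when a mapping has many entries.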

Update:

Here is a benchmark:

import pandas as pd
import numpy as np

# Dataframe
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# dictionary I want to map
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
d_mapping_2 = {
    col: dict(zip(*replacement)) for col, replacement in d_mapping.items()
}


def replace_values(df, mapping):
    def replace_col(col):
        col, (m0, m1) = col.values, mapping[col.name]
        return np.select([col == x for x in m0], m1, col)

    return df.apply(replace_col)


from timeit import timeit

print("np.select: ", timeit(lambda: replace_values(df, d_mapping), number=5000))
print("df.replace: ", timeit(lambda: df.replace(d_mapping_2), number=5000))

On my 6-year-old laptop this prints:

np.select:  3.6562702230003197
df.replace:  4.714512745998945

np.select is roughly 20% faster on this 3-row DataFrame.
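
The benchmark above runs on a 3-row DataFrame, so the timings are likely dominated by per-call overhead rather than by the replacement itself. To see how the comparison behaves on larger input, it could be rerun along the following lines. The row count, the random data, and the repeat count are assumptions chosen for illustration, and no timings are claimed here.

```python
import numpy as np
import pandas as pd
from timeit import timeit

# Assumed test data: 100_000 rows drawn from the same value ranges as the example
rng = np.random.default_rng(0)
n_rows = 100_000
big_df = pd.DataFrame({
    "col1": rng.integers(1, 4, size=n_rows),  # values 1..3
    "col2": rng.integers(4, 7, size=n_rows),  # values 4..6
})

# Same mappings as in the benchmark above: list form and dict form
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
d_mapping_2 = {col: dict(zip(*pairs)) for col, pairs in d_mapping.items()}

def replace_values(df, mapping):
    def replace_col(col):
        values, (m0, m1) = col.values, mapping[col.name]
        # unmatched values fall back to the original column
        return np.select([values == x for x in m0], m1, values)

    return df.apply(replace_col)

# Both approaches should agree before their timings are compared
select_result = replace_values(big_df, d_mapping)
replace_result = big_df.replace(d_mapping_2)

print("np.select: ", timeit(lambda: replace_values(big_df, d_mapping), number=5))
print("df.replace:", timeit(lambda: big_df.replace(d_mapping_2), number=5))
```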

【Comments】: