Pandas lookup table using two arrays (index from the first array, column from the second array): answers

【Question title】: Pandas lookup table using two arrays (index from the first array, column from the second array)
【Posted】: 2021-12-31 14:20:17
【Question description】:

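I am struggling to vectorize code that uses a pandas lookup table, where the row (index label) is selected by the values in a first array and the column is selected by the values in a second array.

Suppose I have two numpy arrays a and b (they have the same shape):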

import numpy as np
import pandas as pd

codes = np.random.randint(1000, size=(4))
a_idx = np.random.randint(4, size=(6, 6))
a = codes[a_idx]
a[2, 1] = -999
a
Out[267]: 
array([[ 310,  310,   52,  310,  218,  310],
       [ 687,  310,  218,  310,  687,  687],
       [ 218, -999,  310,  218,   52,  687],
       [ 218,  218,  687,   52,  687,  310],
       [  52,  687,  687,   52,  687,  218],
       [  52,  218,   52,  687,  310,   52]])

b = np.random.randint(5, size=(6, 6))
b
Out[269]: 
array([[2, 4, 3, 2, 0, 4],
       [2, 4, 4, 2, 1, 0],
       [0, 0, 1, 1, 2, 0],
       [2, 2, 2, 2, 2, 1],
       [4, 1, 3, 1, 1, 2],
       [0, 3, 2, 2, 3, 0]])

I also have a pandas lookup table:

lookup = pd.DataFrame({'A': np.arange(1, 5),
                      'B': np.arange(11, 15),
                      'C': np.arange(21, 25)}, index=codes)
lookup.loc[-999] = 0
lookup
Out[275]: 
      A   B   C
 310  1  11  21
 687  2  12  22
 218  3  13  23
 52   4  14  24
-999  0   0   0

I have created a dictionary that maps the values of b to pandas column names (different numbers can map to the same letter):

b_dict = {0: 'A', 1: 'B', 2: 'C', 3: 'B', 4:'A'}

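I want to build a third array from the lookup table, where the row is selected by the value in array a and the column is selected by the value in array b (with the help of b_dict). This is how it works with nested for loops: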

res = np.empty_like(a)
for i, (row_a, row_b) in enumerate(zip(a, b)):
    for j, (aij, bij) in enumerate(zip(row_a, row_b)):
        res[i, j] = lookup.loc[aij, b_dict[bij]]

This would be the desired result:

res
Out[276]: 
array([[21,  1, 14, 21,  3,  1],
       [22,  1,  3, 21, 12,  2],
       [ 3,  0, 11, 13, 24,  2],
       [23, 23, 22, 24, 22, 11],
       [ 4, 12, 12, 14, 12, 23],
       [ 4, 13, 24, 22, 11,  4]])

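Is there a faster (vectorized) way to do this for large arrays using numpy or pandas? I would like to avoid the nested loops.

Edit: I changed the example to be closer to the actual problem.

【Question discussion】: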

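You should be able to just do lookup.values[a, b]. Search for numpy advanced indexing for more information.
@Psidom Your answer could help the OP and might point them in the right direction for related scenarios in the future.
@Psidom, thanks for the suggestion; it seems the example I gave was too simple. Unfortunately, my actual data is too complex for simple indexing. The index of the lookup table is not ordinal but consists of integer codes, so it looks more like 2, 156, 45, 893, 17,... Arrays a and b also contain nan values, which I replace with a single negative value...
@NinoKrvavica Then please update the question with that information and a slightly more complex example that is closer to the real scenario. We can assume that a, b and lookup have the same shape, right?
@HarryPlotter, thanks, I updated the example. a and b have the same shape, but lookup has a different shape.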

Tags: python arrays pandas dataframe numpy

【Solution 1】:

You can flatten both arrays, use DataFrame.lookup to perform a label-based lookup, and reshape the result to the original shape of a and b.

row_labels = a.ravel()
col_labels = pd.Series(b_dict)[b.ravel()].to_numpy()   
res = lookup.lookup(row_labels, col_labels).reshape(a.shape)

Using the same setup as in your example:

a = np.array([[ 310,  310,   52,  310,  218,  310],
              [ 687,  310,  218,  310,  687,  687],
              [ 218, -999,  310,  218,   52,  687],
              [ 218,  218,  687,   52,  687,  310],
              [  52,  687,  687,   52,  687,  218],
              [  52,  218,   52,  687,  310,   52]])

b = np.array([[2, 4, 3, 2, 0, 4],
              [2, 4, 4, 2, 1, 0],
              [0, 0, 1, 1, 2, 0],
              [2, 2, 2, 2, 2, 1],
              [4, 1, 3, 1, 1, 2],
              [0, 3, 2, 2, 3, 0]])

b_dict = {0: 'A', 1: 'B', 2: 'C', 3: 'B', 4:'A'}

lookup = pd.DataFrame({'A': [1, 2, 3, 4, 0], 
                       'B': [11, 12, 13, 14, 0], 
                       'C': [21, 22, 23, 24, 0]},
                       index=[310, 687, 218, 52, -999])

Output

>>> row_labels

array([ 310,  310,   52,  310,  218,  310,  687,  310,  218,  310,  687,
        687,  218, -999,  310,  218,   52,  687,  218,  218,  687,   52,
        687,  310,   52,  687,  687,   52,  687,  218,   52,  218,   52,
        687,  310,   52])

>>> col_labels

array(['C', 'A', 'B', 'C', 'A', 'A', 'C', 'A', 'A', 'C', 'B', 'A', 'A',
       'A', 'B', 'B', 'C', 'A', 'C', 'C', 'C', 'C', 'C', 'B', 'A', 'B',
       'B', 'B', 'B', 'C', 'A', 'B', 'C', 'C', 'B', 'A'], dtype=object)

>>> res

array([[21,  1, 14, 21,  3,  1],
       [22,  1,  3, 21, 12,  2],
       [ 3,  0, 11, 13, 24,  2],
       [23, 23, 22, 24, 22, 11],
       [ 4, 12, 12, 14, 12, 23],
       [ 4, 13, 24, 22, 11,  4]])

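Note that DataFrame.lookup is deprecated (since pandas 1.2, and it was removed in pandas 2.0), as stated in the docs.

As suggested in the official guide, a better option is to use pandas.factorize.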

row_idx, row_labels = pd.factorize(a.ravel())
col_idx, col_labels = pd.factorize(pd.Series(b_dict)[b.ravel()].to_numpy())

res = (
    lookup.reindex(columns=col_labels, index=row_labels)  # reorder rows and columns to match the factorized codes
          .to_numpy()[row_idx, col_idx]  # convert to a 2D NumPy array, then pick one value per (row, column) pair via fancy indexing
          .reshape(a.shape)  # reshape the resulting 1D array to the original shape of a and b
)

Output

>>> row_idx

array([0, 0, 1, 0, 2, 0, 3, 0, 2, 0, 3, 3, 2, 4, 0, 2, 1, 3, 2, 2, 3, 1,
       3, 0, 1, 3, 3, 1, 3, 2, 1, 2, 1, 3, 0, 1])

>>> row_labels

array([ 310,   52,  218,  687, -999])

>>> col_idx

array([0, 1, 2, 0, 1, 1, 0, 1, 1, 0, 2, 1, 1, 1, 2, 2, 0, 1, 0, 0, 0, 0,
       0, 2, 1, 2, 2, 2, 2, 0, 1, 2, 0, 0, 2, 1])

>>> col_labels

array(['C', 'A', 'B'], dtype=object)

>>> res

array([[21,  1, 14, 21,  3,  1],
       [22,  1,  3, 21, 12,  2],
       [ 3,  0, 11, 13, 24,  2],
       [23, 23, 22, 24, 22, 11],
       [ 4, 12, 12, 14, 12, 23],
       [ 4, 13, 24, 22, 11,  4]])
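
For reference, the same label-to-position translation can also be sketched with Index.get_indexer instead of pd.factorize. This is only an illustrative alternative, not the method from the answer above. It assumes that the index of lookup is unique and that the keys of b_dict are the consecutive integers 0..n-1; the names row_pos, col_table and col_pos are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Same data as in the example above.
a = np.array([[ 310,  310,   52,  310,  218,  310],
              [ 687,  310,  218,  310,  687,  687],
              [ 218, -999,  310,  218,   52,  687],
              [ 218,  218,  687,   52,  687,  310],
              [  52,  687,  687,   52,  687,  218],
              [  52,  218,   52,  687,  310,   52]])

b = np.array([[2, 4, 3, 2, 0, 4],
              [2, 4, 4, 2, 1, 0],
              [0, 0, 1, 1, 2, 0],
              [2, 2, 2, 2, 2, 1],
              [4, 1, 3, 1, 1, 2],
              [0, 3, 2, 2, 3, 0]])

b_dict = {0: 'A', 1: 'B', 2: 'C', 3: 'B', 4: 'A'}

lookup = pd.DataFrame({'A': [1, 2, 3, 4, 0],
                       'B': [11, 12, 13, 14, 0],
                       'C': [21, 22, 23, 24, 0]},
                      index=[310, 687, 218, 52, -999])

# Translate row labels into positional row indices.
# get_indexer returns -1 for labels that are not in the index.
row_pos = lookup.index.get_indexer(a.ravel())

# Small table: position k holds the column position for b value k.
# Assumption: the keys of b_dict are exactly 0..len(b_dict)-1.
col_table = lookup.columns.get_indexer([b_dict[k] for k in range(len(b_dict))])
col_pos = col_table[b.ravel()]

# Fail loudly instead of silently reading the last row/column via -1.
if (row_pos < 0).any() or (col_table < 0).any():
    raise KeyError("some labels are missing from the lookup table")

# Positional fancy indexing on the underlying 2D array, then reshape.
res = lookup.to_numpy()[row_pos, col_pos].reshape(a.shape)
print(res)
```

The explicit check for -1 matters here because NumPy would otherwise treat -1 as "last row" or "last column" and return a wrong value without raising an error.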

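【Discussion】:

Thanks, this is exactly what I was looking for, and it works well. Unfortunately, I get a warning that the lookup method is deprecated. As suggested, I tried using DataFrame.melt and DataFrame.loc, but without success. Any ideas on how to do this without the deprecated lookup?
@NinoKrvavica You are right! I updated the answer with the approach suggested in the official guide.
@NinoKrvavica Does that solve your problem?
Thanks, yes, this solved the problem!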

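【Solution 2】:

As pointed out in the comments, lookup.values[a, b] produces exactly the same result as res after the loop has filled it. This holds when a and b contain positional (0-based) row and column indices, as in the original, simpler version of the question; the output below appears to come from that earlier example.

So essentially, instead of this: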

res = np.empty_like(a)
for i, (row_a, row_b) in enumerate(zip(a, b)):
    for j, (aij, bij) in enumerate(zip(row_a, row_b)):
        res[i, j] = lookup.loc[aij, b_dict[bij]]

you can do this:

res = lookup.values[a, b]

Output:

>>> res
array([[ 4, 21,  4],
       [ 3,  4, 23],
       [23,  2, 11],
       [ 1, 13,  4]])

>>> lookup.values[a, b]
array([[ 4, 21,  4],
       [ 3,  4, 23],
       [23,  2, 11],
       [ 1, 13,  4]])

>>> res == lookup.values[a, b]
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])
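
To illustrate when plain positional indexing is applicable, here is a minimal sketch with made-up data (the arrays rows, cols and labels are hypothetical and not taken from the question). It shows that indexing the underlying array works when the arrays hold 0-based positions, and that it fails when they hold index labels such as the integer codes described in the question's comments.

```python
import numpy as np
import pandas as pd

# Made-up lookup table whose index holds integer codes, not positions.
lookup = pd.DataFrame({'A': [1, 2, 3, 4],
                       'B': [11, 12, 13, 14],
                       'C': [21, 22, 23, 24]},
                      index=[310, 687, 218, 52])

# Positional case: entries are 0-based row and column numbers.
rows = np.array([[3, 0],
                 [2, 1]])
cols = np.array([[0, 2],
                 [2, 1]])
positional = lookup.to_numpy()[rows, cols]
print(positional)

# Label case: entries are index labels (codes), so NumPy sees them as
# out-of-range positions and raises IndexError.
labels = np.array([[52, 310],
                   [218, 687]])
try:
    lookup.to_numpy()[labels, cols]
    label_indexing_failed = False
except IndexError:
    label_indexing_failed = True
print(label_indexing_failed)
```

In the label case, the labels first have to be translated into positions (or the lookup has to be label-based) before NumPy advanced indexing can be used.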

【Discussion】: