根据其他 2 个值索引 DataFrame 的更好解决方案答案

【问题标题】：Better solution to index a DataFrame according to the values of 2 others根据其他 2 个值索引 DataFrame 的更好解决方案
【发布时间】：2020-12-26 01:07:54
【问题描述】：

我想根据其他 2 个的值（列的 val1_df 和行的 val2_df）的值来索引 DataFrame (aaxx_df)。

>>> val1_df = pd.DataFrame({"A": ["aa", "bb"], "B": ["bb", "aa"]})

    A   B
0  aa  bb
1  bb  aa

>>> val2_df = pd.DataFrame({"A": ["xx", "yy"], "B": ["zz", "yy"]})

    A   B
0  xx  zz
1  yy  yy

>>> aaxx_df = pd.DataFrame(
...     {"aa": [10, 20, 30], "bb": [-10, -20, -30]}, 
...     index=["xx", "yy", "zz"]
... )

    aa  bb
xx  10 -10
yy  20 -20
zz  30 -30

>>> comb(val1_df, val2_df, aaxx_df)

      A     B
0  10.0 -30.0
1 -20.0  20.0

我在下面放了一个适合我的问题的解决方案，但我想，一定有一些更干净的解决方案，可能通过 SQL（在我看来这与关系数据库问题非常相似）。

def comb(df1, df2, link_df):

    comb_df = df1.copy()
    
    for val in link_df:
        tmp_df = df2.copy()
        tmp_df = tmp_df.applymap(lambda x: link_df.loc[x, val])
        comb_df = comb_df.replace({val: np.nan}).combine_first(tmp_df)

    return comb_df

【问题讨论】：

标签： python sql pandas dataframe combinations

【解决方案1】：

如果您想使用 val2_df (rows) 和 val1_df (cols) 中的值对 aaxx_df 进行索引，请使用 lookup:

vals = aaxx_df.lookup(val2_df.values.flatten(), val1_df.values.flatten())
    
res = pd.DataFrame(data=vals.reshape(val1_df.shape), columns=val1_df.columns)
print(res)

输出

    A   B
0  10 -30
1 -20  20

自 2020 年 12 月 26 日起，在 pandas 1.2 中，lookup 方法已弃用。这是作为函数的最小版本：

def lookup(df, row_labels, col_labels) -> np.ndarray:
    """
    Label-based "fancy indexing" function for DataFrame.
    Given equal-length arrays of row and column labels, return an
    array of the values corresponding to each (row, col) pair.

    Parameters
    ----------
    row_labels : sequence
        The row labels to use for lookup.
    col_labels : sequence
        The column labels to use for lookup.

    Returns
    -------
    numpy.ndarray
        The found values.
    """
    row_idx = df.index.get_indexer(row_labels)
    col_idx = df.columns.get_indexer(col_labels)
    flat_index = row_idx * len(df.columns) + col_idx
    return df.values.flat[flat_index]

代码变成这样：

vals = lookup(aaxx_df, val2_df.values.flatten(), val1_df.values.flatten())
    
res = pd.DataFrame(data=vals.reshape(val1_df.shape), columns=val1_df.columns)
print(res)

【讨论】：

确实，索引 aaxx_df 是表达我的问题的更好方式，我会更新问题。
自昨天（12 月 26 日）使用 pandas 1.2 以来，不幸的是，不推荐使用查找方法。所以我用最小版本的方法作为函数更新了答案。

【解决方案2】：

让我们做一行

out = (
    pd.concat([val1_df, val2_df])
    .groupby(level=0)
    .agg(tuple)
    .applymap(lambda x: aaxx_df.loc[x[1], x[0]])
)

    A   B
0  10 -30
1 -20  20

更新

s = aaxx_df.stack().reindex(
    pd.MultiIndex.from_arrays([val2_df.stack(), val1_df.stack()])
)
val1_df[:] = s.values.reshape(val1_df.shape)

    A   B
0  10 -30
1 -20  20

【讨论】：

这个解决方案也更干净，但不幸的是它和我的一样慢。
感谢@BENY 的努力 :) 它从 4.5 毫秒变为 1.9 毫秒，但 Dani 的回答要快一个数量级：230 µs

【解决方案3】：

此代码不太简洁，但创建字典并直接替换字典值可能比使用 lookup 更快。此外，您可以使用行+列组合的唯一字符串而不是元组来简化代码：

a = val2_df.values + '|' + val1_df.values.tolist()
rows = np.repeat(aaxx_df.index, len(aaxx_df.columns)).tolist()
cols = np.tile(aaxx_df.columns, len(aaxx_df)).tolist()
vals = aaxx_df.values.ravel().tolist()  # .tolist() makes code fasterfor dictionary conversion from numpy array
b = dict(zip(f'{rows}|{cols}', vals)) #this line of code is the bottleneck on this small dataframe, but it may have a high fixed cost and lower variable cost than lookup.
pd.DataFrame(a, columns=val1_df.columns).replace(b)

    A   B
0  10 -30
1 -20  20

有关创建“b”的代码，请在此处查看 Jezrael 的类似答案：https://stackoverflow.com/a/50505726/6366770

【讨论】：