Join pandas DataFrames on available non-null keys

【Question title】: Join pandas DataFrames on available non-null keys
【Posted】: 2021-05-19 15:45:09
【Question description】:

I have a "base" DataFrame a, which contains an identifier seq and the join keys, and a "values" DataFrame b that will be merged onto it:

import numpy as np
import pandas as pd

a = pd.DataFrame({
    'id':range(11),
    'seq':[
        1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5],
    'class':[
        'alpha', 'beta', 'gaga', np.nan, np.nan, np.nan, np.nan, np.nan,
        np.nan, 'beta', 'beta'],
    'style':[
        'x', 'x', 'x', 'y', 'y', np.nan, np.nan, np.nan, np.nan, np.nan,
        np.nan],
    'drama':[
        'no', 'no', 'no', 'yes', 'oh yes', 'no', 'yes', 'oh yes', np.nan,
        'yes', 'oh yes']})

b = pd.DataFrame({
    'class':[
        'gaga', 'alpha', 'alpha', 'alpha', 'alpha', 'alpha', 'beta', 'beta',
        'alpha', 'gaga', 'beta', 'beta', 'alpha', 'beta', 'gaga', 'alpha',
        'beta', 'alpha', 'beta', 'gaga', 'gaga', 'beta', 'beta', 'beta',
        'gaga'],
    'style':[
        'y', 'y', 'y', 'x', 'x', 'x', 'y', 'x', 'x', 'x', 'y', 'y', 'y', 'y',
        'x', 'y', 'y', 'y', 'y', 'x', 'x', 'x', 'y', 'x', 'y'],
    'drama':[
        'yes', 'no', 'no', 'no', 'no', 'oh yes', 'oh yes', 'oh yes', 'oh yes',
        'no', 'yes', 'oh yes', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'oh yes',
        'oh yes', 'oh yes', 'no', 'oh yes', 'yes', 'yes'],
    'start':[
        838, 727, 700, 840, 530, 507, 871, 585, 120, 164, 562, 750, 953, 733,
        337, 307, 277, 972, 3, 805, 539, 600, 8, 382, 147],
    'end':[
        198, 328, 591, 427, 151, 126, 132, 149, 856, 725, 608, 726, 178, 521,
        316, 154, 633, 4, 113, 881, 258, 32, 354, 259, 958]})

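What I would like to do is create a function that performs a kind of "recursive" join against b, joining each row only on the columns that are non-null for that row. In the case of a:

where class, style and drama are all non-null, the join uses all three keys;
where class is null, the join uses only style and drama;
where class and style are both null, the join uses only drama;
where all three columns are null, the row is "joined" against the entire values DataFrame b;
the null columns are not necessarily the same ones: for example, if style is null, the join uses class and drama.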
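The rules above amount to classifying each row of a by which key columns are non-null. A minimal sketch of that classification follows; the small frame demo and the name pattern are hypothetical and only serve as an illustration, they are not part of the question's data.

```python
import numpy as np
import pandas as pd

keys = ['class', 'style', 'drama']

# Hypothetical 4-row frame, one row per kind of null pattern
demo = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'class': ['alpha', np.nan, np.nan, np.nan],
    'style': ['x', 'y', np.nan, np.nan],
    'drama': ['no', 'yes', 'no', np.nan],
})

# Boolean matrix: True where a key is available (non-null)
available = demo[keys].notna().to_numpy()

# For each row, the tuple of key columns it can be joined on
pattern = [tuple(k for k, ok in zip(keys, row) if ok) for row in available]
print(pattern)
```

Rows that share the same tuple can be merged together in a single call, and an empty tuple marks the rows that have to be paired with all of b.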

In terms of results, for this example the output would be the same as manually performing this not-so-smart operation:

ll = []
x, y, z, w, v = a.loc[:2], a.loc[3:4], a.loc[5:7], a.loc[8:8], a.loc[9:]

ll.append(x.merge(b, on=['class', 'style', 'drama']))
ll.append(y.drop(columns='class').merge(b, on=['style', 'drama']))
ll.append(z.drop(columns=['class', 'style']).merge(b, on='drama'))
ll.append(v.drop(columns='style').merge(b, on=['class', 'drama']))

# For all null values, get the entire DataFrame
# (assign returns copies, so neither w nor b is modified in place)
ll.append(w
    .drop(columns=['class', 'style', 'drama'])
    .assign(placeholder=1)
    .merge(b.assign(placeholder=1), on='placeholder')
    .drop(columns='placeholder'))

result = pd.concat(ll)
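For the all-null group, the placeholder column emulates a Cartesian product. Assuming pandas 1.2 or later, merge(how='cross') should express the same thing directly. A minimal sketch with two hypothetical frames, left and right:

```python
import pandas as pd

# Hypothetical stand-ins for the all-null rows of a (left) and for b (right)
left = pd.DataFrame({'id': [8], 'seq': [4]})
right = pd.DataFrame({'class': ['alpha', 'beta', 'gaga'],
                      'start': [1, 2, 3]})

# Cartesian product: every row of left is paired with every row of right
crossed = left.merge(right, how='cross')
print(crossed)
```

No helper column has to be added or dropped this way, and neither input frame is modified.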

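However, the manual approach is only possible here because I already know beforehand how to isolate the "groups" (x, y, z, w and v) and which columns to use in the merge for each of them.

I have a half-working implementation with very limited usability, which also handles the columns in what I consider a subpar way: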

def recjoin(base: pd.DataFrame, other: pd.DataFrame, keys: list) -> pd.DataFrame:
    # Accept a single key given as a string
    if isinstance(keys, str):
        keys = [keys]

    # Columns of base that contain at least one null value
    missing_cols = base.columns[base.isnull().any()]

    if len(missing_cols) == 0:
        result = base.merge(other, on=keys)
    else:
        # Rows with every key present are merged on all keys
        nonmissing = base.dropna(subset=keys)
        result_nonmissing = nonmissing.merge(other, on=keys)

        # Remaining rows are merged on the keys that have no nulls at all
        id_missing = base.index.difference(nonmissing.index)
        missing = base.loc[id_missing].drop(columns=missing_cols)

        alt_keys = list(pd.Index(keys).difference(missing_cols))
        result_missing = missing.merge(other, on=alt_keys)

        result = pd.concat([result_nonmissing, result_missing])

    return result

This works if a.loc[:4] is passed as base, but not if it is a.loc[:7], because in the second case the number of NaN columns varies from row to row:

In [1]: a.loc[:4]
Out[1]:
   id  seq  class style   drama
0   0    1  alpha     x      no
1   1    1   beta     x      no
2   2    1   gaga     x      no
3   3    2    NaN     y     yes
4   4    2    NaN     y  oh yes

In [2]: a.loc[:7]
Out[2]:
   id  seq  class style   drama
0   0    1  alpha     x      no
1   1    1   beta     x      no
2   2    1   gaga     x      no
3   3    2    NaN     y     yes
4   4    2    NaN     y  oh yes
5   5    3    NaN   NaN      no
6   6    3    NaN   NaN     yes
7   7    3    NaN   NaN  oh yes
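The difference between the two cases comes from detecting nulls per column rather than per row. The sketch below uses a hypothetical three-row frame base (a reduced version of a.loc[:7]) to show the two views side by side.

```python
import numpy as np
import pandas as pd

keys = ['class', 'style', 'drama']

# Hypothetical reduced frame: one complete row, one with class missing,
# and one with both class and style missing
base = pd.DataFrame({
    'id': [2, 3, 5],
    'class': ['gaga', np.nan, np.nan],
    'style': ['x', 'y', np.nan],
    'drama': ['no', 'yes', 'no'],
})

# Column-level view (what recjoin uses): any column with at least one NaN
missing_cols = list(base.columns[base.isnull().any()])
print(missing_cols)              # ['class', 'style']

# Row-level view: how many keys are actually missing in each row
per_row_missing = base[keys].isna().sum(axis=1).tolist()
print(per_row_missing)           # [0, 1, 2]
```

The row with id 3 has a valid style, but dropping every column in missing_cols removes style for that row as well, so it would end up merged on drama alone.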

What would be the best approach in this case that does not fall back to an iterrows solution?

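【Question comments】:

How big are the two dataframes in the real case?
a has roughly 1000 rows, whereas b usually has 600k rows. So far, the largest output I have obtained with my "bad" implementation is 3.6 million rows, but the average is about 700k rows.

Tags: python pandas join

【Solution 1】:

Here is an idea: use itertools.combinations to create all possible combinations, of every length, of the columns to use in the merge. Then, in each loop iteration, select the rows of a that are NaN in all the columns to drop and non-NaN in all the columns to merge on (what you did manually in your example above). Finally, concat all the dataframes.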

from itertools import combinations

# define the columns that may be used as join keys
# (a list rather than a set, so the iteration order is deterministic)
cols = ['class', 'style', 'drama']
l = []
for i in range(len(cols) + 1):
    for comb in combinations(cols, i):
        cols_drop = list(comb)
        cols_merge = [c for c in cols if c not in comb]
        # get all the rows with nan for all columns to drop
        # and notna for the columns to merge
        m = a[cols_merge].notna().all(axis=1) & a[cols_drop].isna().all(axis=1)
        # print(cols_drop, cols_merge)  # uncomment to understand what is happening
        # print(a[m])                   # on row selections
        l.append(a[m].drop(columns=cols_drop).assign(placeholder=1)
                     .merge(b.assign(placeholder=1), on=cols_merge + ['placeholder']))

res = (
    pd.concat(l, ignore_index=True)
      .drop('placeholder', axis=1)
)

You get

print(res)
    id  seq  class style   drama  start  end
0    0    1  alpha     x      no    840  427
1    0    1  alpha     x      no    530  151
2    1    1   beta     x      no    600   32
3    2    1   gaga     x      no    164  725
4    3    2   gaga     y     yes    838  198
5    3    2   beta     y     yes    562  608
6    3    2  alpha     y     yes    307  154
7    3    2  alpha     y     yes    972    4
8    3    2   gaga     y     yes    147  958
9    4    2   beta     y  oh yes    871  132
10   4    2   beta     y  oh yes    750  726
11   4    2   beta     y  oh yes      3  113
12   4    2   beta     y  oh yes      8  354
13   9    5   beta     y     yes    562  608
14   9    5   beta     x     yes    382  259
...
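As a variation on the same idea, the sketch below visits only the null patterns that actually occur in the base frame instead of enumerating every combination. It is an illustration under stated assumptions, not the answer's code: the function name join_on_available and the small frames base and other are hypothetical, and how='cross' assumes pandas 1.2 or later.

```python
import numpy as np
import pandas as pd

def join_on_available(base, other, keys):
    """Merge each group of rows in base on the keys that are non-null for it."""
    notna = base[keys].notna()
    flags_matrix = notna.to_numpy()
    pieces = []
    # One iteration per distinct null pattern present in base
    for flags in notna.drop_duplicates().to_numpy():
        on = [k for k, ok in zip(keys, flags) if ok]
        drop = [k for k in keys if k not in on]
        # Rows whose pattern equals the current one
        mask = (flags_matrix == flags).all(axis=1)
        rows = base[mask].drop(columns=drop)
        if on:
            pieces.append(rows.merge(other, on=on))
        else:
            # No key available: pair the rows with all of other
            pieces.append(rows.merge(other, how='cross'))
    return pd.concat(pieces, ignore_index=True)

base = pd.DataFrame({
    'id': [0, 1, 2],
    'class': ['alpha', np.nan, np.nan],
    'style': ['x', 'x', np.nan],
})
other = pd.DataFrame({
    'class': ['alpha', 'beta', 'alpha'],
    'style': ['x', 'x', 'y'],
    'start': [10, 20, 30],
})

out = join_on_available(base, other, ['class', 'style'])
print(out)
```

With three key columns there are at most eight patterns either way, so the difference mostly matters when the number of candidate keys grows. The sketch does not handle an empty base frame, for which pd.concat would receive an empty list.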

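【Comments】:

I really should learn more about itertools; I am always surprised at how many complex solutions can be achieved with the PSL (Python Standard Library), especially this package.
@manoelpqueiroz Yes, itertools has a lot of cool methods :)

【Solution 2】:

The solution is based on evaluating query strings, so we need to rename the column class, since class is a reserved word in Python: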

a = a.rename(columns={"class": "klass"})
b = b.rename(columns={"class": "klass"})

ACOLS = ["klass", "style", "drama"]
BCOLS = ["start", "end"]

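To match your rules, a null column has to match every value: so we replace all NaN values with .*, then build a query string qs for each row of a.

We query dataframe b to get, for each row of a, the list of rows that match the filter, and build a new dataframe from those lists before joining it to dataframe a: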

# "klass.str.contains('^{}$') & style.str.contains('^{}$') & drama.str.contains('^{}$')"
QUERY = " & ".join(f"{c}.str.contains('^{{}}$')" for c in ACOLS)

qs = a[ACOLS].fillna(".*") \
             .apply(lambda c: QUERY.format(*c.tolist()), axis="columns")

data = qs.apply(lambda q: b.query(q, engine="python")[BCOLS].values).explode()
data = pd.DataFrame(data.tolist(), index=data.index, columns=BCOLS)

out = a.join(data).reset_index(drop=True)

>>> out
    id  seq  klass style   drama  start  end
0    0    1  alpha     x      no    840  427
1    0    1  alpha     x      no    530  151
2    1    1   beta     x      no    600   32
3    2    1   gaga     x      no    164  725
4    3    2    NaN     y     yes    838  198
..  ..  ...    ...   ...     ...    ...  ...
65  10    5   beta   NaN  oh yes    871  132
66  10    5   beta   NaN  oh yes    585  149
67  10    5   beta   NaN  oh yes    750  726
68  10    5   beta   NaN  oh yes      3  113
69  10    5   beta   NaN  oh yes      8  354

[70 rows x 7 columns]
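The `.*` placeholder works because each value is wrapped in ^ and $ and evaluated as a regular expression. A minimal sketch of that behavior on a hypothetical Series klass; the escaping step at the end is an assumption about keys that might contain regex metacharacters, which the answer itself does not cover.

```python
import re
import pandas as pd

# Hypothetical key column
klass = pd.Series(['alpha', 'beta', 'gaga'])

# '.*' stands in for a missing key, so '^.*$' matches every value
match_all = klass.str.contains('^.*$').tolist()
print(match_all)      # [True, True, True]

# A concrete value anchored with ^ and $ matches only itself
match_beta = klass.str.contains('^beta$').tolist()
print(match_beta)     # [False, True, False]

# Assumption: keys containing regex metacharacters should be escaped first
pattern = '^{}$'.format(re.escape('a.c'))
match_escaped = pd.Series(['a.c', 'abc']).str.contains(pattern).tolist()
print(match_escaped)  # [True, False]
```

Without the escaping, a key such as a.c would also match abc, because the dot is a wildcard in a regular expression.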

Example of a query string:

>>> a.loc[5, ACOLS]
klass    NaN
style    NaN
drama     no
Name: 5, dtype: object

>>> qs.loc[5]
"klass.str.contains('^.*$') & style.str.contains('^.*$') & drama.str.contains('^no$')"

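【Comments】:

Wow! I did not even know about this DataFrame feature; it feels like it makes pandas far more versatile. But shouldn't the join operation effectively "enlarge" the original dataframe? It seems your result only brings in the first matching row, and also shifts the values by one row from the second row onward: "beta-x-no" should have 600, but that value is on "gaga-x-no" (which in turn should have 164, but that value is on "y-yes").
@manoelpqueiroz. I fixed my mistake; could you evaluate the solution when you have time?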