Suppose we have the following two DataFrames:
import pandas as pd
import numpy as np

# First DataFrame: a single integer id column.
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_A": ["test", "test_", "test__"],
    }
)

# Second DataFrame: the id is spread across three columns.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df2 = pd.DataFrame(
    {
        "id_name": [1, np.nan, np.nan],
        "id_surname": [np.nan, 2, np.nan],
        "id_first_name": [np.nan, np.nan, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_B": ["check", "check_", "check__"],
    }
)
The second DataFrame looks like this:
   id_name  id_surname  id_first_name month  year column_B
0      1.0         NaN            NaN   Jan  2022    check
1      NaN         2.0            NaN   Mar  2020   check_
2      NaN         NaN            3.0   Apr  2021  check__
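You can create a new column id in the second DataFrame by collecting the non-NaN values from the three columns id_name, id_surname, and id_first_name. Start with id_name, fill its NaN values with the non-NaN values of id_surname, then fill whatever NaN values remain with the non-NaN values of id_first_name. The code for this is: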
df2["id"] = df2["id_name"].fillna(df2["id_surname"]).fillna(df2["id_first_name"])
This creates the column id in df2:
   id_name  id_surname  id_first_name month  year column_B   id
0      1.0         NaN            NaN   Jan  2022    check  1.0
1      NaN         2.0            NaN   Mar  2020   check_  2.0
2      NaN         NaN            3.0   Apr  2021  check__  3.0
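With more than a few candidate columns, chaining fillna calls gets long. Below is a minimal sketch of the same coalescing idea, assuming the candidate columns are listed in priority order. The id_cols list and the small example frame are illustrative names, not part of the original answer. It back-fills across each row and keeps the first column, which then holds the first non-NaN value per row.

```python
import numpy as np
import pandas as pd

# Illustrative frame with the same id columns as df2 above.
df2 = pd.DataFrame(
    {
        "id_name": [1, np.nan, np.nan],
        "id_surname": [np.nan, 2, np.nan],
        "id_first_name": [np.nan, np.nan, 3],
    }
)

# Hypothetical helper list: candidate columns in priority order.
id_cols = ["id_name", "id_surname", "id_first_name"]

# Back-fill along each row so the first non-NaN value lands in the
# leftmost column, then keep only that column.
df2["id"] = df2[id_cols].bfill(axis=1).iloc[:, 0]

# Optional: cast to int so the dtype matches an integer id elsewhere.
# Assumes every row has at least one non-NaN id; otherwise astype(int) raises.
df2["id"] = df2["id"].astype(int)

print(df2["id"].tolist())  # [1, 2, 3]
```

The result matches the chained fillna version for this data. The integer cast is optional and only safe when no row is missing all three ids.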
Finally, you can merge the two DataFrames like this:
merged = pd.merge(
    df1,
    df2,
    # The key columns have the same names on both sides, so `on` is enough.
    on=["id", "month", "year"],
    how="left",
)
The result will be:
   id month  year column_A  id_name  id_surname  id_first_name column_B
0   1   Jan  2022     test      1.0         NaN            NaN    check
1   2   Mar  2020    test_      NaN         2.0            NaN   check_
2   3   Apr  2021   test__      NaN         NaN            3.0  check__
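If you want to confirm that every row of df1 found a partner, one option is to use the indicator and validate parameters of pd.merge. This is a sketch under the assumption that each (id, month, year) combination appears at most once on each side. The trimmed-down frames and the names checked and unmatched are only for illustration.

```python
import pandas as pd

# Trimmed-down versions of the two frames, with df2 already carrying "id".
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_A": ["test", "test_", "test__"],
    }
)
df2 = pd.DataFrame(
    {
        "id": [1.0, 2.0, 3.0],  # float, as produced by the fillna step
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_B": ["check", "check_", "check__"],
    }
)

checked = pd.merge(
    df1,
    df2,
    on=["id", "month", "year"],
    how="left",
    indicator=True,         # adds a "_merge" column: left_only / right_only / both
    validate="one_to_one",  # raises MergeError if keys are duplicated on either side
)

# Rows of df1 with no matching row in df2.
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched))  # 0
```

Here every row is marked both, so unmatched is empty. With real data, a non-empty unmatched frame points at keys that are missing from df2 or that differ in type or spelling.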