Merge 2 data frames by multiple columns, keeping a row if at least one column matches

【Question title】: Merge 2 data frames by multiple columns, keep a row if there is a match in at least one column
【Posted】: 2023-03-22 00:40:02
【Question】:

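I have 2 data frames, df_1 and df_2. They have 3 columns in common: permno, cusip, and ticker. Each row of df_1 is a unique stock. The permno, cusip, and ticker in df_1 are used to identify stock returns in df_2. Sometimes one or two of these variables are unavailable, but in every row at least one of the three is available, and I will use that value to look up the return in df_2.

Could you suggest a (fast) way to merge df_1 and df_2 that keeps a row whenever there is a match in at least one of the three columns permno, cusip, or ticker?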

df_1

id  permno  cusip  ticker
1   1       11     AA
2   NA      12     NA
3   2       13     NA
4   5       NA     NA

df_2

permno  cusip  ticker  return  date
1       11     NA      100     date_1
7       15     BX      102     date_2
2       NA     CU      103     date_3

Desired result

id  permno  cusip  ticker  return  date
1   1       11     AA      100     date_1
1   1       11     NA      100     date_1
3   2       13     NA      103     date_3
3   2       NA     CU      103     date_3

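【Comments】:

I tried merge(df_1, df_2), but it did not produce the result I want. I also tried merge(df_1, df_2, all.x = TRUE).
I want to get the return (in df_2) for each stock (in df_1). I can use permno, cusip, or ticker to join the two tables. Within df_1, each of these variables is unique per row and can be used to refer to a unique stock. But sometimes some of the three variables (permno, cusip, and ticker) are unavailable (at least one of them is always available, though).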
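
A note on why plain merge() likely comes back empty here: called without a `by` argument, it joins on every column name the two data frames share, so a row survives only when permno, cusip, and ticker all agree at once. The sketch below illustrates this on the example data. The data.frame() calls are a reconstruction of the tables in the question (column types are assumed, and the dates are placeholder strings), and the names `shared`, `strict`, and `by_permno` are introduced only for illustration.

```r
# Reconstruction of the example tables from the question (assumed column types)
df_1 <- data.frame(
  id     = 1:4,
  permno = c(1, NA, 2, 5),
  cusip  = c(11, 12, 13, NA),
  ticker = c("AA", NA, NA, NA),
  stringsAsFactors = FALSE
)
df_2 <- data.frame(
  permno = c(1, 7, 2),
  cusip  = c(11, 15, NA),
  ticker = c(NA, "BX", "CU"),
  return = c(100, 102, 103),
  date   = c("date_1", "date_2", "date_3"),
  stringsAsFactors = FALSE
)

# With no `by`, merge() uses all shared column names as the join key
shared <- intersect(names(df_1), names(df_2))
print(shared)

# Inner join on all three keys at once: no row agrees on all of them
strict <- merge(df_1, df_2)
print(nrow(strict))

# Joining on a single key keeps the rows that agree on that key alone
by_permno <- merge(df_1, df_2, by = "permno")
print(by_permno$id)
```

Joining on one key at a time finds the matches, but each call only covers one column, which is why an "at least one column" rule needs something beyond a single merge() call.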

Tags: r merge

【Solution 1】:

This should work.

# define the columns common to both data frames
colmatch <- c("permno", "cusip", "ticker")

# Trim data frame df1 down to the rows that match df2 in at least one
# common column, and append the columns of df2 that are not common columns
simplify <- function(df1, df2, col = colmatch) {
  # for each common column, position in df2 of the first match for each df1 row
  # (incomparables = NA keeps missing values from matching each other);
  # cbind() keeps idx a matrix even when df1 has a single row
  idx <- do.call(cbind, lapply(col, function(x)
    match(df1[[x]], df2[[x]], incomparables = NA)
  ))

  # rows of the first data frame with at least one match
  idx1 <- which(apply(idx, 1, function(x) !all(is.na(x))))

  # corresponding rows of the second data frame (first matching column wins);
  # drop = FALSE keeps a matrix when only one row matched
  idx2 <- apply(idx[idx1, , drop = FALSE], 1, function(x) x[min(which(!is.na(x)))])

  # copy the non-key columns of the second data frame onto the matched rows
  # of the first, and return the result
  cbind(df1[idx1, ], df2[idx2, !(names(df2) %in% col), drop = FALSE])
}


# assemble the final output
df_final <- rbind(simplify(df_1, df_2),  # find df_1 rows with matches in df_2
                  simplify(df_2, df_1))  # and vice versa

Final output (sorted by id, if you prefer):

> df_final[order(df_final$id), ]
   id permno cusip ticker return   date
1   1      1    11     AA    100 date_1
11  1      1    11   <NA>    100 date_1
3   3      2    13   <NA>    103 date_3
31  3      2    NA     CU    103 date_3
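
The step that does the real work in the function above is match() with incomparables = NA. A small sketch of what that argument changes, using the permno and cusip columns from the question as plain vectors (the vector names here are made up for illustration and are not part of the answer):

```r
# permno columns from the question's example tables
permno_1 <- c(1, NA, 2, 5)
permno_2 <- c(1, 7, 2)
# cusip columns; both sides contain an NA
cusip_1 <- c(11, 12, 13, NA)
cusip_2 <- c(11, 15, NA)

# Position in the second vector of each element of the first (NA = no match)
m_permno <- match(permno_1, permno_2, incomparables = NA)
print(m_permno)   # 1 NA 3 NA

# By default match() treats NA as equal to NA ...
m_default <- match(cusip_1, cusip_2)
print(m_default)  # 1 NA NA 3

# ... while incomparables = NA stops missing keys from pairing up
m_strict <- match(cusip_1, cusip_2, incomparables = NA)
print(m_strict)   # 1 NA NA NA
```

Without that argument, a stock whose cusip is missing in df_1 would be paired with whichever df_2 row also has a missing cusip, which is not a real match. The function then takes, row by row, the first non-NA position across the three columns to pick the df_2 row.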

【Comments】:

If this meets your requirements, you can accept it as the answer.