Selective left join in R: answers

【Question title】: Selective left join in R
【Posted】: 2021-09-13 04:12:38
【Question description】:

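I want to selectively left join two data frames, based on a condition on the join column and on the rows.

I have seen some similar posts that use fuzzyjoin and sqldf, but the examples I found are not quite the same as my case.

Example dfs: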

df1 <- data.frame(id = c("1", "2", "3"),
                  zipcode = c("11111", "44444", "33333"),
                  exp.id = c("0", "0", "1"))
df2 <- data.frame(zipcode = c("11111", "22222", "33333", "44444", "55555"),
                  pct = c("0.1", "0.5", "0.9", "0.7", "0.8"))

Basically, I want to join the "pct" column from df2 onto df1 by zipcode, but only for the rows where "exp.id" = "0".

My expected result looks like this:

  id    zipcode exp.id pct  
 <chr> <chr>   <chr>  <chr>
1 1     11111   0      0.1  
2 2     44444   0      0.7  
3 3     33333   1      NA

Thank you in advance.

【Question discussion】:

Tags: r sqldf fuzzyjoin

【Solution 1】:

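1) Left join df1 and df2 on zipcode, but match only the rows where exp.id is 0. For the other rows pct is NA, as in the expected result shown in the question. Note that dot is an SQL operator, so we surround exp.id with square brackets to escape the name.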

library(sqldf)

sqldf("select a.id, a.zipcode, b.pct
  from df1 a 
  left join df2 b on a.zipcode = b.zipcode and [exp.id] = 0")
##   id zipcode  pct
## 1  1   11111  0.1
## 2  2   44444  0.7
## 3  3   33333 <NA>

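2) This is similar to (1), but it returns only the rows where exp.id is zero. That is not what the question asks for, but a comment indicated it was of interest.

The difference between this code and (1) illustrates the subtle distinction between putting a condition in on and putting it in where. Because the join condition is simple in this case, we can use a using clause in place of on. using produces a single zipcode column, so we do not need to distinguish a.zipcode from b.zipcode.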

sqldf("select a.id, zipcode, b.pct
  from df1 a left join df2 b using(zipcode)
  where [exp.id] = 0")
##   id zipcode pct
## 1  1   11111 0.1
## 2  2   44444 0.7
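
The on versus where distinction can also be sketched without SQL. The following is only a base R analogue under the question's sample data, not what sqldf does internally; the names looked_up, on_style and where_style are made up for this illustration.

```r
# Sample data from the question (all columns are character).
df1 <- data.frame(id = c("1", "2", "3"),
                  zipcode = c("11111", "44444", "33333"),
                  exp.id = c("0", "0", "1"))
df2 <- data.frame(zipcode = c("11111", "22222", "33333", "44444", "55555"),
                  pct = c("0.1", "0.5", "0.9", "0.7", "0.8"))

# Plain lookup of pct by zipcode for every row of df1.
looked_up <- df2$pct[match(df1$zipcode, df2$zipcode)]

# Condition in `on`: every df1 row is kept; pct is filled only where exp.id is "0".
on_style <- df1
on_style$pct <- ifelse(df1$exp.id == "0", looked_up, NA)

# Condition in `where`: rows that fail the condition are dropped after the join.
where_style <- df1
where_style$pct <- looked_up
where_style <- where_style[where_style$exp.id == "0", ]

print(on_style)     # three rows, pct is NA for id 3
print(where_style)  # two rows, id 3 is gone
```

With the condition in on, the row count of df1 is preserved and only the match is suppressed; with the condition in where, the non-matching rows disappear from the result.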

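Note that the SQL engine internally builds a query plan to optimize the computation while keeping the output the same. It does not necessarily perform the operations in the order written: it need not perform the join and then reduce the result, but may reduce df1 first to improve performance, since that gives the same result. Below we show information about the query plan, and we see that it does indeed scan df1 first.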

sqldf("explain query plan select a.id, zipcode, b.pct
      from df1 a left join df2 b using(zipcode)
      where [exp.id] = 0")
##   id parent notused                                                           detail
## 1  3      0       0                                              SCAN TABLE df1 AS a
## 2 16      0       0 SEARCH TABLE df2 AS b USING AUTOMATIC COVERING INDEX (zipcode=?)

【Discussion】:

【Solution 2】:

Join the data, then set the pct values to NA where exp.id != 0.

library(dplyr)

res <- df1 %>%
        left_join(df2, by = 'zipcode') %>%
        mutate(pct = replace(pct, exp.id != 0, NA))

res

#  id zipcode exp.id  pct
#1  1   11111      0  0.1
#2  2   44444      0  0.7
#3  3   33333      1 <NA>

In base R:

res <- transform(merge(df1, df2, by = 'zipcode', all.x = TRUE), 
                 pct = replace(pct, exp.id != 0, NA))

You can also join only the rows where exp.id = 0.

df1 %>%
  filter(exp.id == 0) %>%
  left_join(df2, by = 'zipcode') %>%
  right_join(df1)
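
The same idea, looking up pct only for the rows where exp.id is "0", can be sketched in base R with match(). This is an illustrative alternative rather than part of the original answer; the names res and keep are assumptions, and the sketch assumes zipcode is unique in df2.

```r
# Sample data from the question (all columns are character).
df1 <- data.frame(id = c("1", "2", "3"),
                  zipcode = c("11111", "44444", "33333"),
                  exp.id = c("0", "0", "1"))
df2 <- data.frame(zipcode = c("11111", "22222", "33333", "44444", "55555"),
                  pct = c("0.1", "0.5", "0.9", "0.7", "0.8"))

# Start with pct missing everywhere, keeping df1's row order.
res <- df1
res$pct <- NA_character_

# Only the rows flagged here take part in the lookup.
keep <- res$exp.id == "0"

# match() returns, for each selected zipcode, its row position in df2.
res$pct[keep] <- df2$pct[match(res$zipcode[keep], df2$zipcode)]

print(res)
```

Rows that fail the condition never touch df2, so nothing is joined and then overwritten with NA, and the row order of df1 is left unchanged.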

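【Discussion】:

Thanks for the reply! Actually, I have a large number of exp.id values that do not need pct joined. I would like to know whether there is a way to join only where exp.id = 0, rather than turning the others into NA.
I have edited the answer to show this, but you can also use %in% to handle multiple values of exp.id. For example, with vals <- c(0, 2, 3, 4), use replace(pct, !exp.id %in% vals, NA) in the first approach and filter(exp.id %in% vals) in the second.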
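
A minimal sketch of the %in% variant in base R follows. The df1 below is hypothetical: it extends the question's data with two extra rows so that more than one exp.id value is in play. Note that the negation goes in front of the whole expression, since !exp.id %in% vals parses as !(exp.id %in% vals).

```r
# Hypothetical data: the question's df1 with two extra rows added.
df1 <- data.frame(id = c("1", "2", "3", "4", "5"),
                  zipcode = c("11111", "44444", "33333", "22222", "55555"),
                  exp.id = c("0", "0", "1", "2", "5"))
df2 <- data.frame(zipcode = c("11111", "22222", "33333", "44444", "55555"),
                  pct = c("0.1", "0.5", "0.9", "0.7", "0.8"))

# exp.id values whose rows should receive pct.
vals <- c(0, 2, 3, 4)

# Left join on zipcode, then blank out pct where exp.id is not in vals.
joined <- merge(df1, df2, by = "zipcode", all.x = TRUE)
joined$pct <- replace(joined$pct, !joined$exp.id %in% vals, NA)

# merge() sorts by the key, so restore the order by id for readability.
joined <- joined[order(joined$id), ]
print(joined)
```

Here exp.id is character and vals is numeric; %in% coerces both to a common type before matching, so "0" matches 0. Writing vals as character would make that explicit.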