【Title】: How do I change / avoid NAs in a column with dplyr joins?
【Posted】: 2021-10-03 16:25:27
【Question】:

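I'm trying to figure out how to use joins in dplyr. When I join a and b with full_join, I get 4 rows in which statefips is missing.

  1. Is there a better way to join that avoids this problem entirely without losing any data?
  2. Can statefips be added after a and b have been joined (the real a and b contain 4000+ rows)?
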
library(tidyverse)


# create df's
a <- tibble::tribble(
                                                        ~statename, ~statefips,        ~date, ~emp,
                                                         "Alabama",          1, "2020-01-14",    2,
                                                      "California",          6, "2020-01-14",    2,
                                                         "Alabama",          1, "2020-01-15",    2,
                                                      "California",          6, "2020-01-15",    2,
                                                         "Alabama",          1, "2020-01-16",    3,
                                                      "California",          6, "2020-01-16",    3,
                                                         "Alabama",          1, "2020-01-17",    3,
                                                      "California",          6, "2020-01-17",    3,
                                                         "Alabama",          1, "2020-01-18",    4,
                                                      "California",          6, "2020-01-18",    4,
                                                         "Alabama",          1, "2020-01-19",    4,
                                                      "California",          6, "2020-01-19",    4,
                                                         "Alabama",          1, "2020-01-20",    4,
                                                      "California",          6, "2020-01-20",    5,
                                                         "Alabama",          1, "2020-01-21",    5,
                                                      "California",          6, "2020-01-21",    5,
                                                         "Alabama",          1, "2020-01-22",    5,
                                                      "California",          6, "2020-01-22",    5,
                                                         "Alabama",          1, "2020-01-21",    5,
                                                      "California",          6, "2020-01-21",    4,
                                                         "Alabama",          1, "2020-01-22",    4,
                                                      "California",          6, "2020-01-22",    4,
                                                         "Alabama",          1, "2020-01-23",    4,
                                                      "California",          6, "2020-01-23",    4,
                                                         "Alabama",          1, "2020-01-24",    4,
                                                      "California",          6, "2020-01-24",    4
                                                      )
b <- tibble::tribble(
                                                        ~statename,        ~date, ~ui_claims,
                                                         "Alabama", "2020-01-04",      "0.5",
                                                      "California", "2020-01-04",      "0.5",
                                                         "Alabama", "2020-01-11",      "0.5",
                                                      "California", "2020-01-11",      "2.5",
                                                         "Alabama", "2020-01-18",      "2.5",
                                                      "California", "2020-01-18",      "1.5"
                                                      )
# Join a and b
full_join <- full_join(a, b, by = c("statename", "date")) %>% arrange(date)

# my try to fix missing NA's (doesn't work)

state_id <- tibble::tribble(
                                                        ~statename, ~statefips,
                                                         "Alabama",          1,
                                                      "California",          6
                                                      )

full_join_fix <- full_join(full_join, state_id, by = "statename") %>% arrange(date)
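
One likely reason the second join does not fill the gaps: both tables contain a `statefips` column that is not listed in `by`, so dplyr keeps both copies with `.x` / `.y` suffixes instead of merging them. The following is a minimal sketch of that behaviour and of one way to merge the two copies with `coalesce()`. The names `joined_small`, `with_both` and `fixed` are hypothetical, and the data is a shortened stand-in for the question's tables.

```r
library(dplyr)

# Shortened stand-in for the result of the first full_join (hypothetical data).
# The last row came only from `b`, so its statefips and emp are NA.
joined_small <- tibble::tribble(
  ~statename,   ~statefips, ~date,        ~emp, ~ui_claims,
  "Alabama",    1,          "2020-01-18", 4,    "2.5",
  "California", 6,          "2020-01-18", 4,    "1.5",
  "Alabama",    NA,         "2020-01-04", NA,   "0.5"
)

state_id <- tibble::tribble(
  ~statename,   ~statefips,
  "Alabama",    1,
  "California", 6
)

# Joining on statename alone leaves two copies: statefips.x and statefips.y
with_both <- left_join(joined_small, state_id, by = "statename")

# coalesce() takes the first non-missing value; the suffixed copies are dropped
fixed <- with_both %>%
  mutate(statefips = coalesce(statefips.x, statefips.y)) %>%
  select(statename, statefips, date, emp, ui_claims)
```
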

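【Comments】:

  • If something is missing, it won't appear in the joined table. Maybe you want left_join or inner_join.
  • As I understand it, full_join doesn't drop any data? It only adds NAs where needed.
  • Yes. A left join keeps all rows from the left table; an inner join keeps only rows that are present in both tables.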
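
The difference between the join types discussed above can be checked by counting rows. This is a small sketch on shortened, hypothetical stand-ins for `a` and `b` (`a_small`, `b_small`), not on the question's full data.

```r
library(dplyr)

# Shortened stand-ins for the question's tables (hypothetical data)
a_small <- tibble::tribble(
  ~statename,   ~statefips, ~date,        ~emp,
  "Alabama",    1,          "2020-01-17", 3,
  "Alabama",    1,          "2020-01-18", 4,
  "California", 6,          "2020-01-17", 3,
  "California", 6,          "2020-01-18", 4
)
b_small <- tibble::tribble(
  ~statename,   ~date,        ~ui_claims,
  "Alabama",    "2020-01-11", "0.5",
  "Alabama",    "2020-01-18", "2.5",
  "California", "2020-01-11", "2.5",
  "California", "2020-01-18", "1.5"
)
keys <- c("statename", "date")

n_full  <- nrow(full_join(a_small, b_small, by = keys))   # rows from both tables
n_left  <- nrow(left_join(a_small, b_small, by = keys))   # every row of a_small
n_inner <- nrow(inner_join(a_small, b_small, by = keys))  # matching keys only

# Rows that exist only in b_small have no statefips to carry over
n_missing <- full_join(a_small, b_small, by = keys) %>%
  filter(is.na(statefips)) %>%
  nrow()
```

Only `full_join` keeps the rows that exist solely in `b_small`, and those are exactly the rows where `statefips` comes back as `NA`.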

Tags: r dplyr tidyverse


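【Solution 1】:

I'm not quite sure if this is what you are looking for, but after the full_join we can arrange by statename and then fill the missing statefips values downward.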

library(dplyr)
library(tidyr)

a %>% 
  full_join(b, by = c("statename", "date")) %>% 
  arrange(statename) %>% 
  fill(statefips, .direction = "down") %>% 
  print(n=40)
   statename  statefips date         emp ui_claims
   <chr>          <dbl> <chr>      <dbl> <chr>    
 1 Alabama            1 2020-01-14     2 NA       
 2 Alabama            1 2020-01-15     2 NA       
 3 Alabama            1 2020-01-16     3 NA       
 4 Alabama            1 2020-01-17     3 NA       
 5 Alabama            1 2020-01-18     4 2.5      
 6 Alabama            1 2020-01-19     4 NA       
 7 Alabama            1 2020-01-20     4 NA       
 8 Alabama            1 2020-01-21     5 NA       
 9 Alabama            1 2020-01-22     5 NA       
10 Alabama            1 2020-01-21     5 NA       
11 Alabama            1 2020-01-22     4 NA       
12 Alabama            1 2020-01-23     4 NA       
13 Alabama            1 2020-01-24     4 NA       
14 Alabama            1 2020-01-04    NA 0.5      
15 Alabama            1 2020-01-11    NA 0.5      
16 California         6 2020-01-14     2 NA       
17 California         6 2020-01-15     2 NA       
18 California         6 2020-01-16     3 NA       
19 California         6 2020-01-17     3 NA       
20 California         6 2020-01-18     4 1.5      
21 California         6 2020-01-19     4 NA       
22 California         6 2020-01-20     5 NA       
23 California         6 2020-01-21     5 NA       
24 California         6 2020-01-22     5 NA       
25 California         6 2020-01-21     4 NA       
26 California         6 2020-01-22     4 NA       
27 California         6 2020-01-23     4 NA       
28 California         6 2020-01-24     4 NA       
29 California         6 2020-01-04    NA 0.5      
30 California         6 2020-01-11    NA 2.5 
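
The downward fill above depends on each state's block starting with a row that already has a `statefips` value. A variant that does not depend on row order is to fill within each state in both directions. This is a sketch on shortened, hypothetical stand-ins (`a_small`, `b_small`), assuming every state has at least one row with a known code.

```r
library(dplyr)
library(tidyr)

# Shortened stand-ins for the question's tables (hypothetical data)
a_small <- tibble::tribble(
  ~statename,   ~statefips, ~date,        ~emp,
  "Alabama",    1,          "2020-01-17", 3,
  "Alabama",    1,          "2020-01-18", 4,
  "California", 6,          "2020-01-17", 3,
  "California", 6,          "2020-01-18", 4
)
b_small <- tibble::tribble(
  ~statename,   ~date,        ~ui_claims,
  "Alabama",    "2020-01-11", "0.5",
  "Alabama",    "2020-01-18", "2.5",
  "California", "2020-01-11", "2.5",
  "California", "2020-01-18", "1.5"
)

joined <- a_small %>%
  full_join(b_small, by = c("statename", "date")) %>%
  group_by(statename) %>%                      # fill never crosses state borders
  fill(statefips, .direction = "downup") %>%   # down first, then up for leading NAs
  ungroup() %>%
  arrange(statename, date)                     # sorting is now purely cosmetic
```

Because the fill is grouped, the rows can be sorted by date (or anything else) afterwards without risking a value from one state leaking into another.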

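【Comments】:

  • Thanks! This works for this sample data. Would it work the same way if I had 50 state names?
  • Yes, that should work as well. After arrange, you can simply fill.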
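
For many states, another option is to avoid the `NA`s in the first place: drop `statefips` before the join and re-attach it from a one-row-per-state lookup. The sketch below uses shortened, hypothetical stand-ins (`a_small`, `b_small`) and a hypothetical `state_lookup` derived from `a_small`. It assumes every state in `b` also appears in `a` and has exactly one code, otherwise the lookup has to come from another source.

```r
library(dplyr)

# Shortened stand-ins for the question's tables (hypothetical data)
a_small <- tibble::tribble(
  ~statename,   ~statefips, ~date,        ~emp,
  "Alabama",    1,          "2020-01-17", 3,
  "Alabama",    1,          "2020-01-18", 4,
  "California", 6,          "2020-01-17", 3,
  "California", 6,          "2020-01-18", 4
)
b_small <- tibble::tribble(
  ~statename,   ~date,        ~ui_claims,
  "Alabama",    "2020-01-11", "0.5",
  "Alabama",    "2020-01-18", "2.5",
  "California", "2020-01-11", "2.5",
  "California", "2020-01-18", "1.5"
)

# One row per state, taken from the table that carries the codes
state_lookup <- distinct(a_small, statename, statefips)

result <- a_small %>%
  select(-statefips) %>%                           # drop the partial column
  full_join(b_small, by = c("statename", "date")) %>%
  left_join(state_lookup, by = "statename") %>%    # re-attach a code to every row
  relocate(statefips, .after = statename) %>%
  arrange(statename, date)
```
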