【Question Title】:Replacing NAs in list of dataframes with values from same column if values in another column are the same
【Posted】:2019-08-08 06:40:17
【Question】:

This is probably a very simple problem, but I can't seem to figure it out...

I have the following list

l <- list(May=data.frame(date=c(NA, as.Date("2019/5/1"),  NA,  NA,  NA, NA, as.Date("2019/5/2"),  NA, NA, NA, NA, NA, NA, NA), ID = c( "107349", "110024", "6187"  , "100420", "94436",  "88995" , "110165" ,"91644",  "108508", "105213", "108773", "102636" ,"102339" ,"100413")),
        April = data.frame(date=c(as.Date("2019/4/1"), as.Date("2019/4/2"),  NA,  NA,  NA,  NA,  NA, NA, NA, NA,  NA, NA, as.Date("2019/4/3"), NA, as.Date("2019/4/4"),  NA, NA, NA, NA, NA), ID=c("37866",  "107349", "93051",  "6187",   "98274",  "100420", "94436",  "88995"  ,"105107", "105109", "91644",  "105103" ,"108508" ,"105213", "108773", "85409"  ,"104145","102636" ,"102339" ,"100413")),
        March = data.frame(date= c(NA, NA,  NA,  NA,  NA,  NA, NA, NA, NA,  NA, NA, as.Date("2019/3/1"),  NA, NA, NA, NA, NA, NA), ID=c("93051" , "104499" ,"6187",   "98274",  "100420" ,"94436",  "88995"  ,"105107" ,"105109", "91644"  ,"105103", "105213" ,"85409" , "104145", "100989", "102636" ,"102339", "100413")),
        February = data.frame(date= c(NA , NA, as.Date("2019/2/1"),  NA,  NA,  NA,  NA ,as.Date("2019/2/2"), as.Date("2019/2/3"), as.Date("2019/2/4"),  NA, as.Date("2019/2/5"),  NA ,NA, as.Date("2019/2/6"), NA, NA, NA, NA, NA, NA, NA), ID=c("94266" , "93051",  "104499" ,"6187" ,  "98274",  "100420", "94436"  ,"88995",  "105107", "105109", "91644"  ,"105103", "85409"  ,"102252", "104145", "94559",  "101426", "100992" ,"100989" ,"102636" ,"102339" ,"100413")),
        January = data.frame(date = seq(as.Date("2019/1/1"),  by = "day", length.out = 18), ID=c("94266" , "93051",  "99836",  "6187" ,  "98274",  "100420", "94436",  "91644",  "85409",  "102252", "94412",  "94559",  "101426", "100992", "100989", "102636", "102339", "100413")))

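I'm trying to fill in missing values in one column (date) with the value from that same column wherever the value in another column (ID) is the same. For a given ID, the date should be identical across all data frames, but I only have the date for the first occurrence of each ID and NAs for all later occurrences.

I tried using match and subset, but I couldn't get it to work.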

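【Comments】:

  • Why is your data so inconsistent? The dates in the February data frame look like 17928, while the dates in the January data frame look like 2019-01-01. Also, what is `eRec` in the February data frame?
  • I also don't really understand what you're trying to achieve here. Could you give an example of the final output you expect?
  • @Adam Quek: The dates are inconsistent because some date columns start with NA instead of a valid date, so the remaining dates in that column are converted to numbers.
  • @JorisChau Thanks! I didn't know that R would automatically coerce date columns to numeric columns. Good to know.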
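
The coercion described in these comments can be reproduced in isolation. The sketch below assumes base R only; the variable names `bad`, `good`, and `fixed` are made up for illustration. `c()` dispatches on its first argument, so a leading plain `NA` (a logical) means the `Date` method is never used and the class is dropped.

```r
# A leading NA is logical, so c() falls back to its default method
# and the Date class of the later elements is lost.
bad <- c(NA, as.Date("2019-05-01"))
class(bad)   # "numeric"

# Starting with a Date-typed NA keeps the Date class.
good <- c(as.Date(NA), as.Date("2019-05-01"))
class(good)  # "Date"

# An already-coerced column can be converted back from days since the epoch.
fixed <- as.Date(bad, origin = "1970-01-01")
```

Using `as.Date(NA)` as the placeholder when building the data would avoid the problem at the source.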

Tags: r


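【Solution 1】:

Since the OP mentioned trying `match` and `subset`, here is another approach that uses `subset` to build an initial lookup data.frame and `match` to fill in the missing values: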

lookup <- do.call("rbind", l)
lookup <- subset(lookup, !is.na(lookup$date))

lapply(l, function(x) { x$date <- lookup$date[match(x$ID, lookup$ID)]; x })
#> $May
#>          date     ID
#> 1  2019-04-02 107349
#> 2  2019-05-01 110024
#> 3  2019-01-04   6187
#> 4  2019-01-06 100420
#> 5  2019-01-07  94436
#> 6  2019-02-02  88995
#> 7  2019-05-02 110165
#> 8  2019-01-08  91644
#> 9  2019-04-03 108508
#> 10 2019-03-01 105213
#> 11 2019-04-04 108773
#> 12 2019-01-16 102636
#> 13 2019-01-17 102339
#> 14 2019-01-18 100413
#> 
#> ...
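
To see the lookup step in isolation, here is a small sketch on toy data. The list `l_toy`, the table `lookup_toy`, and the helper `fill_dates` are hypothetical names introduced for illustration. Unlike the one-liner above, this variant only overwrites rows whose date is `NA`, which may be preferable if existing dates should never be touched.

```r
# Toy list (hypothetical data): each ID has a date in only one data frame
l_toy <- list(
  a = data.frame(date = as.Date(c(NA, "2019-05-01")), ID = c("x", "y"),
                 stringsAsFactors = FALSE),
  b = data.frame(date = as.Date(c("2019-04-01", NA)), ID = c("x", "y"),
                 stringsAsFactors = FALSE)
)

# Lookup table: every row that carries a date
lookup_toy <- do.call("rbind", l_toy)
lookup_toy <- lookup_toy[!is.na(lookup_toy$date), ]

# Hypothetical helper: fill only the missing dates via match() on ID
fill_dates <- function(x, lookup) {
  miss <- is.na(x$date)
  x$date[miss] <- lookup$date[match(x$ID[miss], lookup$ID)]
  x
}

filled <- lapply(l_toy, fill_dates, lookup = lookup_toy)
```

An ID that never has a date anywhere in the list gets no match, so its date simply stays `NA`.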

Data

Note that the data has been modified so that all `date` columns are of class `Date`.

l <- list(May = structure(list(date = structure(c(NA, 18017, NA, NA, 
NA, NA, 18018, NA, NA, NA, NA, NA, NA, NA), class = "Date"), 
    ID = c("107349", "110024", "6187", "100420", "94436", "88995", 
    "110165", "91644", "108508", "105213", "108773", "102636", 
    "102339", "100413")), class = "data.frame", row.names = c(NA, 
-14L)), April = structure(list(date = structure(c(17987, 17988, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17989, NA, 17990, NA, 
NA, NA, NA, NA), class = "Date"), ID = c("37866", "107349", "93051", 
"6187", "98274", "100420", "94436", "88995", "105107", "105109", 
"91644", "105103", "108508", "105213", "108773", "85409", "104145", 
"102636", "102339", "100413")), class = "data.frame", row.names = c(NA, 
-20L)), March = structure(list(date = structure(c(NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 17956, NA, NA, NA, NA, NA, NA
), class = "Date"), ID = c("93051", "104499", "6187", "98274", 
"100420", "94436", "88995", "105107", "105109", "91644", "105103", 
"105213", "85409", "104145", "100989", "102636", "102339", "100413"
)), class = "data.frame", row.names = c(NA, -18L)), February = structure(list(
    date = structure(c(NA, NA, 17928, NA, NA, NA, NA, 17929, 
    17930, 17931, NA, 17932, NA, NA, 17933, NA, NA, NA, NA, NA, 
    NA, NA), class = "Date"), ID = c("94266", "93051", "104499", 
    "6187", "98274", "100420", "94436", "88995", "105107", "105109", 
    "91644", "105103", "85409", "102252", "104145", "94559", 
    "101426", "100992", "100989", "102636", "102339", "100413"
    )), class = "data.frame", row.names = c(NA, -22L)), January = structure(list(
    date = structure(17897:17914, class = "Date"), ID = c("94266", 
    "93051", "99836", "6187", "98274", "100420", "94436", "91644", 
    "85409", "102252", "94412", "94559", "101426", "100992", 
    "100989", "102636", "102339", "100413")), class = "data.frame", row.names = c(NA, 
-18L)))

【Comments】:

    【Solution 2】:

    First, convert the `date` column to dates rather than numbers

    l <- lapply(l, function(x) {x$date <- as.Date(x$date, origin = "1970-01-01");x})
    

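    Then we can use `bind_rows` to bind the list of data frames into one, `group_by` `ID`, `fill` the `NA` dates, and finally use `group_split` to split the result back into a list of data frames.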

    library(dplyr)
    
    bind_rows(l, .id = "group") %>%
       mutate(group = factor(group, levels = names(l))) %>%
       group_by(ID) %>%
       tidyr::fill(date) %>%
       tidyr::fill(date, .direction = "up") %>%
       ungroup %>%
       group_split(group, .keep = FALSE) %>%  # dplyr >= 1.0.0; older versions used keep = FALSE
       setNames(names(l))
    
    #$May
    # A tibble: 14 x 2
    #   date       ID    
    #   <date>     <chr> 
    # 1 2019-04-02 107349
    # 2 2019-05-01 110024
    # 3 2019-01-04 6187  
    # 4 2019-01-06 100420
    # 5 2019-01-07 94436 
    # 6 2019-02-02 88995 
    # 7 2019-05-02 110165
    # 8 2019-01-08 91644 
    # 9 2019-04-03 108508
    #10 2019-03-01 105213
    #11 2019-04-04 108773
    #12 2019-01-16 102636
    #13 2019-01-17 102339
    #14 2019-01-18 100413
    #...
    

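    This assumes that every `ID` has at least one non-`NA` `date` somewhere in the list. Once we `group_by` `ID`, the non-`NA` value may sit above or below the `NA` values with the same `ID`, so we need to `fill` the `NA` values in both directions (the default is `"down"`). We create the `"group"` column during `bind_rows` to record which data frame each row came from, so that we can use it later to split the data again.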
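
The two-direction fill can be tried on a toy example. This is a minimal sketch, assuming tidyr 1.0.0 or later, where `fill()` accepts `.direction = "downup"` and so covers both passes in one call. The data frame `toy` is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: for "x" the date sits below the NA, for "y" above it
toy <- data.frame(
  ID   = c("x", "y", "x", "y"),
  date = as.Date(c(NA, "2019-05-01", "2019-04-01", NA)),
  stringsAsFactors = FALSE
)

filled <- toy %>%
  group_by(ID) %>%
  fill(date, .direction = "downup") %>%  # fill down first, then up, within each ID
  ungroup()
```

With only the default `"down"` direction, the first row for `"x"` would stay `NA`, because its only known date appears later in the group.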

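    【Comments】:

    • Could you elaborate a bit? `fill` looks cool, and using it twice seems like an even neater trick. I don't have an R session at hand to check right now, but I will.
    • @Tjebo Added some explanation to the answer.
    • Maybe I'm misunderstanding the question, but why does the `May` output contain 20 rows when the `May` input only contains 14?
    • @JorisChau No, you didn't. I hadn't checked the output properly. I've updated it and it should be correct now. Thanks for noticing :)
    • Thanks for looking into that. Neat. One could also use @akrun's nice trick to arrange the NAs in the first position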