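【Question title】: Joining two data frames based on the lower and upper limit of the target value in R
【Posted】: 2018-07-25 09:04:52
【Question】:

I have two data frames, df1 and df2. I want to join them so that the target value from df2 is added to df1. df1 and df2 are related through the columns group and value. In df1 I have a specific value, whereas df2 only holds the lower and upper limits of the range a value falls into.

Looking at df1 and df2 should make the task clear.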

df1 <- data.frame(group = c("A","B","C","D"),
                  value = c(15, 0, 40, 70))

df2 <- data.frame(group = c("A","A","A","A",
                            "B","B","B","B",
                            "C","C","C","C",
                            "D","D","D","D"),
                  lower_limit = c(0, 25, 60, 91,
                                  0, 35, 70, 92,
                                  0, 45, 80, 93,
                                  0, 55, 90, 94),
                  upper_limit = c(25, 60, 91, 100, 
                                  35, 70, 92, 100, 
                                  45, 80, 93, 100, 
                                  55, 90, 94, 100),
                  target = c("AGE0", "AGE1", "AGE3", "AGE4",
                             "AGE0", "AGE1", "AGE3", "AGE4",
                             "AGE0", "AGE1", "AGE3", "AGE4",
                             "AGE0", "AGE1", "AGE3", "AGE4"))

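I can do this with nested for loops and an if statement. However, my real data is much larger, so this loop is not an option. I'm sure there is a simpler solution. Any suggestions?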

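for (i in seq_len(nrow(df1))) {
  subset_string <- df1[i, 1]   # group of the current row
  target_value <- df1[i, 2]    # value to look up
  target_string <- NA          # reset so an earlier match is not carried over

  # limit rows that belong to this group
  df2_subset <- df2[df2$group == subset_string, ]

  for (j in seq_len(nrow(df2_subset))) {
    # whole numbers from lower_limit up to, but excluding, upper_limit
    temp_sequence <- seq(from = df2_subset[j, 2], to = df2_subset[j, 3] - 1)
    if (target_value %in% temp_sequence) {
      target_string <- as.character(df2_subset[j, 4])
    }
  }

  df1[i, 3] <- target_string   # store the matching target in a new third column
}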

【Comments】:

    Tags: r dataframe join match


    【Solution 1】:

    Not sure about the desired result. Perhaps with sqldf:

    df1 <- data.frame(group = c("A","B","C","D"),
                      value = c(15, 0, 40, 70))
    
    df2 <- data.frame(group = c("A","A","A","A",
                                "B","B","B","B",
                                "C","C","C","C",
                                "D","D","D","D"),
                      lower_limit = c(0, 25, 60, 91,
                                      0, 35, 70, 92,
                                      0, 45, 80, 93,
                                      0, 55, 90, 94),
                      upper_limit = c(25, 60, 91, 100, 
                                      35, 70, 92, 100, 
                                      45, 80, 93, 100, 
                                      55, 90, 94, 100),
                      target = c("AGE0", "AGE1", "AGE3", "AGE4",
                                 "AGE0", "AGE1", "AGE3", "AGE4",
                                 "AGE0", "AGE1", "AGE3", "AGE4",
                                 "AGE0", "AGE1", "AGE3", "AGE4"))
    
    library(sqldf)
    sqldf("select a.*, b.target
             from df1 a
             left join df2 b
               on a.`group` = b.`group`
                 AND a.value >= b.lower_limit 
                 AND a.value <= b.upper_limit")
    
    # group value target
    #1     A    15   AGE0
    #2     B     0   AGE0
    #3     C    40   AGE0
    #4     D    70   AGE1
    

    【Comments】:

    • Yes! That's it... this should also work for my large df. Thanks.
    • You can shorten it slightly with a.value between b.lower_limit and b.upper_limit
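
The BETWEEN suggestion from the comment, written out as a minimal sketch. It assumes the sqldf package is installed and rebuilds the question's data (using rep() for brevity) so the block runs on its own. BETWEEN is inclusive on both ends, so it should behave like the >= / <= version in the answer.

```r
library(sqldf)

# Data from the question, rebuilt compactly
df1 <- data.frame(group = c("A", "B", "C", "D"),
                  value = c(15, 0, 40, 70),
                  stringsAsFactors = FALSE)

df2 <- data.frame(
  group       = rep(c("A", "B", "C", "D"), each = 4),
  lower_limit = c(0, 25, 60, 91,  0, 35, 70, 92,  0, 45, 80, 93,  0, 55, 90, 94),
  upper_limit = c(25, 60, 91, 100,  35, 70, 92, 100,  45, 80, 93, 100,  55, 90, 94, 100),
  target      = rep(c("AGE0", "AGE1", "AGE3", "AGE4"), times = 4),
  stringsAsFactors = FALSE
)

# `group` is quoted with backticks because GROUP is an SQL keyword
res <- sqldf("select a.*, b.target
                from df1 a
                left join df2 b
                  on a.`group` = b.`group`
                 and a.value between b.lower_limit and b.upper_limit")
res
```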
    【Solution 2】:

    A data.table approach could be

    library(data.table)
    
    setDT(df2)[setDT(df1), .(group, value, target), 
               on = .(lower_limit <= value, upper_limit >= value, group)]
    

    which gives

       group value target
    1:     A    15   AGE0
    2:     B     0   AGE0
    3:     C    40   AGE0
    4:     D    70   AGE1
    


    Sample data:

    df1 <- structure(list(group = structure(1:4, .Label = c("A", "B", "C", 
    "D"), class = "factor"), value = c(15, 0, 40, 70)), .Names = c("group", 
    "value"), row.names = c(NA, -4L), class = "data.frame")
    
    df2 <- structure(list(group = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", 
    "D"), class = "factor"), lower_limit = c(0, 25, 60, 91, 0, 35, 
    70, 92, 0, 45, 80, 93, 0, 55, 90, 94), upper_limit = c(25, 60, 
    91, 100, 35, 70, 92, 100, 45, 80, 93, 100, 55, 90, 94, 100), 
        target = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 
        2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("AGE0", "AGE1", "AGE3", 
        "AGE4"), class = "factor")), .Names = c("group", "lower_limit", 
    "upper_limit", "target"), row.names = c(NA, -16L), class = "data.frame")
    

    Update: at the OP's request, a dplyr solution is

    library(dplyr)
    
    df1 %>% 
      left_join(df2, by = "group") %>%
      filter(value >= lower_limit, value <= upper_limit) %>%
      select(group, value, target)
    
    #  group value target
    #1     A    15   AGE0
    #2     B     0   AGE0
    #3     C    40   AGE0
    #4     D    70   AGE1
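
One thing to watch with the inclusive comparisons (>= and <=): in this data each upper_limit equals the next lower_limit, so a value that sits exactly on a limit (for example 25 in group A) satisfies two ranges and comes back twice. Filtering after the join also drops any df1 row that matches no range. The loop in the question treats the upper limit as exclusive, so here is a rough base R sketch along those lines. The two extra df1 rows (A/25 and A/100) and the helper column row_id are made up for illustration and are not part of the original data.

```r
# Question data plus two hypothetical rows: one on a limit, one with no match
df1x <- data.frame(group = c("A", "B", "C", "D", "A", "A"),
                   value = c(15, 0, 40, 70, 25, 100),
                   stringsAsFactors = FALSE)

df2 <- data.frame(
  group       = rep(c("A", "B", "C", "D"), each = 4),
  lower_limit = c(0, 25, 60, 91,  0, 35, 70, 92,  0, 45, 80, 93,  0, 55, 90, 94),
  upper_limit = c(25, 60, 91, 100,  35, 70, 92, 100,  45, 80, 93, 100,  55, 90, 94, 100),
  target      = rep(c("AGE0", "AGE1", "AGE3", "AGE4"), times = 4),
  stringsAsFactors = FALSE
)

# Remember the original row order (row_id is a helper column, not in the source)
df1x$row_id <- seq_len(nrow(df1x))

# Pair every df1x row with all limit rows of its group
candidates <- merge(df1x, df2, by = "group")

# Keep pairs where lower_limit <= value < upper_limit (upper limit exclusive)
hits <- candidates[candidates$value >= candidates$lower_limit &
                   candidates$value <  candidates$upper_limit,
                   c("row_id", "target")]

# Left join back so rows without a matching range stay in, with target = NA
result <- merge(df1x, hits, by = "row_id", all.x = TRUE)
result <- result[order(result$row_id), c("group", "value", "target")]
result
```

With these inputs the A/25 row gets AGE1 only, and A/100 is kept with an NA target. Whether the upper limit should be exclusive depends on how the ranges are meant to be read, so treat the `<` as an assumption to adjust.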
    

    【Comments】:

    • Indeed it does. =) Thanks everyone for your help and time.