
【Question title】: Conditionally interpolate values for one data frame based on another lookup table per group?
【Posted】: 2019-09-25 07:05:10
【Question】:

This is similar to the following question. However, I need a few additional steps:

• Group by the columns ID and order

• For each val in df_dat, look up the corresponding ratio in the df_lookup table, subject to the following conditions:

o   If val < min(df_lookup$val), set new_ratio = min(df_lookup$ratio)

o   If val > max(df_lookup$val), set new_ratio = max(df_lookup$ratio)

o   If val falls within the df_lookup$val range, do a simple linear interpolation
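For a single group, the three rules above can be sketched with base R's approx() and rule = 2. This is only an illustration, not the asker's code: the vectors lookup_val and lookup_ratio are hypothetical names holding the batch1 / order 1 rows of df_lookup, and it assumes ratio increases with val, so the ratio at each end of the range is also the minimum or maximum ratio.

```r
# Lookup points for one group (values of batch1 / order 1 from df_lookup)
lookup_val   <- c(1, 8, 25)
lookup_ratio <- c(0.2, 0.5, 1.2)

# rule = 2: out-of-range inputs get the ratio at the nearest end of the range
# instead of NA; inputs inside the range are interpolated linearly
new_ratio <- approx(x = lookup_val, y = lookup_ratio,
                    xout = c(0.1, 2, 30), rule = 2)$y
new_ratio
```

The call still has to be applied once per ID / order group, and groups without any lookup rows need separate handling.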

My data:

library(dplyr)

df_lookup <- tribble(
  ~ID, ~order, ~pct, ~val, ~ratio,
  "batch1", 1, 1,  1, 0.2,
  "batch1", 1, 10, 8, 0.5,
  "batch1", 1, 25, 25, 1.2,
  "batch2", 2, 1, 2, 0.1,
  "batch2", 2, 10, 15, 0.75,
  "batch2", 2, 25, 33, 1.5,
  "batch2", 2, 50, 55, 3.2,
)
df_lookup
#> # A tibble: 7 x 5
#>   ID     order   pct   val ratio
#>   <chr>  <dbl> <dbl> <dbl> <dbl>
#> 1 batch1     1     1     1  0.2 
#> 2 batch1     1    10     8  0.5 
#> 3 batch1     1    25    25  1.2 
#> 4 batch2     2     1     2  0.1 
#> 5 batch2     2    10    15  0.75
#> 6 batch2     2    25    33  1.5 
#> 7 batch2     2    50    55  3.2


df_dat <- tribble(
  ~order, ~ID, ~val,
  1, "batch1", 0.1,
  1, "batch1", 30,
  1, "batch1", 2,
  1, "batch1", 12,
  2, "batch1", 45,
  2, "batch2", 1.5,
  2, "batch2", 30,
  2, "batch2", 13,
  2, "batch2", 60,
)
df_dat
#> # A tibble: 9 x 3
#>   order ID       val
#>   <dbl> <chr>  <dbl>
#> 1     1 batch1   0.1
#> 2     1 batch1  30  
#> 3     1 batch1   2  
#> 4     1 batch1  12  
#> 5     2 batch1  45  
#> 6     2 batch2   1.5
#> 7     2 batch2  30  
#> 8     2 batch2  13  
#> 9     2 batch2  60

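The previous solution does not take the grouping into account, which produces wrong results.

Examples:

For order = 2 and ID = batch1, new_ratio should be NA, because this combination is not in the lookup table.

For order = 1, ID = batch1 and val = 30, new_ratio should not be higher than 1.2 (the maximum ratio value).

For order = 1, ID = batch1 and val = 2, new_ratio = 0.243, which is the ratio interpolated between 0.2 and 0.5.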
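The 0.243 figure can be reproduced by hand from the two neighbouring lookup rows of batch1 / order 1. A small sketch of that arithmetic, where x0, y0, x1 and y1 are hypothetical names for those two points:

```r
# Lookup points on either side of val = 2 (batch1, order 1)
x0 <- 1; y0 <- 0.2
x1 <- 8; y1 <- 0.5
val <- 2

# Linear interpolation between (x0, y0) and (x1, y1)
new_ratio <- y0 + (val - x0) * (y1 - y0) / (x1 - x0)
round(new_ratio, 3)
```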

Any help is appreciated!

#error
df_dat %>%
  group_by(ID, order) %>%
  mutate(new_ratio = with(df_lookup, approx(val, ratio, val))$y)
#> Error: Column `new_ratio` must be length 4 (the group size) or one, not 7

#wrong output
df_dat %>%
  group_by(ID, order) %>%
  mutate(val1 = val) %>%
  mutate(new_ratio = with(df_lookup, approx(val, ratio, val1))$y)
#> # A tibble: 9 x 5
#> # Groups:   ID, order [3]
#>   order ID       val  val1 new_ratio
#>   <dbl> <chr>  <dbl> <dbl>     <dbl>
#> 1     1 batch1   0.1   0.1    NA    
#> 2     1 batch1  30    30       1.39 
#> 3     1 batch1   2     2       0.1  
#> 4     1 batch1  12    12       0.643
#> 5     2 batch1  45    45       2.43 
#> 6     2 batch2   1.5   1.5     0.15 
#> 7     2 batch2  30    30       1.39 
#> 8     2 batch2  13    13       0.679
#> 9     2 batch2  60    60      NA

Expected output:

# A tibble: 9 x 4
  order ID       val new_ratio
  <dbl> <chr>  <dbl>     <dbl>
1     1 batch1   0.1     0.2  
2     1 batch1  30       1.2  
3     1 batch1   2       0.243
4     1 batch1  12       0.643
5     2 batch1  45      NA    
6     2 batch2   1.5     0.1 
7     2 batch2  30       1.38 
8     2 batch2  13       0.65 
9     2 batch2  60       3.2

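【Question comments】:

Hi. Could you add your expected output (not just the wrong output)? Your problem statement is also not entirely clear to me. Your previous question seemed completely different. Why are you using approx here? It doesn't look like you are trying to interpolate anything. Unless I'm missing something?
If val falls between two val entries in the lookup table, I need to linearly interpolate ratio across that range. I have added the expected output as you suggested. Thanks.

Tags: r dataframe dplyr data.table lookup-tables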

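【Solution 1】:

Here is my approach to your problem, using data.table.

I used a lot of intermediate steps so that you can inspect the results and manipulate each step to see what happens; the code can therefore be shortened a lot.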

library(data.table)

#set data to data.tables
setDT(df_dat); setDT(df_lookup)

#set range df_lookup values by ID and order combination
df_lookup[, `:=`( val2   = shift( val, type = "lead" ),
                  ratio2 = shift( ratio, type = "lead" ) ), 
          by = .( ID, order ) ][]

#join non-equi
df_dat[ df_lookup, 
        `:=`( val_start = i.val, 
              val_end = i.val2, 
              ratio_start = i.ratio, 
              ratio_end = i.ratio2 ), 
        on = .( ID, order, val > val, val < val2) ][]


#interpolate new_ratio for values that fall within a range of df_lookup
df_dat[, new_ratio := ratio_start + ( (val - val_start) * (ratio_end - ratio_start) / (val_end - val_start) )][]

#create data.table with ratio-value for minimum- and maximum value in df_lookup
df_lookup_min_max <- df_lookup[, .( val_min = min( val ), val_max = max( val ),
                                    ratio_min = min( ratio ), ratio_max = max( ratio ) ), 
                               by = .(ID, order) ]
df_lookup_min_max_melt <- melt( df_lookup_min_max, 
                                id.vars = c( "ID", "order" ),
                                measure.vars = patterns( val = "^val", 
                                                         ratio = "^ratio" ) )

df_dat[ is.na( new_ratio ), 
        new_ratio := df_lookup_min_max_melt[ df_dat[ is.na( new_ratio ), ],
                                             ratio, 
                                             on = .(ID, order, val ),
                                             roll = "nearest" ] ][]

df_dat[, `:=`(val_start = NULL, val_end = NULL, ratio_start = NULL, ratio_end = NULL)][]

Final output:

#    order     ID  val new_ratio
# 1:     1 batch1  0.1 0.2000000
# 2:     1 batch1 30.0 1.2000000
# 3:     1 batch1  2.0 0.2428571
# 4:     1 batch1 12.0 0.6647059
# 5:     2 batch1 45.0        NA
# 6:     2 batch2  1.5 0.1000000
# 7:     2 batch2 30.0 1.3750000
# 8:     2 batch2 13.0 0.6500000
# 9:     2 batch2 60.0 3.2000000

Edit

Row 5 (order 2, batch1, val 45.0, NA) is there because your df_lookup has no order == 2 & ID == batch1 combination...
Maybe that is a typo?
Nevertheless: the code seems to handle it just fine ;-)

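【Comments】:

Can you explain the rolling join part?
Sure... the join on intervals is fairly straightforward. The rolling join part is used to join to the minimum or maximum value. It joins to the nearest value, so if your value is greater than the max, it joins to the max. Values between min and max could cause problems, but you have already joined those in the previous step...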
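A minimal sketch of the roll = "nearest" behaviour described in the comment above, assuming the data.table package is installed. The tables extremes and probe are hypothetical and hold only the two end points of batch1 / order 1; they are not objects from the answer.

```r
library(data.table)

# Only the smallest and largest lookup val of one group, with their ratios
extremes <- data.table(val = c(1, 25), ratio = c(0.2, 1.2))

# Two values outside the range and one value inside it
probe <- data.table(val = c(0.1, 30, 12))

# roll = "nearest" joins each probe val to the closest val in `extremes`
clamped <- extremes[probe, on = .(val), roll = "nearest"]
clamped$ratio
```

The in-range value 12 is closer to 1 than to 25, so it picks up the minimum ratio. That is the problem the comment mentions, and it is why the answer fills the in-range rows first and applies the rolling join only to rows that are still NA.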

【Solution 2】:

library(dplyr)

df_dat %>% 
  left_join(df_lookup, by = c("ID", "order"), suffix = c(".dat", ".lkp")) %>% 
  group_by(ID, order, val.dat) %>% 
  mutate(ratio_new = case_when(
    val.dat < min(val.lkp) ~ min(ratio),
    val.dat > max(val.lkp) ~ max(ratio),
    # ifelse() handles the groups where val.lkp and ratio are all NA,
    # because approx() fails in those cases
    between(val.dat, min(val.lkp), max(val.lkp)) ~
      ifelse(all(is.na(ratio)), NA_real_,
             approx(x = val.lkp, y = ratio, xout = val.dat)$y),
    TRUE ~ NA_real_
  )) %>% 
  slice(1)

# A tibble: 9 x 7
# Groups:   ID, order, val.dat [9]
   order ID     val.dat   pct val.lkp ratio ratio_new
   <dbl> <chr>    <dbl> <dbl>   <dbl> <dbl>     <dbl>
1     1 batch1     0.1     1       1   0.2     0.2  
2     1 batch1     2       1       1   0.2     0.243
3     1 batch1    12       1       1   0.2     0.665
4     1 batch1    30       1       1   0.2     1.2  
5     2 batch1    45      NA      NA  NA      NA    
6     2 batch2     1.5     1       2   0.1     0.1  
7     2 batch2    13       1       2   0.1     0.65 
8     2 batch2    30       1       2   0.1     1.38 
9     2 batch2    60       1       2   0.1     3.2

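【Comments】:

Thanks. Can we use linear interpolation between the ratio ranges?
It does the job now. But when I replace ifelse with if_else, I get the error `` `false` must be length 1 (length of condition), not 3``. Do you know why?
With order=1 and ID=batch1, the call is approx(x=c(1,8,25), y=c(.2,.5,1.2), xout=c(2,2,2))$y. The 3 in the error message arises because all(is.na(c(.2,.5,1.2))) has length 1 while approx(...) has length 3. ifelse handles this case better than if_else because, per ?ifelse, ifelse returns a value with the same shape as test, whereas if_else is, as Hadley says here (at 13:30) and in many other places, more restrictive than it needs to be.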
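The length mismatch described in the last comment can be checked with base R alone. A small sketch using the batch1 / order 1 lookup values; test, interpolated and result are hypothetical names:

```r
# Condition of length 1: are all lookup ratios missing?
test <- all(is.na(c(0.2, 0.5, 1.2)))

# Interpolated values of length 3, one per joined row of the group
interpolated <- approx(x = c(1, 8, 25), y = c(0.2, 0.5, 1.2),
                       xout = c(2, 2, 2))$y

length(test)
length(interpolated)

# ifelse() shapes its result like `test`, so only the first value is kept
result <- ifelse(test, NA_real_, interpolated)
length(result)
```

Because every row in the group shares the same val.dat, the three interpolated values are identical and keeping only the first one loses nothing here.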

【Solution 3】:

An option using roll and rollends in data.table:

#slope of each lookup segment, computed within each ID/order group
df_lookup[, m := (ratio - shift(ratio, -1L)) / (val - shift(val, -1L)), by = .(ID, order)]

#interpolate val in df_dat that fall inside the lookup range
df_dat[, new_ratio := 
        df_lookup[.SD, on=.(order, ID, val), roll=Inf, rollends=c(FALSE, FALSE), 
            x.m * (i.val - x.val) + x.ratio]
    ]

#for val in df_dat that are larger than those in df_lookup
df_dat[is.na(new_ratio), new_ratio := 
    df_lookup[copy(.SD), on=.(order, ID, val), roll=Inf, x.ratio]]

#for val in df_dat that are smaller than those in df_lookup
df_dat[is.na(new_ratio), new_ratio := 
        df_lookup[copy(.SD), on=.(order, ID, val), roll=-Inf, x.ratio]]

Output:

   order     ID  val new_ratio
1:     1 batch1  0.1 0.2000000
2:     1 batch1 30.0 1.2000000
3:     1 batch1  2.0 0.2428571
4:     1 batch1 12.0 0.6647059
5:     2 batch1 45.0        NA
6:     2 batch2  1.5 0.1000000
7:     2 batch2 30.0 1.3750000
8:     2 batch2 13.0 0.6500000
9:     2 batch2 60.0 3.2000000

Data:

library(data.table)
df_lookup <- fread('ID, order, pct, val, ratio
"batch1", 1, 1,  1, 0.2
"batch1", 1, 10, 8, 0.5
"batch1", 1, 25, 25, 1.2
"batch2", 2, 1, 2, 0.1
"batch2", 2, 10, 15, 0.75
"batch2", 2, 25, 33, 1.5
"batch2", 2, 50, 55, 3.2')

df_dat <- fread('order, ID, val
1, "batch1", 0.1
1, "batch1", 30
1, "batch1", 2
1, "batch1", 12
2, "batch1", 45
2, "batch2", 1.5
2, "batch2", 30
2, "batch2", 13
2, "batch2", 60')

The last two statements can also be replaced with non-equi joins:

df_dat[is.na(new_ratio), new_ratio:= 
    df_lookup[copy(.SD), on=.(order, ID, val<val), x.ratio, mult="last"]]
df_dat[is.na(new_ratio), new_ratio:= 
    df_lookup[copy(.SD), on=.(order, ID, val>val), x.ratio, mult="first"]]
df_dat

【Comments】:

The result does not match my expected output. Could you check your code again?
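The comment does not say which rows differ. One row where the solutions and the question's expected output disagree is order = 1, ID = batch1, val = 12: the solutions return 0.665, while the expected output lists 0.643, the same figure that appears in the question's ungrouped "wrong output". A sketch to check both numbers with base R, where grouped and ungrouped are hypothetical names:

```r
# val = 12 interpolated within batch1 / order 1 only
grouped <- approx(x = c(1, 8, 25), y = c(0.2, 0.5, 1.2), xout = 12)$y

# val = 12 interpolated against all seven lookup rows, ignoring the groups
ungrouped <- approx(x = c(1, 8, 25, 2, 15, 33, 55),
                    y = c(0.2, 0.5, 1.2, 0.1, 0.75, 1.5, 3.2),
                    xout = 12)$y

round(c(grouped = grouped, ungrouped = ungrouped), 3)
```

If that row is what the comment refers to, the per-group value 0.665 is the one consistent with the rules stated in the question.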