Answers: Data frames of unequal lengths and joins based on shared variable and range of shared variable

【Question title】: Data frames of unequal lengths and joins based on shared variable and range of shared variable
【Posted】: 2021-11-02 05:23:27
【Question】:

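I need to add a column to my original data frame df1 based on where the Value variable in df1 falls relative to the observations in df2. So far I have tried left_join, but it does not work well: df1 and df2 have unequal lengths, and df1 ends up with extra rows after the join.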

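Think of df2 as the rankings in a contest, where the Value and Amount columns are sorted from high to low. For example, row 2 of df1 should return 200, because its Value of 450 is greater than the Value of 400 (the lower bound) but less than the Value of 525 (the upper bound) in df2. Each Value in df1 should be evaluated against the range it falls into (lower and upper bound), and the appropriate Amount returned. When there is a tie, the tied Amount should be returned. For example, row 3 of df1, with a Value of 525, should return an Amount of 500, because it ties with the Value of 525 in df2.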

df1:
        Date Value
1  10/1/2021   500
2  10/1/2021   450
3  10/1/2021   525
4  10/1/2021   700
5  10/1/2021   250
6  10/1/2021   105
7  10/1/2021    90
8  10/1/2021   325
9  10/1/2021   300
10 10/1/2021   275
11 10/1/2021   100
12 10/1/2021   289
13 10/1/2021   230
14 10/1/2021    50

df2:
        Date Rk Value Amount
1  10/1/2021  1   600    700
2  10/1/2021  2   525    500
3  10/1/2021  3   400    200
4  10/1/2021  4   350    100
5  10/1/2021  5   325     75
6  10/1/2021  6   300     65
7  10/1/2021  7   250     55
8  10/1/2021  8   200     50
9  10/1/2021  9   150     40
10 10/1/2021 10   100     30

desired output:
        Date Value Amount
1  10/1/2021   500    200
2  10/1/2021   450    200
3  10/1/2021   525    500
4  10/1/2021   700    700
5  10/1/2021   250     55
6  10/1/2021   105     30
7  10/1/2021    90      0
8  10/1/2021   325     75
9  10/1/2021   300     65
10 10/1/2021   275     55
11 10/1/2021   100     30
12 10/1/2021   289     55
13 10/1/2021   230     50
14 10/1/2021    50      0


## original df
df1 <- structure(list(Date = c("10/1/2021", "10/1/2021", "10/1/2021", 
"10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", 
"10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", 
"10/1/2021"), Value = c(500L, 450L, 525L, 700L, 250L, 105L, 90L, 
325L, 300L, 275L, 100L, 289L, 230L, 50L)), class = "data.frame", row.names = c(NA, 
-14L))

## df with Amount variable
df2 <- structure(list(Date = c("10/1/2021", "10/1/2021", "10/1/2021", 
"10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", 
"10/1/2021", "10/1/2021"), Rk = 1:10, Value = c(600L, 525L, 400L, 
350L, 325L, 300L, 250L, 200L, 150L, 100L), Amount = c(700L, 500L, 
200L, 100L, 75L, 65L, 55L, 50L, 40L, 30L)), class = "data.frame", row.names = c(NA, 
-10L))

## desired output
desired <- structure(list(Date = c("10/1/2021", "10/1/2021", "10/1/2021", 
"10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", 
"10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021", 
"10/1/2021"), Value = c(500L, 450L, 525L, 700L, 250L, 105L, 90L, 
325L, 300L, 275L, 100L, 289L, 230L, 50L), Amount = c(200L, 200L, 
500L, 700L, 55L, 30L, 0L, 75L, 65L, 55L, 30L, 55L, 50L, 0L)), class = "data.frame", row.names = c(NA, 
-14L))

【Question comments】:

Tags: r

【Solution 1】:

Here is a base R approach:

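# Breaks must be ascending, so reverse df2; right = FALSE makes intervals [lower, upper)
df1$Amount <- cut(df1$Value, c(rev(df2$Value), Inf), labels = rev(df2$Amount), right = FALSE)
# cut() returns a factor; convert its labels back to numbers
df1$Amount <- as.numeric(as.character(df1$Amount))
# Values below the smallest break fall in no interval (NA), so they get 0
df1$Amount[is.na(df1$Amount)] <- 0
df1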
        Date Value Amount
1  10/1/2021   500    200
2  10/1/2021   450    200
3  10/1/2021   525    500
4  10/1/2021   700    700
5  10/1/2021   250     55
6  10/1/2021   105     30
7  10/1/2021    90      0
8  10/1/2021   325     75
9  10/1/2021   300     65
10 10/1/2021   275     55
11 10/1/2021   100     30
12 10/1/2021   289     55
13 10/1/2021   230     50
14 10/1/2021    50      0
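
To see why the range lookup above yields these values, the same logic can be sketched with findInterval(), which returns the index of the left-closed interval each value falls into. This is an alternative illustration, not the answerer's method; the df1 and df2 below are abbreviated copies of the question's data, reduced to the columns the lookup needs.

```r
# Abbreviated copies of the question's data (assumption: same column names).
df1 <- data.frame(Value = c(500L, 450L, 525L, 700L, 90L, 100L))
df2 <- data.frame(
  Value  = c(600L, 525L, 400L, 350L, 325L, 300L, 250L, 200L, 150L, 100L),
  Amount = c(700L, 500L, 200L, 100L, 75L, 65L, 55L, 50L, 40L, 30L)
)

# findInterval() needs ascending breaks, so reverse df2 (sorted high to low).
breaks  <- rev(df2$Value)
amounts <- rev(df2$Amount)

# idx is 0 for values below the smallest break; otherwise it is the index of
# the largest break that is <= the value (intervals are closed on the left,
# so a tie such as 525 picks up the Amount of the tied row).
idx <- findInterval(df1$Value, breaks)

# Prepend 0 so that idx == 0 (below every break) maps to an Amount of 0.
df1$Amount <- c(0L, amounts)[idx + 1L]
df1
```

Because the lookup is plain vector indexing, df1 keeps its original number of rows regardless of how many rows df2 has.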

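【Comments】:

Any suggestions on how to handle the case where cut does not like my df$Value because there are not enough unique values? I tried using .bincode instead of cut to get around the error, but it does not assign the correct Amount in my real data.
@On_an_island What exactly do you mean by not enough unique values? You can even have 5 of them, as long as they are sorted.
I mean that in my real-world data there are not enough unique breaks in df1$Value, and cut throws the error: Error: Problem with mutate() input Amount. x 'breaks' are not unique
@On_an_island There is no such thing as not enough. Do you mean you have duplicates? cut works even with only 2 breaks.
I am accepting your answer because it works with my data. The problem of non-unique breaks in df2$Value is easily solved by removing the duplicates in df2. Doing so does not lose any information I need, because the distinct copy that remains after deduplication contains the value I need to add to df1. Thanks.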
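
The fix described in the last comment can be sketched roughly as follows: cut() requires unique breaks, so rows with a duplicated df2$Value are dropped before the breaks are built. The toy df1 and df2 below (with Value 300 repeated) are hypothetical, and keeping the first row per Value is an assumption that only suits data where duplicated rows carry the same Amount.

```r
# Hypothetical df2 with a duplicated Value (300), sorted high to low.
df2 <- data.frame(Value  = c(400L, 300L, 300L, 100L),
                  Amount = c(200L, 65L, 65L, 30L))
df1 <- data.frame(Value = c(450L, 300L, 250L, 50L))

# Keep the first row for each Value so the breaks passed to cut() are unique.
df2u <- df2[!duplicated(df2$Value), ]

# Same lookup as in the answer, now on the deduplicated table.
amt <- cut(df1$Value, c(rev(df2u$Value), Inf),
           labels = rev(df2u$Amount), right = FALSE)
df1$Amount <- as.numeric(as.character(amt))
df1$Amount[is.na(df1$Amount)] <- 0
df1
```

If duplicated Value rows can carry different Amount values, the rule for choosing between them would need to be decided first, for example by aggregating df2 before the lookup.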