【Question title】: Calculated Column Based on Rows with Date Range
【Posted】: 2021-11-08 15:19:54
【Question description】:

I have a data frame like the following:

ID	Col1	RespID	Col3	Col4	Year	Month	Day
1	blue	729Ad	3.2	A	2021	April	2
2	orange	295gS	6.5	A	2021	April	1
3	red	729Ad	8.4	B	2021	April	20
4	yellow	592Jd	2.9	A	2021	March	12
5	green	937sa	3.5	B	2021	May	13

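I would like to calculate a new column, Col5, that has the value 1 if the row has a Col4 value of A and there is another row somewhere in the dataset with the same RespID but a Col4 value of B. Otherwise its value is 0. Then I will drop all rows with a Col4 value of B, to keep just those with A. I would also like to take the date fields (Year, Month, Day) into account, so that the grouping is done within a time frame of, say, 30 days. So if the "B" occurs within 30 days of the "A" in the dataset, the value is 1; if the "B" occurs 60 days away, it is not. Also, I would like to keep everything as data.frames.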

This is what the desired output table looks like before dropping the rows with a Col4 value of B:

ID	Col1	RespID	Col3	Col4	Col5
1	blue	729Ad	3.2	A	1
2	orange	295gS	6.5	A	0
3	red	729Ad	8.4	B	0
4	yellow	592Jd	2.9	A	0
5	green	937sa	3.5	B	0

I found Ronak's solution in this thread (Calculated Column Based on Rows in Tidymodels Recipe) useful, but I would like to modify it to account for a date range.

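【Question comments】:

What is the significance of the Col1 and Col3 columns? Do they play a role?
Does B always come after A?
Are there always only one or two observations per RespID? If so, are the Col4 values always A and A/B, respectively?
"Then I will drop all rows with Col4 value of B, to keep just those with A." - Your desired output table still contains rows with a Col4 value of B. @ava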

Tags: r dplyr

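【Solution 1】:

There is a lot to unpack here. I think you are tripping yourself up by trying to do too much at once. I have broken the code down into four distinct steps to make the thought process easy to follow. Obviously, it should be rewritten more efficiently before being used in a production environment.

1. Generate some data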

library(tidyverse)
set.seed(42)

df <- tibble(
    id = c(1:10),
    resp_id = c(1701, seq(2286, 2289), 1701, seq(2290, 2293)),
    grouping = sample(c("A", "B"), size = 10, replace = TRUE),
    date = seq.Date(as.Date("2363-10-04"), as.Date("2363-11-17"), length.out = 10)
)

The resulting data:

# A tibble: 10 × 4
      id resp_id grouping date      
   <int>   <dbl> <chr>    <date>    
 1     1    1701 A        2363-10-04
 2     2    2286 A        2363-10-08
 3     3    2287 A        2363-10-13
 4     4    2288 A        2363-10-18
 5     5    2289 B        2363-10-23
 6     6    1701 B        2363-10-28
 7     7    2290 B        2363-11-02
 8     8    2291 B        2363-11-07
 9     9    2292 A        2363-11-12
10    10    2293 B        2363-11-17
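
The generated data already has a single date column, whereas the question stores the date as separate Year, Month and Day columns. As a minimal sketch (the data frame name question_df is hypothetical, and the values are copied from the question's table), the three columns could be combined into a Date like this:

```r
# Hypothetical input mirroring the question's layout (Year, Month name, Day).
question_df <- data.frame(
    ID     = 1:5,
    RespID = c("729Ad", "295gS", "729Ad", "592Jd", "937sa"),
    Col4   = c("A", "A", "B", "A", "B"),
    Year   = c(2021, 2021, 2021, 2021, 2021),
    Month  = c("April", "April", "April", "March", "May"),
    Day    = c(2, 1, 20, 12, 13)
)

# match() against the built-in month.name constant turns English month names
# into month numbers without relying on the session locale; ISOdate() then
# assembles year, month and day, and as.Date() drops the time part.
question_df$date <- as.Date(ISOdate(
    year  = question_df$Year,
    month = match(question_df$Month, month.name),
    day   = question_df$Day
))
```

This assumes the month names are spelled out in English exactly as in month.name; anything else would come out as NA.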

2. Check the grouping

df <- df %>%
    mutate(
        is_a = ifelse(grouping == "A", 1, 0),
        is_b = ifelse(grouping == "B", 1, 0)
    )

We now have the grouping as easy-to-use dummy variables:

> df
# A tibble: 10 × 6
      id resp_id grouping date        is_a  is_b
   <int>   <dbl> <chr>    <date>     <dbl> <dbl>
 1     1    1701 A        2363-10-04     1     0
 2     2    2286 A        2363-10-08     1     0
 3     3    2287 A        2363-10-13     1     0
 4     4    2288 A        2363-10-18     1     0
 5     5    2289 B        2363-10-23     0     1
 6     6    1701 B        2363-10-28     0     1
 7     7    2290 B        2363-11-02     0     1
 8     8    2291 B        2363-11-07     0     1
 9     9    2292 A        2363-11-12     1     0
10    10    2293 B        2363-11-17     0     1

3. Check completeness

df <- df %>%
    group_by(
        resp_id
    ) %>%
    mutate(
        # Check if the grouping has both "A" and "B" values
        is_complete = ifelse(
            sum(is_a) > 0 & sum(is_b) > 0, 
            1, 
            0
        )
    ) %>%
    ungroup()

We see that only one resp_id value is complete, namely 1701:

> df
# A tibble: 10 × 7
      id resp_id grouping date        is_a  is_b is_complete
   <int>   <dbl> <chr>    <date>     <dbl> <dbl>       <dbl>
 1     1    1701 A        2363-10-04     1     0           1
 2     2    2286 A        2363-10-08     1     0           0
 3     3    2287 A        2363-10-13     1     0           0
 4     4    2288 A        2363-10-18     1     0           0
 5     5    2289 B        2363-10-23     0     1           0
 6     6    1701 B        2363-10-28     0     1           1
 7     7    2290 B        2363-11-02     0     1           0
 8     8    2291 B        2363-11-07     0     1           0
 9     9    2292 A        2363-11-12     1     0           0
10    10    2293 B        2363-11-17     0     1           0

4. Assign the target value

df <- df %>%
    group_by(
        resp_id
    ) %>%
    mutate(
        # Check if the "A" part of a complete grouping has another value within 30 days
        is_within_timeframe = ifelse(
            is_complete == 1 & is_a == 1 & max(date) - min(date) <= 30, 
            1, 
            0
        )
    ) %>%
    ungroup()

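We see that our one complete set does have a B value within 30 days of the A observation (warning: this only works if there are always exactly one or two observations per grouping!). The is_within_timeframe column corresponds to your Col5: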

> df
# A tibble: 10 × 8
      id resp_id grouping date        is_a  is_b is_complete is_within_timeframe
   <int>   <dbl> <chr>    <date>     <dbl> <dbl>       <dbl>               <dbl>
 1     1    1701 A        2363-10-04     1     0           1                   1
 2     2    2286 A        2363-10-08     1     0           0                   0
 3     3    2287 A        2363-10-13     1     0           0                   0
 4     4    2288 A        2363-10-18     1     0           0                   0
 5     5    2289 B        2363-10-23     0     1           0                   0
 6     6    1701 B        2363-10-28     0     1           1                   0
 7     7    2290 B        2363-11-02     0     1           0                   0
 8     8    2291 B        2363-11-07     0     1           0                   0
 9     9    2292 A        2363-11-12     1     0           0                   0
10    10    2293 B        2363-11-17     0     1           0                   0
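
If a respondent can have more than two observations, max(date) - min(date) no longer measures the gap between an A row and a B row. A minimal sketch of a per-row check follows, under the assumption that a B only counts when it falls on or after the A and at most 30 days later (the question leaves the direction open). The helper has_b_within, the column name col5 and the sample data are hypothetical. The final filter() drops the B rows, as the question asks.

```r
library(dplyr)

# Hypothetical data: respondent 1701 has one "A" row and two "B" rows.
obs <- tibble(
    id       = 1:5,
    resp_id  = c(1701, 1701, 1701, 2286, 2286),
    grouping = c("A", "B", "B", "A", "B"),
    date     = as.Date(c("2363-10-04", "2363-10-28", "2364-01-15",
                         "2363-10-08", "2363-12-20"))
)

# For each element of `dates`, TRUE if at least one element of `b_dates`
# lies between 0 and `window` days after it.
has_b_within <- function(dates, b_dates, window = 30) {
    vapply(seq_along(dates), function(i) {
        gap <- as.numeric(b_dates) - as.numeric(dates[i])  # gap in days
        any(gap >= 0 & gap <= window)
    }, logical(1))
}

result <- obs %>%
    group_by(resp_id) %>%
    mutate(
        # Only "A" rows can receive a 1; the "B" dates come from the same group.
        col5 = as.integer(
            grouping == "A" & has_b_within(date, date[grouping == "B"])
        )
    ) %>%
    ungroup() %>%
    filter(grouping == "A")  # keep only the "A" rows
```

In this sample, the 1701 group spans more than 30 days overall, yet its A row still gets a 1 because one of its B rows is 24 days later. If a B before the A should also count, replacing the condition with abs(gap) <= window would be one way to express that.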

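【Discussion】:

Thanks Roman, your approach makes sense to me. I am testing it on my dataset. I get an error when creating the is_a and is_b fields: Error: Problem with `mutate()` column `is_a`. i `is_a= ifelse(grouping == "a", 1, 0)`. x comparison (1) is possible only for atomic and list types Run `rlang::last_error()` to see where the error occurred. Any thoughts on this? It seems the data type might be entered incorrectly?
@ava It looks like you made a mistake when copy-pasting or changing the code. Try copying everything and running it as-is.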
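
One possible cause of that error, assuming the data frame being tested has no column named grouping (the question's data calls it Col4): the name then resolves to the base R function grouping(), and comparing a function with a string fails with this kind of message. A minimal sketch with a hypothetical data frame ava_df:

```r
library(dplyr)

# Hypothetical data frame using the question's column name, Col4.
ava_df <- data.frame(
    RespID = c("729Ad", "295gS", "729Ad"),
    Col4   = c("A", "A", "B")
)

# With no `grouping` column, the name resolves to base::grouping(), a function,
# so the comparison inside mutate() raises an error.
failed <- inherits(
    try(mutate(ava_df, is_a = ifelse(grouping == "A", 1, 0)), silent = TRUE),
    "try-error"
)

# Renaming the column (or using Col4 directly) lets the step run as written.
fixed <- ava_df %>%
    rename(grouping = Col4) %>%
    mutate(
        is_a = ifelse(grouping == "A", 1, 0),
        is_b = ifelse(grouping == "B", 1, 0)
    )
```

Note also that the comparison is case-sensitive, so "a" would not match values stored as "A".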