Unexpected output from dplyr (if_else statement): answers

【Question title】: Unexpected output from dplyr (if_else statement)
【Posted】: 2018-08-01 07:27:07
【Question description】:

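I can't figure out why if_else behaves this way; it may be down to how my code or my data is structured.

Below is a snapshot of a database under development. It represents a longitudinal survey of study participants enrolled in a trial, with weekly follow-up visits.

The variable "survey_start" marks the start of a study-defined one-year follow-up period (which we call a "survey_year").

I am trying to label every follow-up entry of each participant in each survey year with the word "survey", followed by an underscore and the corresponding year, e.g. survey_2014.

Some entries are missing; for example, the participant shown here was not available when the 2015 survey started.

I have written two versions of the code. The first fails while the second works. The only differences are that the second fills in the entries in reverse order (2016 down to 2007 instead of 2007 up to 2016) and drops the if_else statement for 2015.

Please help me resolve this...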

    trialData <- structure(list(study = c("site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1", "site_1", "site_1", "site_1", "site_1", "site_1", 
"site_1", "site_1"), studyno = c("child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1", "child_1", "child_1", "child_1", "child_1", 
"child_1", "child_1"), date = structure(c(16078, 16085, 16092, 
16098, 16104, 16115, 16121, 16129, 16135, 16140, 16146, 16156, 
16162, 16168, 16177, 16185, 16191, 16195, 16203, 16210, 16217, 
16225, 16234, 16237, 16246, 16253, 16262, 16269, 16278, 16283, 
16288, 16297, 16304, 16311, 16319, 16326, 16332, 16337, 16346, 
16353, 16360, 16366, 16370, 16381, 16384, 16395, 16399, 16407, 
16415, 16422, 16444, 16452, 16454, 16467, 16474, 16477, 16484, 
16490, 16501, 16508, 16514, 16520, 16529, 16533, 16539, 16550, 
16556, 16564, 16566, 16578, 16582, 16593, 16599, 16604, 16613, 
16620, 16623, 16635, 16636, 16654, 16660, 16666, 16673, 16681, 
16688, 16693, 16702, 16706, 16714, 16721, 16728, 16734, 16745, 
16749, 16757, 16764, 16769, 16778, 16785, 16792, 16805, 16812, 
16819, 16830, 16832, 16839, 16846, 16856, 16862, 16867, 16877, 
16884, 16890, 16898, 16904, 16912, 16917, 16923, 16936, 16938, 
16953, 16960, 16966, 16973, 16980), class = "Date"), year = c(2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L), month = c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 
8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 
12L, 12L, 12L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 
7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 
11L, 11L, 11L, 11L, 12L, 12L, 12L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 
6L, 6L), survey_start = c("", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "Y", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "Y", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "")), class = "data.frame", row.names = c(NA, 
-125L), .Names = c("study", "studyno", "date", "year", "month", 
"survey_start"))

Code 1 (fails):

 trialData <- trialData %>% arrange(studyno, date) %>% group_by(studyno) %>%
mutate(survey_year = if_else(date >= date[survey_start == "Y" & year == 2007 & study == "site_1"][1] & date < date[month == 5 & year == 2008 & study == "site_1"][1], "survey_2007",
                     if_else(date >= date[survey_start == "Y" & year == 2008 & study == "site_1"][1] & date < date[month == 4 & year == 2009 & study == "site_1"][1], "survey_2008",
                     if_else(date >= date[survey_start == "Y" & year == 2009 & study == "site_1"][1] & date < date[month == 5 & year == 2010 & study == "site_1"][1], "survey_2009",
                     if_else(date >= date[survey_start == "Y" & year == 2010 & study == "site_1"][1] & date < date[month == 5 & year == 2011 & study == "site_1"][1], "survey_2010",
                     if_else(date >= date[survey_start == "Y" & year == 2011 & study == "site_1"][1] & date < date[month == 4 & year == 2012 & study == "site_1"][1], "survey_2011",
                     if_else(date >= date[survey_start == "Y" & year == 2012 & study == "site_1"][1] & date < date[month == 4 & year == 2013 & study == "site_1"][1], "survey_2012",
                     if_else(date >= date[survey_start == "Y" & year == 2013 & study == "site_1"][1] & date < date[month == 4 & year == 2014 & study == "site_1"][1], "survey_2013",
                     if_else(date >= date[survey_start == "Y" & year == 2014 & study == "site_1"][1] & date < date[month == 4 & year == 2015 & study == "site_1"][1], "survey_2014",
                     if_else(date >= date[survey_start == "Y" & year == 2015 & study == "site_1"][1] & date < date[month == 3 & year == 2016 & study == "site_1"][1], "survey_2015",        
                     if_else(date >= date[survey_start == "Y" & year == 2016 & study == "site_1"][1], "survey_2016","")))))))))))

Code 2 (works):

    trialData <- trialData %>% arrange(studyno, date) %>% group_by(studyno) %>%
  mutate(survey_year = if_else(date >= date[survey_start == "Y" & year == 2016 & study == "site_1"][1]                                                               , "survey_2016",
                           if_else(date >= date[survey_start == "Y" & year == 2014 & study == "site_1"][1] & date < date[month == 4 & year == 2015 & study == "site_1"][1], "survey_2014",
                           if_else(date >= date[survey_start == "Y" & year == 2013 & study == "site_1"][1] & date < date[month == 4 & year == 2014 & study == "site_1"][1], "survey_2013",
                           if_else(date >= date[survey_start == "Y" & year == 2012 & study == "site_1"][1] & date < date[month == 4 & year == 2013 & study == "site_1"][1], "survey_2012",
                           if_else(date >= date[survey_start == "Y" & year == 2011 & study == "site_1"][1] & date < date[month == 4 & year == 2012 & study == "site_1"][1], "survey_2011",
                           if_else(date >= date[survey_start == "Y" & year == 2010 & study == "site_1"][1] & date < date[month == 5 & year == 2011 & study == "site_1"][1], "survey_2010",
                           if_else(date >= date[survey_start == "Y" & year == 2009 & study == "site_1"][1] & date < date[month == 5 & year == 2010 & study == "site_1"][1], "survey_2009",
                           if_else(date >= date[survey_start == "Y" & year == 2008 & study == "site_1"][1] & date < date[month == 4 & year == 2009 & study == "site_1"][1], "survey_2008",
                           if_else(date >= date[survey_start == "Y" & year == 2007 & study == "site_1"][1] & date < date[month == 5 & year == 2008 & study == "site_1"][1], "survey_2007",""))))))))))

【Question discussion】:

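I think you can do this without nested ifelse statements. Create a key/val dataset and then do a merge.
It's very unclear what you want, but I suspect that, because your code is so complex, there is a hidden bug lurking in it. Until you clarify what you are trying to accomplish, it will be hard to help you.
Hi @akrun, could you give an example using the attached dataset? Also, I'm glad you pointed this out, because we expect more datasets spanning multiple years, and I think that many if_else statements will become a problem.
@AidanGawronski, the attached dataset contains this child's weekly follow-up visits from 2014 to 2016. Each year a survey is run that covers the following year, and each child has a different start date. I intend to group all entries belonging to a particular "survey year" by labelling them, e.g., "survey_2014" (in the variable "survey_year") for all entries that fall in the 2014 survey year. In addition, since a child's start date can be any time within a year, the follow-up will most likely end during the next calendar year, and the next one will start shortly after the previous one ends.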
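
One plausible explanation for the difference between the two versions, offered as a sketch and not a confirmed diagnosis: the sample data has no rows for 2007, so date[survey_start == "Y" & year == 2007 & ...][1] evaluates to NA, every comparison against it is NA, and if_else returns NA wherever its condition is NA without consulting the nested false branch. The toy vectors below (date, start, year and the cond_* names are made up for illustration) reproduce that behaviour on three rows:

```r
library(dplyr)

# Hypothetical three-row stand-in for the trial data
date  <- as.Date(c("2014-01-08", "2014-05-05", "2016-03-17"))
start <- c("", "Y", "Y")
year  <- c(2014L, 2014L, 2016L)

# No row matches year 2007: the subset is empty, so [1] yields NA
start_2007 <- date[start == "Y" & year == 2007][1]
start_2016 <- date[start == "Y" & year == 2016][1]

# Any comparison against NA is NA for every row
cond_2007 <- date >= start_2007   # NA NA NA
cond_2016 <- date >= start_2016   # FALSE FALSE TRUE

# Order as in Code 1: the outermost condition is NA everywhere,
# so if_else() returns NA and never looks at the nested branch
code1_like <- if_else(cond_2007, "survey_2007",
              if_else(cond_2016, "survey_2016", ""))

# Order as in Code 2: the outermost condition is never NA,
# so the 2016 row is labelled before any NA condition is reached
code2_like <- if_else(cond_2016, "survey_2016",
              if_else(cond_2007, "survey_2007", ""))

print(code1_like)  # NA NA NA
print(code2_like)  # NA NA "survey_2016"
```

Note that even in the reversed order, rows that fall through to a branch whose condition is NA still come out as NA instead of "", so reordering hides the problem without removing it.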

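Tags: r dplyr

【Solution 1】:

As @akrun commented, you can do this by joining data instead of using if_else. The process is roughly as follows:

1. Create a dataset containing only the visits that start a survey year.
   - Define the start and end dates and the survey-year label here
2. Join the start-visit data to the original data
   - Keep the rows that fall within a survey year
   - Select only the columns needed to identify a visit, plus the survey-year label
3. Join the result back to the original data.

Here is how to do it with dplyr: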

library(tidyverse)
library(lubridate)

# Modify the data so that there's an overlap of survey years,
# in order to demonstrate how to deal with it
df <- as_tibble(trialData) %>% 
  mutate(survey_start = if_else(row_number() == 52, "Y", survey_start))

# Pick out rows that start a "survey year"
starts <- df %>% 
  filter(survey_start == "Y") %>% 
  group_by(study, studyno) %>% 
  transmute(
    survey_year = str_c("survey_", year),
    start_date = date,
    end_date   = pmin(
      start_date + years(1),  # make sure that the survey year
      lead(start_date),       # ends before next one starts
      na.rm = TRUE
    )
  ) %>% ungroup()
#> Adding missing grouping variables: `study`, `studyno`

# Join all starts to the visit data
years <- df %>% 
  left_join(starts) %>% 
  # Keep rows which fall within one year of a start
  filter(date >= start_date, date < end_date) %>% 
  select(study, studyno, date, survey_year)
#> Joining, by = c("study", "studyno")

Now years contains all the visits that fall within a "survey year".

# Join the year classifications to the original data
result <- df %>%
  left_join(years)
#> Joining, by = c("study", "studyno", "date")
stopifnot(nrow(result) == nrow(df))

We can also inspect the result:

# Check the rows before and after each start
i <- which(result$survey_start == "Y")
result %>% slice(sort(c(i - 1, i, i + 1)))
#> # A tibble: 9 x 7
#>   study  studyno date        year month survey_start survey_year
#>   <chr>  <chr>   <date>     <int> <int> <chr>        <chr>      
#> 1 site_1 child_1 2014-05-01  2014     5 ""           <NA>       
#> 2 site_1 child_1 2014-05-05  2014     5 Y            survey_2014
#> 3 site_1 child_1 2014-05-13  2014     5 ""           survey_2014
#> 4 site_1 child_1 2015-01-09  2015     1 ""           survey_2014
#> 5 site_1 child_1 2015-01-17  2015     1 Y            survey_2015
#> 6 site_1 child_1 2015-01-19  2015     1 ""           survey_2015
#> 7 site_1 child_1 2016-03-07  2016     3 ""           <NA>       
#> 8 site_1 child_1 2016-03-17  2016     3 Y            survey_2016
#> 9 site_1 child_1 2016-03-24  2016     3 ""           survey_2016

Created on 2018-02-22 by the reprex package (v0.2.0).
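
To see in isolation how the end_date expression in starts caps each survey year, here is a minimal sketch that uses only the three start dates visible in the output above. The vector name start_date mirrors the column in the answer; everything else is illustrative.

```r
library(dplyr)      # for lead()
library(lubridate)  # for years()

# The three "Y" visits shown in the result table
start_date <- as.Date(c("2014-05-05", "2015-01-17", "2016-03-17"))

# A survey year ends one year after it starts, or when the next
# survey year starts, whichever comes first. lead() is NA for the
# last start, and na.rm = TRUE makes pmin() ignore that NA.
end_date <- pmin(start_date + years(1), lead(start_date), na.rm = TRUE)

print(data.frame(start_date, end_date))
# end_date: 2015-01-17, 2016-01-17, 2017-03-17
```

The second interval ends on 2016-01-17, which is consistent with the 2016-03-07 visit having no survey_year in the table: it falls after the 2015 interval closes and before the 2016 one opens.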

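【Discussion】:

Hi @mikko, thanks for the code. Let me test it on a larger dataset and come back if I run into problems.
Hi @Mikko, I noticed that the line end_date = start_date + years(1) does not take care of participants without complete follow-up, i.e. those whose entries do not span a full year, so they may have only 6 months of data for a given year. Is there a way around this?
Your example data has such a case at the end, and it is handled correctly, isn't it?
The end date is only used to define the interval within which all of a participant's visits are classified into that survey year.
Explicitly, you check date >= start_date & date < end_date for every visit and every survey of each participant.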
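
A small base R sketch of that interval check may make the point about incomplete follow-up concrete. The visit dates and the visits, in_year and label names are hypothetical; the participant here has only a few months of visits after the start, yet every one of them is still labelled, because end_date only bounds the interval and does not require visits to exist up to it.

```r
# Hypothetical survey year: starts 2016-03-17, nominally ends a year later
start_date <- as.Date("2016-03-17")
end_date   <- as.Date("2017-03-17")

# Hypothetical visits: one before the start, then follow-up that stops in June
visits <- as.Date(c("2016-03-07", "2016-03-17", "2016-06-28"))

# Half-open interval: start is included, end is excluded
in_year <- visits >= start_date & visits < end_date

# Label the visits inside the interval, leave the others as NA
label <- ifelse(in_year, "survey_2016", NA_character_)

print(data.frame(visits, label))
# label: NA, "survey_2016", "survey_2016"
```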