【Question title】: Add rows in data frame if observations are missing [duplicate]
【Posted】: 2020-03-08 23:02:42
【Question】:

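I have a data frame df1 with several questionnaires (measure) per participant (id), each answered at a specific point in time (date). Normally, every participant should fill in three questionnaires per session (first, pre, post). Some participants did not fill in all three and answered only one or two. The possible patterns are therefore: complete (participant A), missing "post" (participant B), missing "first" (participant C), missing "pre" (participant D), or only one of the three answered (participants E, F, G).

See df1: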

df1 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L,  4L, 5L, 6L, 7L), .Label = c("A", "B", "C", "D", "E", "F", "G"), class = "factor"), measure = structure(c(1L, 3L, 2L, 1L, 3L, 3L, 2L, 1L, 2L, 1L, 3L, 2L), .Label = c("first", "post", "pre"), class = "factor"), date = structure(c(17558, 17558, 17558,  17558, 17559, 17559, 17559, 17559, 17558, 17558, 17558, 17558 ), class = "Date"), result = c(1, 5, 4, 7, 8, 7, 2, 1, 3, 5, 7, 7)), class = "data.frame", row.names = c(NA, -12L))

Now I want to add the missing rows to the data set, keeping id and measure and filling the missing date and result with NA. The final data frame should look like df2.

df2 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("A", "B", "C", "D", "E", "F", "G"), class = "factor"), measure = structure(c(1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L), .Label = c("first", "post", "pre"), class = "factor"), date = structure(c(17558, 17558, 17558, 17558, 17559, NA, NA, 17559, 17559, 17559, NA, 17558, 17558, NA, NA, NA, 17558, NA, NA, NA, 17558), class = "Date"), result = c(1, 5, 4, 7, 8, NA, NA, 7, 2, 1, NA, 3, 5, NA, NA, NA, 7, NA, NA, NA, 7)), class = "data.frame", row.names = c(NA, -21L))

I tried grouping by the combinations that might be missing and inserting a row, but that did not produce the expected result.

require (tidyverse)
final <- df1 %>%
group_by(id, measure == "first" & lag(measure, 1, default=NA) == "post") %>%
do(add_row(., measure = "pre", .after = 0)) %>%
ungroup()

I also tried:

final <- df1 %>% complete(id, nesting(measure, date))

To complicate matters, participants can attend more than one session, so each id may have x * (first, post, pre) rows.
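
To make the multi-session case concrete, here is a minimal sketch. It assumes a hypothetical `session` column that marks which session a row belongs to; df1 has no such column, and the `visits` data below is made up for illustration. Under that assumption, nesting id with session inside complete() would add the missing measures per session rather than per participant.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: `session` is an assumed column, not part of df1.
visits <- data.frame(
  id      = c("A", "A", "A", "A", "B", "B"),
  session = c(1, 1, 1, 2, 1, 1),
  measure = c("first", "pre", "post", "pre", "first", "post"),
  result  = c(1, 5, 4, 6, 7, 8)
)

completed <- visits %>%
  # Fix the level order so sorting follows first, pre, post.
  mutate(measure = factor(measure, levels = c("first", "pre", "post"))) %>%
  # nesting() keeps only id/session pairs that occur in the data;
  # each of those pairs is then crossed with every level of measure.
  complete(nesting(id, session), measure) %>%
  arrange(id, session, measure)

print(as.data.frame(completed))
```

Participant A's second session and participant B's first session each get their missing measures as rows with NA in result, while no rows are invented for a second session of B.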

【Comments】:

    Tags: r tidyverse


    【Solution 1】:

    This can be done with complete(df1, id, measure). Try this:

    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    library(tidyr)
    
    df1 <- structure(list(
      id = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L,  4L, 5L, 6L, 7L), 
                     .Label = c("A", "B", "C", "D", "E", "F", "G"), 
                     class = "factor"), 
      measure = structure(c(1L, 3L, 2L, 1L, 3L, 3L, 2L, 1L, 2L, 1L, 3L, 2L), 
                          .Label = c("first", "post", "pre"), 
                          class = "factor"), 
      date = structure(c(17558, 17558, 17558,  17558, 17559, 17559, 17559, 17559, 17558, 17558, 17558, 17558 ), class = "Date"), 
      result = c(1, 5, 4, 7, 8, 7, 2, 1, 3, 5, 7, 7)), class = "data.frame", row.names = c(NA, -12L))
    
    df2 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("A", "B", "C", "D", "E", "F", "G"), class = "factor"), measure = structure(c(1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L), .Label = c("first", "post", "pre"), class = "factor"), date = structure(c(17558, 17558, 17558, 17558, 17559, NA, NA, 17559, 17559, 17559, NA, 17558, 17558, NA, NA, NA, 17558, NA, NA, NA, 17558), class = "Date"), result = c(1, 5, 4, 7, 8, NA, NA, 7, 2, 1, NA, 3, 5, NA, NA, NA, 7, NA, NA, NA, 7)), class = "data.frame", row.names = c(NA, -21L))
    
    # Result with complete(df1, id, measure) and setting order of measure
    complete(df1, id, measure) %>% 
      mutate(measure = factor(measure, levels = c("first", "pre", "post"))) %>% 
      arrange(id, measure, date) %>% 
      as.data.frame()
    #>    id measure       date result
    #> 1   A   first 2018-01-27      1
    #> 2   A     pre 2018-01-27      5
    #> 3   A    post 2018-01-27      4
    #> 4   B   first 2018-01-27      7
    #> 5   B     pre 2018-01-28      8
    #> 6   B    post       <NA>     NA
    #> 7   C   first       <NA>     NA
    #> 8   C     pre 2018-01-28      7
    #> 9   C    post 2018-01-28      2
    #> 10  D   first 2018-01-28      1
    #> 11  D     pre       <NA>     NA
    #> 12  D    post 2018-01-27      3
    #> 13  E   first 2018-01-27      5
    #> 14  E     pre       <NA>     NA
    #> 15  E    post       <NA>     NA
    #> 16  F   first       <NA>     NA
    #> 17  F     pre 2018-01-27      7
    #> 18  F    post       <NA>     NA
    #> 19  G   first       <NA>     NA
    #> 20  G     pre       <NA>     NA
    #> 21  G    post 2018-01-27      7
    
    # Desired output
    df2 %>% 
      mutate(measure = factor(measure, levels = c("first", "pre", "post"))) %>% 
      arrange(id, measure, date)
    #>    id measure       date result
    #> 1   A   first 2018-01-27      1
    #> 2   A     pre 2018-01-27      5
    #> 3   A    post 2018-01-27      4
    #> 4   B   first 2018-01-27      7
    #> 5   B     pre 2018-01-28      8
    #> 6   B    post       <NA>     NA
    #> 7   C   first       <NA>     NA
    #> 8   C     pre 2018-01-28      7
    #> 9   C    post 2018-01-28      2
    #> 10  D   first 2018-01-28      1
    #> 11  D     pre       <NA>     NA
    #> 12  D    post 2018-01-27      3
    #> 13  E   first 2018-01-27      5
    #> 14  E     pre       <NA>     NA
    #> 15  E    post       <NA>     NA
    #> 16  F   first       <NA>     NA
    #> 17  F     pre 2018-01-27      7
    #> 18  F    post       <NA>     NA
    #> 19  G   first       <NA>     NA
    #> 20  G     pre       <NA>     NA
    #> 21  G    post 2018-01-27      7
    

    Created on 2020-03-09 by the reprex package (v0.3.0)
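
For comparison, the following sketch contrasts the question's attempt, complete(id, nesting(measure, date)), with complete(id, measure). It rebuilds df1 from plain vectors for readability; the names `by_pair` and `by_measure` are introduced here for illustration only. As far as I understand tidyr, nesting() restricts the expansion to measure/date pairs that actually occur, so every id is crossed with each observed pair instead of with the three measures.

```r
library(dplyr)
library(tidyr)

# Same values as df1, written out with plain vectors.
df1 <- data.frame(
  id      = factor(c("A", "A", "A", "B", "B", "C", "C", "D", "D", "E", "F", "G")),
  measure = factor(c("first", "pre", "post", "first", "pre", "pre", "post",
                     "first", "post", "first", "pre", "post")),
  date    = as.Date(c(17558, 17558, 17558, 17558, 17559, 17559, 17559,
                      17559, 17558, 17558, 17558, 17558), origin = "1970-01-01"),
  result  = c(1, 5, 4, 7, 8, 7, 2, 1, 3, 5, 7, 7)
)

# Each observed measure/date pair is treated as one unit, so every id
# is crossed with all distinct pairs found in the data.
by_pair <- complete(df1, id, nesting(measure, date))

# Crossing id with measure alone leaves date and result as NA
# in the rows that had to be added.
by_measure <- complete(df1, id, measure)

nrow(by_pair)     # 7 ids x 6 distinct measure/date pairs
nrow(by_measure)  # 7 ids x 3 measures
```

Because date is part of the nested unit in the first call, the added rows carry a date and the result has twice as many rows as df2, which is why that attempt does not match the desired output.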

    【Discussion】:

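    • Looks like a nice solution, thanks. I still have to check it on the "real" data. However, the measures in the desired output are not ordered first ➜ pre ➜ post. Could that be achieved if the date variable were more precise? Or should I first recode first, pre, post to 1, 2, 3?
    • Yes. Recode measure according to the desired order, e.g. `factor(measure, levels = c("first", "pre", "post"))`. I only rearranged to compare the solutions. (;
    • `df1 %>% mutate(measure = factor(measure, levels = c("first", "pre", "post")))` changes the factor levels. But `arrange(id, date, measure)` does not give the right order, because the "time" is missing in the rows added by complete().
    • Hi 斯莱尔斯. Just change the order of the variables in arrange, i.e. `arrange(id, measure, date)`. I have just edited my post to include the code for reordering the factor and the order of the variables used in arrange.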
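
The ordering issue raised in these comments can be reproduced on a single participant. The sketch below uses participant C's rows as they look after complete(); the data frame `c_rows` is written out by hand for illustration. dplyr's arrange() always places NA last, so sorting by date before measure moves the added row to the end of the group.

```r
library(dplyr)

# Participant C after complete(): the added "first" row has no date.
c_rows <- data.frame(
  id      = "C",
  measure = factor(c("first", "pre", "post"),
                   levels = c("first", "pre", "post")),
  date    = as.Date(c(NA, "2018-01-28", "2018-01-28")),
  result  = c(NA, 7, 2)
)

# Sorting by date first pushes the NA row to the end of the group.
by_date <- arrange(c_rows, id, date, measure)
as.character(by_date$measure)     # "pre" "post" "first"

# Sorting by the releveled factor first keeps the intended order.
by_measure <- arrange(c_rows, id, measure, date)
as.character(by_measure$measure)  # "first" "pre" "post"
```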