Create new variables based on specific factor levels in time series data with dplyr: answers

【Question title】: Create new variables based on specific factor levels in time series data with dplyr
【Posted】: 2021-06-08 20:27:39
【Question description】:

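I have some time series data in which both the steps of a sequence (ranging from 1 to 8) and its topic (more than 100 distinct values) are coded as character factor levels in a single variable. Here is a minimal example (I left out the timestamps, which increase within each id):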

id <- c(1,rep(2,5),rep(3,4))
step <- c("call", "call", "agent", "forest", "forward", "resolved", "call", "agent", "beach", "resolved")
(df <- data.frame(id,step))
   id     step
1   1     call
2   2     call
3   2    agent
4   2   forest
5   2  forward
6   2 resolved
7   3     call
8   3    agent
9   3    beach
10  3 resolved

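I now want to split this information into two dedicated variables (step and topic). This gives the data frame fewer rows and makes it wider. The topic should be repeated on every row of its time series, with NA where an id has no topic. In base R, splitting the data into two data frames and merging them back together does the job: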

step <- subset(df, step %in% c("call", "agent", "forward", "resolved"))
topic <- subset(df, step %in% c("forest", "beach"))
topic$topic <- topic$step
topic$step <- NULL
(newdf <- merge(step,topic, all=TRUE))
  id     step  topic
1  1     call   <NA>
2  2     call forest
3  2    agent forest
4  2  forward forest
5  2 resolved forest
6  3     call  beach
7  3    agent  beach
8  3 resolved  beach

This works, but it is a bit clunky, and I am looking for a more elegant dplyr/tidyverse approach. pivot_wider() alone does not seem to be able to do this. Any ideas?

【Question comments】:

Tags: r dplyr

【Solution 1】:

This is not particularly elegant, but it works:

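library(dplyr)
library(tidyr)

# values of `step` that are real steps; everything else is treated as a topic
steps <- c("call", "agent", "forward", "resolved")
df %>%
  # `row` only increments on step rows, so a topic row shares the key of the step before it
  mutate(type = ifelse(step %in% steps, "step", "topic"),
         row = cumsum(type == "step")) %>%
  pivot_wider(names_from = type, values_from = step) %>%
  group_by(id) %>%
  fill(topic, .direction = "updown") %>%
  ungroup()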



# A tibble: 8 x 4
     id   row step     topic 
  <dbl> <int> <chr>    <chr> 
1     1     1 call     NA    
2     2     2 call     forest
3     2     3 agent    forest
4     2     4 forward  forest
5     2     5 resolved forest
6     3     6 call     beach 
7     3     7 agent    beach 
8     3     8 resolved beach
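
A note on the `row` helper in the answer above: pivot_wider() needs a key that pairs each topic with a step. The cumulative count of step rows provides one, because it does not advance on a topic row. The sketch below rebuilds the example data from the question and prints that intermediate table. The name `keyed` is only illustrative and is not part of the original answer.

```r
library(dplyr)

# example data from the question
id <- c(1, rep(2, 5), rep(3, 4))
step <- c("call", "call", "agent", "forest", "forward",
          "resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id, step)

steps <- c("call", "agent", "forward", "resolved")

# `keyed` is a hypothetical name for the table just before pivot_wider()
keyed <- df %>%
  mutate(type = ifelse(step %in% steps, "step", "topic"),
         row = cumsum(type == "step"))

keyed
```

In this table the rows for agent and forest (id 2) share row == 3, and the rows for agent and beach (id 3) share row == 7. After pivoting, each pair therefore collapses into a single line holding both a step and a topic. This relies on a topic never being the first row of its id.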

【Comments】:

【Solution 2】:

Thanks for providing a minimal example of the problem.

id <- c(1,rep(2,5),rep(3,4))
step <- c("call", "call", "agent", "forest", "forward",
  "resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id,step)
df
#>    id     step
#> 1   1     call
#> 2   2     call
#> 3   2    agent
#> 4   2   forest
#> 5   2  forward
#> 6   2 resolved
#> 7   3     call
#> 8   3    agent
#> 9   3    beach
#> 10  3 resolved

Here is a possible solution using the tidyverse:

library(dplyr)
library(tidyr)

df %>%
  # flag in column type_c whether each value of step is a step or a topic;
  # pivot_wider needs a unique id for each row in this case
  mutate(
    type_c = if_else(step %in% c("forest", "beach"), "topic", "step"),
    unique_id = row_number()) %>%
  pivot_wider(names_from = type_c, values_from = c(id, step)) %>%
  mutate(id = coalesce(id_step, id_topic)) %>%
  select(id, step = step_step, topic = step_topic) %>%
  # group_by is needed so that fill works within each id
  group_by(id) %>%
  # within each id, fill replaces NA in topic with the nearest value, looking down and then up ("downup")
  fill(topic, .direction = "downup") %>%
  # get rid of the NA rows in column step that pivot_wider created for each topic
  filter(!is.na(step))
#> # A tibble: 8 x 3
#> # Groups:   id [3]
#>      id step     topic 
#>   <dbl> <chr>    <chr> 
#> 1     1 call     <NA>  
#> 2     2 call     forest
#> 3     2 agent    forest
#> 4     2 forward  forest
#> 5     2 resolved forest
#> 6     3 call     beach 
#> 7     3 agent    beach 
#> 8     3 resolved beach

Created on 2021-06-08 by the reprex package (v0.3.0)

【Comments】:
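
If each id carries at most one topic row, as in the example data, the pivot can arguably be skipped altogether. The following is a minimal sketch under that assumption and is not part of either answer above. `topics` and `result` are illustrative names.

```r
library(dplyr)

# example data from the question
id <- c(1, rep(2, 5), rep(3, 4))
step <- c("call", "call", "agent", "forest", "forward",
          "resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id, step)

# assumption: the set of topic values is known up front
topics <- c("forest", "beach")

result <- df %>%
  group_by(id) %>%
  # take the first topic value found within the id;
  # indexing an empty vector with [1] yields NA, which covers ids without a topic
  mutate(topic = step[step %in% topics][1]) %>%
  ungroup() %>%
  # drop the rows that only carried the topic
  filter(!step %in% topics)

result
```

With more than one topic per id this sketch would keep only the first one, so the pivot-based answers may be the safer choice for such data.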