【Question title】: Create new variables based on specific factor levels in time series data with dplyr
【Posted】: 2021-06-08 20:27:39
【Question】:

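I have some time series data in which both the steps of a sequence (ranging from 1 to 8) and its topic (more than 100 different ones) are coded as character factor levels in a single variable. Here is a minimal example (I omitted the timestamps, which increase within each id):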

id <- c(1,rep(2,5),rep(3,4))
step <- c("call", "call", "agent", "forest", "forward", "resolved", "call", "agent", "beach", "resolved")
(df <- data.frame(id,step))
   id     step
1   1     call
2   2     call
3   2    agent
4   2   forest
5   2  forward
6   2 resolved
7   3     call
8   3    agent
9   3    beach
10  3 resolved

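I now want to split this information into two dedicated variables (step and topic), which reduces the number of rows and makes the data frame wider, while repeating the topic on every row of its time series and filling in NA where there is no topic. Splitting the data into two data frames with base R and merging them back together gets the job done: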

step <- subset(df, step %in% c("call", "agent", "forward", "resolved"))
topic <- subset(df, step %in% c("forest", "beach"))
topic$topic <- topic$step
topic$step <- NULL
(newdf <- merge(step,topic, all=TRUE))
  id     step  topic
1  1     call   <NA>
2  2     call forest
3  2    agent forest
4  2  forward forest
5  2 resolved forest
6  3     call  beach
7  3    agent  beach
8  3 resolved  beach

This works but is somewhat clumsy, so I am looking for a more elegant dplyr/tidyverse approach. pivot_wider() does not seem to be able to do this. Any ideas?

【Comments】:

    Tags: r dplyr


    【Solution 1】:

    This is not particularly elegant, but it works:

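    library(dplyr)
    library(tidyr)

    # values of `step` that are real steps; everything else is treated as a topic
    steps <- c("call", "agent", "forward", "resolved")
    df %>%
      mutate(type = ifelse(step %in% steps, "step", "topic"),
             # row index that only advances on steps, so a topic shares
             # the row of the step that precedes it
             row = cumsum(type == "step")) %>%
      pivot_wider(names_from = type, values_from = step) %>%
      group_by(id) %>%
      fill(topic, .direction = "updown") %>%
      ungroup()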
    
    
    
    # A tibble: 8 x 4
         id   row step     topic 
      <dbl> <int> <chr>    <chr> 
    1     1     1 call     NA    
    2     2     2 call     forest
    3     2     3 agent    forest
    4     2     4 forward  forest
    5     2     5 resolved forest
    6     3     6 call     beach 
    7     3     7 agent    beach 
    8     3     8 resolved beach 
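
To see why this works, it may help to look at the data just before pivot_wider() is called. The following is a minimal sketch that reuses the example data from the question; the name `intermediate` is only an illustrative choice and does not appear in the answer itself. Because `row` only increases on step rows, each topic row gets the same `row` value as the step directly before it, and pivot_wider() then places both on one line.

```r
library(dplyr)

# example data from the question
id <- c(1, rep(2, 5), rep(3, 4))
step <- c("call", "call", "agent", "forest", "forward",
          "resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id, step, stringsAsFactors = FALSE)

steps <- c("call", "agent", "forward", "resolved")

# `intermediate` is a hypothetical name used only for this illustration
intermediate <- df %>%
  mutate(type = ifelse(step %in% steps, "step", "topic"),
         row = cumsum(type == "step"))

# "forest" shares row 3 with the preceding "agent",
# and "beach" shares row 7 with the preceding "agent"
intermediate
```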
    

    【Comments】:

      【Solution 2】:

      Thanks for providing a minimal example of the problem.

      id <- c(1,rep(2,5),rep(3,4))
      step <- c("call", "call", "agent", "forest", "forward",
        "resolved", "call", "agent", "beach", "resolved")
      df <- data.frame(id,step)
      df
      #>    id     step
      #> 1   1     call
      #> 2   2     call
      #> 3   2    agent
      #> 4   2   forest
      #> 5   2  forward
      #> 6   2 resolved
      #> 7   3     call
      #> 8   3    agent
      #> 9   3    beach
      #> 10  3 resolved
      

      Here is a possible solution using the tidyverse:

      library(dplyr)
      library(tidyr)
      
      df %>%
        # type_c flags each row as either a step or a topic;
        # pivot_wider() needs a unique id per row in this case
        mutate(
          type_c = if_else(step %in% c("forest", "beach"), "topic", "step"),
          unique_id = row_number()) %>%
        pivot_wider(names_from = type_c, values_from = c(id, step)) %>%
        mutate(id = coalesce(id_step, id_topic)) %>%
        select(id, step = step_step, topic = step_topic) %>%
        # group by id so that fill() only works within each id
        group_by(id) %>%
        # replace NA in topic with the nearest value, looking down first and then up
        fill(topic, .direction = "downup") %>%
        # drop the rows with NA in step that pivot_wider() created for each topic
        filter(!is.na(step))
      #> # A tibble: 8 x 3
      #> # Groups:   id [3]
      #>      id step     topic 
      #>   <dbl> <chr>    <chr> 
      #> 1     1 call     <NA>  
      #> 2     2 call     forest
      #> 3     2 agent    forest
      #> 4     2 forward  forest
      #> 5     2 resolved forest
      #> 6     3 call     beach 
      #> 7     3 agent    beach 
      #> 8     3 resolved beach
      

      Created on 2021-06-08 by the reprex package (v0.3.0)
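
As a side note, the same reshaping can probably be done without pivot_wider() at all. The sketch below is not part of either answer; it assumes that each id has at most one topic row (as in the example data), and the names `topics` and `result` are illustrative. Within each id it picks the first value of `step` that is a topic, which yields NA when there is none, and then drops the topic rows.

```r
library(dplyr)

# example data from the question
id <- c(1, rep(2, 5), rep(3, 4))
step <- c("call", "call", "agent", "forest", "forward",
          "resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id, step, stringsAsFactors = FALSE)

# assumption: every value listed here is a topic, everything else is a step
topics <- c("forest", "beach")

result <- df %>%
  group_by(id) %>%
  # first topic found within the id; indexing an empty vector with [1] gives NA
  mutate(topic = step[step %in% topics][1]) %>%
  ungroup() %>%
  # keep only the rows that are real steps
  filter(!step %in% topics)

result
```

If an id could carry several topics, this sketch would silently keep only the first one, so the fill()-based solutions above are the safer choice in that case.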

      【Comments】:
