【Question title】: Convert nested list to dataframe: extract only specific elements of interest
【Posted】: 2021-04-05 04:02:13
【Question】:

I have seen many similar questions but could not adapt them to my case. My data comes as a nested list, and I want to convert it to a data frame.

my_data_object <-
  list(my_variables = list(
    age = list(
      type = "numeric",
      originType = "slider",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 5L,
      title = "what is your age?",
      valueDescriptions = NULL
    ),
    med_field = list(
      type = "string",
      originType = "choice",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 6L,
      title = "what medical branch are you at?",
      valueDescriptions = list(card = "Cardiology", ophth = "Ophthalmology",
                               derm = "Dermatology")
    ),
    covid_vaccine = list(
      type = "string",
      originType = "choice",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 8L,
      title = "when do you plan to get vaccinated?",
      valueDescriptions = list(
        next_mo = "No later than next month",
        within_six_mo = "No later than six months from now",
        never = "I will not get vaccinated"
      )
    )
  ))

Desired output

  var_name      type    originType title                              
  <chr>         <chr>   <chr>      <chr>                              
1 age           numeric slider     what is your age?                  
2 med_field     string  choice     what medical branch are you at?    
3 covid_vaccine string  choice     when do you plan to get vaccinated?

My failed attempt

library(tibble)
library(tidyr)

my_data_object %>% 
  enframe() %>% 
  unnest_longer(value) %>% 
  unnest(value)

## # A tibble: 18 x 3
##    name         value            value_id     
##    <chr>        <named list>     <chr>        
##  1 my_variables <chr [1]>        age          
##  2 my_variables <chr [1]>        age          
##  3 my_variables <named list [0]> age          
##  4 my_variables <int [1]>        age          
##  5 my_variables <chr [1]>        age          
##  6 my_variables <NULL>           age          
##  7 my_variables <chr [1]>        med_field    
##  8 my_variables <chr [1]>        med_field    
##  9 my_variables <named list [0]> med_field    
## 10 my_variables <int [1]>        med_field    
## 11 my_variables <chr [1]>        med_field    
## 12 my_variables <named list [3]> med_field    
## 13 my_variables <chr [1]>        covid_vaccine
## 14 my_variables <chr [1]>        covid_vaccine
## 15 my_variables <named list [0]> covid_vaccine
## 16 my_variables <int [1]>        covid_vaccine
## 17 my_variables <chr [1]>        covid_vaccine
## 18 my_variables <named list [3]> covid_vaccine

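I am trying to do this with tidyverse functions, but so far I don't seem to be heading in the right direction. Any help would be appreciated.

Edit

Unlike the sample data I originally provided, my real data has a different hierarchy. I thought that once I had a method it would be easy to generalize, but that turned out not to be the case. So suppose the data looks like the following, where I only care about the my_variables sub-list.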

my_data_object_2 <-
  list(
  other_variables = list(
    whatever_var_1 = list(
      type = "numeric",
      originType = "slider",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 5L,
      title = "blah question",
      valueDescriptions = NULL
    )
  ),
  my_variables = list(
    age = list(
      type = "numeric",
      originType = "slider",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 5L,
      title = "what is your age?",
      valueDescriptions = NULL
    ),
    med_field = list(
      type = "string",
      originType = "choice",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 6L,
      title = "what medical branch are you at?",
      valueDescriptions = list(card = "Cardiology", ophth = "Ophthalmology",
                               derm = "Dermatology")
    ),
    covid_vaccine = list(
      type = "string",
      originType = "choice",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 8L,
      title = "when do you plan to get vaccinated?",
      valueDescriptions = list(
        next_mo = "No later than next month",
        within_six_mo = "No later than six months from now",
        never = "I will not get vaccinated"
      )
    )
  )
)

So how can I "zoom in on" / "extract" my_variables, and only then get the table I specified in "Desired output" above?

【Comments】:

    Tags: r tidyr nested-lists purrr tibble


    【Solution 1】:

    You can flatten the object, then use enframe and unnest_wider to create the new columns.

    library(tidyverse)
    
    my_data_object %>% 
      flatten() %>%
      tibble::enframe() %>%
      unnest_wider(value)
      
    #  name          type    originType originIndex title                               valueDescriptions
    #  <chr>         <chr>   <chr>            <int> <chr>                               <list>           
    #1 age           numeric slider               5 what is your age?                   <NULL>           
    #2 med_field     string  choice               6 what medical branch are you at?     <named list [3]> 
    #3 covid_vaccine string  choice               8 when do you plan to get vaccinated? <named list [3]> 
    

    You can then drop the columns you don't need.


    Using only my_data_object_2$my_variables:

    my_data_object_2$my_variables %>%
      tibble::enframe() %>%
      unnest_wider(value)
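
If only the four columns from the desired output are needed, one alternative is to pull them out directly with tidyr::hoist() instead of widening every field and dropping columns afterwards. This is a minimal sketch rather than part of the original answer: it assumes dplyr is installed, it swaps unnest_wider() for hoist(), and the trimmed my_data_object_2 and the name vars_only are illustrative stand-ins.

```r
library(purrr)
library(tibble)
library(tidyr)
library(dplyr)

# Trimmed stand-in for the question's my_data_object_2 (assumption: same shape)
my_data_object_2 <- list(
  other_variables = list(
    whatever_var_1 = list(type = "numeric", originType = "slider",
                          originIndex = 5L, title = "blah question",
                          valueDescriptions = NULL)
  ),
  my_variables = list(
    age = list(type = "numeric", originType = "slider",
               originIndex = 5L, title = "what is your age?",
               valueDescriptions = NULL),
    med_field = list(type = "string", originType = "choice",
                     originIndex = 6L,
                     title = "what medical branch are you at?",
                     valueDescriptions = list(card = "Cardiology",
                                              ophth = "Ophthalmology",
                                              derm = "Dermatology")),
    covid_vaccine = list(type = "string", originType = "choice",
                         originIndex = 8L,
                         title = "when do you plan to get vaccinated?",
                         valueDescriptions = list(never = "I will not get vaccinated"))
  )
)

vars_only <- my_data_object_2 %>%
  chuck("my_variables") %>%                      # quoted name; errors if the element is absent
  enframe(name = "var_name") %>%                 # one row per variable, details in `value`
  hoist(value, "type", "originType", "title") %>% # pull just these fields into columns
  select(var_name, type, originType, title)      # drop the leftover `value` list-column

vars_only
```

Because hoist() only touches the named fields, nested or empty elements such as valueDescriptions never have to be widened.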
    

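    【Comments】:

    • This tidyverse solution is neat. Could you take a look at my recent edit to the post? Basically I am looking for an intermediate step between my_data_object and the other functions in the pipe, because I only want to focus on the my_variables sub-list. I tried including purrr::chuck(my_variables) in this pipe but got an error.
    • @Emman Did you try replacing my_data_object with my_data_object_2 in my answer? It does give a 4 x 6 tibble. You can drop the rows/columns you don't need from there.
    • Yes, I tried it on my_data_object_2 and it works as you describe, but it would be cleaner if I could focus only on my_variables, which essentially holds all the names I need, rather than getting a huge tibble (in my real data) and then having to filter by specific names. I am trying to automate as much as possible...
    • Perfect, thanks! I replaced my_data_object_2$my_variables with purrr::chuck(), so the pipe is now: my_data_object_2 %>% purrr::chuck("my_variables") %>% tibble::enframe() %>% tidyr::unnest_wider(value)
    【Solution 2】:

    Iterate over my_data_object, tibblifying the specified columns, and use map_dfr to put them together (or possibly fun(my_data_object$my_variables) would be sufficient, depending on what the general case is). There are no missing fields in the example data, but if any of the 3 spec'd fields could be missing, add .default = NA as an argument to lcol_chr in that field's spec.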

    library(purrr)
    library(tibblify)
    
    spec <-  lcols(
      lcol_chr("type"),
      lcol_chr("originType"),
      lcol_chr("title")
    )
    fun <- function(x) cbind(var_name = names(x), tibblify(x, spec))
    
    map_dfr(my_data_object, fun)
    

    giving:

           var_name    type originType                               title
    1           age numeric     slider                   what is your age?
    2     med_field  string     choice     what medical branch are you at?
    3 covid_vaccine  string     choice when do you plan to get vaccinated?
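
The missing-field case mentioned above can also be illustrated without tibblify. The sketch below swaps in purrr's name-based extraction with its .default argument, so it is not the answer's method; the input vars (where med_field has no title) and the helper get_field are made up for illustration.

```r
library(purrr)
library(tibble)

# Hypothetical input: shaped like my_variables, but `title` is absent for med_field
vars <- list(
  age = list(type = "numeric", originType = "slider",
             title = "what is your age?"),
  med_field = list(type = "string", originType = "choice")
)

# Extract one named field from every variable; absent fields become NA
get_field <- function(x, field) {
  unname(map_chr(x, field, .default = NA_character_))
}

res_na <- tibble(
  var_name   = names(vars),
  type       = get_field(vars, "type"),
  originType = get_field(vars, "originType"),
  title      = get_field(vars, "title")
)

res_na
```

Without .default, map_chr() would stop with an error on the entry that lacks title, which is the same problem the .default = NA setting guards against in the spec.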
    

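    Depending on what the general case is, this simplification from @mgirlich (similar to the alternative mentioned in the introduction to this answer) might work. spec is taken from above.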

    library(tibblify)
    
    cbind(
      var_name = names(my_data_object[[1]]),
      tibblify(my_data_object[[1]], spec)
    )
    

    【Comments】:

      【Solution 3】:

      As usual, use lapply to select the specific columns and simply rbind them.

      res <- do.call(rbind.data.frame, 
                     lapply((my_data_object)[[1]], `[`, c("type", "originType", "title")))
      res
      #                  type originType                               title
      # age           numeric     slider                   what is your age?
      # med_field      string     choice     what medical branch are you at?
      # covid_vaccine  string     choice when do you plan to get vaccinated?
      

      If you want the row names in the first column, do the following:

      `rownames<-`(cbind(var=rownames(res), res), NULL)
      #             var    type originType                               title
      # 1           age numeric     slider                   what is your age?
      # 2     med_field  string     choice     what medical branch are you at?
      # 3 covid_vaccine  string     choice when do you plan to get vaccinated?
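
For the edited question, where my_variables sits next to other sub-lists, the same base R idea can be applied after selecting that sub-list by name. This is a sketch under assumptions, not the answer's code: the trimmed my_data_object_2 and the names vars, fields, cols and res2 are illustrative, and each requested field is assumed to be a length-one character value.

```r
# Trimmed stand-in for the question's my_data_object_2 (assumption: same shape)
my_data_object_2 <- list(
  other_variables = list(
    whatever_var_1 = list(type = "numeric", originType = "slider",
                          title = "blah question")
  ),
  my_variables = list(
    age = list(type = "numeric", originType = "slider",
               title = "what is your age?"),
    med_field = list(type = "string", originType = "choice",
                     title = "what medical branch are you at?"),
    covid_vaccine = list(type = "string", originType = "choice",
                         title = "when do you plan to get vaccinated?")
  )
)

# Keep only the sub-list of interest, then build one column per field
vars   <- my_data_object_2[["my_variables"]]
fields <- c("type", "originType", "title")

cols <- lapply(fields, function(f) {
  # vapply enforces that every variable has exactly one string for field f
  vapply(vars, function(v) v[[f]], character(1), USE.NAMES = FALSE)
})
names(cols) <- fields

res2 <- data.frame(var_name = names(vars), cols,
                   row.names = NULL, stringsAsFactors = FALSE)
res2
```

Building the columns up front puts the variable names in var_name directly, so no row-name juggling is needed afterwards.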
      

      【Comments】:
