Convert nested list to dataframe: extract only specific elements of interest

[Question title]: Convert nested list to dataframe: extract only specific elements of interest
[Posted]: 2021-04-05 04:02:13
[Question description]:

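I've seen many similar questions but couldn't adapt them to my case. I have data that comes as a nested list, and I want to convert it into a data frame in a particular way.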

my_data_object <-
  list(my_variables = list(
    age = list(
      type = "numeric",
      originType = "slider",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 5L,
      title = "what is your age?",
      valueDescriptions = NULL
    ),
    med_field = list(
      type = "string",
      originType = "choice",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 6L,
      title = "what medical branch are you at?",
      valueDescriptions = list(card = "Cardiology", ophth = "Ophthalmology",
                               derm = "Dermatology")
    ),
    covid_vaccine = list(
      type = "string",
      originType = "choice",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 8L,
      title = "when do you plan to get vaccinated?",
      valueDescriptions = list(
        next_mo = "No later than next month",
        within_six_mo = "No later than six months from now",
        never = "I will not get vaccinated"
      )
    )
  ))

Desired output

  var_name      type    originType title                              
  <chr>         <chr>   <chr>      <chr>                              
1 age           numeric slider     what is your age?                  
2 med_field     string  choice     what medical branch are you at?    
3 covid_vaccine string  choice     when do you plan to get vaccinated?

My failed attempt

library(tibble)
library(tidyr)

my_data_object %>% 
  enframe() %>% 
  unnest_longer(value) %>% 
  unnest(value)

## # A tibble: 18 x 3
##    name         value            value_id     
##    <chr>        <named list>     <chr>        
##  1 my_variables <chr [1]>        age          
##  2 my_variables <chr [1]>        age          
##  3 my_variables <named list [0]> age          
##  4 my_variables <int [1]>        age          
##  5 my_variables <chr [1]>        age          
##  6 my_variables <NULL>           age          
##  7 my_variables <chr [1]>        med_field    
##  8 my_variables <chr [1]>        med_field    
##  9 my_variables <named list [0]> med_field    
## 10 my_variables <int [1]>        med_field    
## 11 my_variables <chr [1]>        med_field    
## 12 my_variables <named list [3]> med_field    
## 13 my_variables <chr [1]>        covid_vaccine
## 14 my_variables <chr [1]>        covid_vaccine
## 15 my_variables <named list [0]> covid_vaccine
## 16 my_variables <int [1]>        covid_vaccine
## 17 my_variables <chr [1]>        covid_vaccine
## 18 my_variables <named list [3]> covid_vaccine

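I'm trying to do this with tidyverse functions, but so far I don't seem to be heading in the right direction. Any help would be appreciated.

EDIT

Unlike the sample data I originally provided, my actual data has a different hierarchy. I thought that once I had a method it would be easy to generalize, but it isn't. So suppose the data looks like the following, where I only care about the my_variables sublist.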

my_data_object_2 <-
  list(
  other_variables = list(
    whatever_var_1 = list(
      type = "numeric",
      originType = "slider",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 5L,
      title = "blah question",
      valueDescriptions = NULL
    )
  ),
  my_variables = list(
    age = list(
      type = "numeric",
      originType = "slider",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 5L,
      title = "what is your age?",
      valueDescriptions = NULL
    ),
    med_field = list(
      type = "string",
      originType = "choice",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 6L,
      title = "what medical branch are you at?",
      valueDescriptions = list(card = "Cardiology", ophth = "Ophthalmology",
                               derm = "Dermatology")
    ),
    covid_vaccine = list(
      type = "string",
      originType = "choice",
      originSettings = structure(list(), .Names = character(0)),
      originIndex = 8L,
      title = "when do you plan to get vaccinated?",
      valueDescriptions = list(
        next_mo = "No later than next month",
        within_six_mo = "No later than six months from now",
        never = "I will not get vaccinated"
      )
    )
  )
)

So how can I "zoom in on"/"extract" my_variables, and only that, to get the table I specified under "Desired output" above?

[Discussion]:

Tags: r tidyr nested-lists purrr tibble

[Solution 1]:

You can flatten the object, then use enframe and unnest_wider to create the new columns.

library(tidyverse)

my_data_object %>% 
  flatten() %>%
  tibble::enframe() %>%
  unnest_wider(value)
  
#  name          type    originType originIndex title                               valueDescriptions
#  <chr>         <chr>   <chr>            <int> <chr>                               <list>           
#1 age           numeric slider               5 what is your age?                   <NULL>           
#2 med_field     string  choice               6 what medical branch are you at?     <named list [3]> 
#3 covid_vaccine string  choice               8 when do you plan to get vaccinated? <named list [3]>

You can then drop the columns you don't need.

Using only my_data_object_2$my_variables:

my_data_object_2$my_variables %>%
  tibble::enframe() %>%
  unnest_wider(value)
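
To make the "drop the columns you don't need" step concrete, here is a minimal sketch that keeps only the columns from the desired output. The object `nested` is a trimmed-down stand-in for my_data_object_2 (an assumption made to keep the example short), and using dplyr::select for the final step is one option among several.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Trimmed-down stand-in for my_data_object_2 (hypothetical, for illustration only)
nested <- list(
  other_variables = list(
    whatever_var_1 = list(type = "numeric", originType = "slider",
                          originIndex = 5L, title = "blah question")
  ),
  my_variables = list(
    age = list(type = "numeric", originType = "slider",
               originIndex = 5L, title = "what is your age?"),
    med_field = list(type = "string", originType = "choice",
                     originIndex = 6L, title = "what medical branch are you at?")
  )
)

result <- nested$my_variables %>%
  enframe(name = "var_name") %>%            # one row per variable, name goes to var_name
  unnest_wider(value) %>%                   # one column per field of each sublist
  select(var_name, type, originType, title) # keep only the columns of interest

result
```

Naming the column in enframe() avoids a separate rename step later.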

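[Discussion]:

This tidyverse solution is neat. Could you take a look at my recent edit to the post? Basically I'm looking for an intermediate step between my_data_object and the other functions in the pipe, because I only want to focus on the my_variables sublist. I tried including purrr::chuck(my_variables) in this pipe, but got an error.
@Emman Did you try my answer with my_data_object replaced by my_data_object_2? It does give a 4 x 6 tibble. You can drop the unwanted rows/columns from there.
Yes, I tried it on my_data_object_2 and it works as you describe, but it would be cleaner if I could focus only on my_variables, which essentially carries all the names I need, rather than getting a huge tibble (in my real data) and then having to filter by specific names. I'm trying to automate as much as possible...
Perfect, thanks! I replaced my_data_object_2$my_variables with purrr::chuck(), so the pipe is now: my_data_object_2 %>% purrr::chuck("my_variables") %>% tibble::enframe() %>% tidyr::unnest_wider(value)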
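
As a sketch of the pipeline from the last comment: the name has to be passed to chuck() as a string. The earlier unquoted attempt presumably failed because my_variables was looked up as an R object. The example also contrasts chuck(), which errors on a missing element, with pluck(), which returns NULL. The list `nested` is a hypothetical stand-in for my_data_object_2.

```r
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical stand-in for my_data_object_2
nested <- list(
  other_variables = list(
    whatever_var_1 = list(type = "numeric", originType = "slider",
                          title = "blah question")
  ),
  my_variables = list(
    age = list(type = "numeric", originType = "slider",
               title = "what is your age?"),
    med_field = list(type = "string", originType = "choice",
                     title = "what medical branch are you at?")
  )
)

# Zoom into the sublist first, so other_variables never reaches the tibble
zoomed <- nested %>%
  chuck("my_variables") %>%
  enframe(name = "var_name") %>%
  unnest_wider(value)

# pluck() returns NULL for a missing element; chuck() throws an error instead
missing_ok  <- pluck(nested, "no_such_sublist")
missing_err <- tryCatch(chuck(nested, "no_such_sublist"),
                        error = function(e) "error")
```

In an automated pipeline the stricter chuck() may be the safer choice, since a misspelled sublist name fails immediately rather than propagating NULL.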

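[Solution 2]:

Iterate over my_data_object, tibblifying the specified columns, and put them together with map_dfr (or possibly fun(my_data_object$my_variables) would be sufficient, depending on what the general case is). There are no missing fields in the sample data, but if any of the 3 fields in the spec could be missing, add .default = NA as an argument to lcol_chr in that field's spec.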

library(purrr)
library(tibblify)

spec <-  lcols(
  lcol_chr("type"),
  lcol_chr("originType"),
  lcol_chr("title")
)
fun <- function(x) cbind(var_name = names(x), tibblify(x, spec))

map_dfr(my_data_object, fun)

giving:

       var_name    type originType                               title
1           age numeric     slider                   what is your age?
2     med_field  string     choice     what medical branch are you at?
3 covid_vaccine  string     choice when do you plan to get vaccinated?

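Depending on what the general case is, this simplification from @mgirlich (similar to the alternative mentioned in the introduction of this answer) may work. spec is from above.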

library(tibblify)

cbind(
  var_name = names(my_data_object[[1]]),
  tibblify(my_data_object[[1]], spec)
)
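
The point about missing fields can also be illustrated without tibblify, whose interface may differ between versions. The following is a sketch of the same idea using purrr::pluck with .default, not the tibblify approach itself. The list `vars` and the helper `get_field` are hypothetical names introduced for this example.

```r
library(purrr)
library(tibble)

# Hypothetical input: "age" deliberately lacks the originType field
vars <- list(
  age = list(type = "numeric", title = "what is your age?"),
  med_field = list(type = "string", originType = "choice",
                   title = "what medical branch are you at?")
)

# Hypothetical helper: extract one field per variable, NA where it is absent
get_field <- function(x, field) {
  unname(map_chr(x, function(v) pluck(v, field, .default = NA_character_)))
}

result <- tibble(
  var_name   = names(vars),
  type       = get_field(vars, "type"),
  originType = get_field(vars, "originType"),
  title      = get_field(vars, "title")
)

result
```

Each column is built independently, so a missing field in one variable yields NA in that cell rather than an error.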

[Discussion]:

[Solution 3]:

Use lapply as usual to select the specific columns, then just rbind them.

res <- do.call(rbind.data.frame, 
               lapply((my_data_object)[[1]], `[`, c("type", "originType", "title")))
res
#                  type originType                               title
# age           numeric     slider                   what is your age?
# med_field      string     choice     what medical branch are you at?
# covid_vaccine  string     choice when do you plan to get vaccinated?

If you want the row names in the first column, do the following:

`rownames<-`(cbind(var=rownames(res), res), NULL)
#             var    type originType                               title
# 1           age numeric     slider                   what is your age?
# 2     med_field  string     choice     what medical branch are you at?
# 3 covid_vaccine  string     choice when do you plan to get vaccinated?
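
For the hierarchy in the question's edit, the same base R idea can be pointed at the my_variables sublist by name. This is a sketch with a hypothetical stand-in list `nested`. It converts each sublist to a one-row data frame before binding, which is a slight variation on the rbind.data.frame call above.

```r
# Hypothetical stand-in for my_data_object_2
nested <- list(
  other_variables = list(
    whatever_var_1 = list(type = "numeric", originType = "slider",
                          title = "blah question")
  ),
  my_variables = list(
    age = list(type = "numeric", originType = "slider",
               title = "what is your age?"),
    med_field = list(type = "string", originType = "choice",
                     title = "what medical branch are you at?")
  )
)

fields <- c("type", "originType", "title")

# One single-row data frame per variable, restricted to the fields of interest
rows <- lapply(nested[["my_variables"]], function(v) {
  as.data.frame(v[fields], stringsAsFactors = FALSE)
})

# The list names become row names, which are then moved into a first column
res <- do.call(rbind, rows)
res <- cbind(var_name = rownames(res), res)
rownames(res) <- NULL
res
```

Selecting the sublist by name rather than by position (`[[1]]`) keeps the code working if the order of the top-level elements changes.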

[Discussion]: