【Question title】: Extracting data from a list of lists into its own `data.frame` with `purrr`
【Posted】: 2017-07-25 21:32:29
【Question】:

Representative sample data (a list of lists):

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr", 
    score = -0.21104594634643), .Names = c("id", "label", 
"link", "score")), e = 49.1279871269422), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.934821052832427, 
b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", 
    link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 9L, label = "Nebraska", link = "America/Scoresbysund", 
    score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", 
"label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", 
c = "P", d = list(structure(list(id = 8L, label = "Georgia", 
    link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 2L, label = "Washington", link = "America/Shiprock", 
    score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 6L, label = "North Dakota", link = "Universal", 
    score = 1.03168296038975), .Names = c("id", "label", 
"link", "score")), structure(list(id = 1L, label = "New Hampshire", 
    link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", 
"label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", 
    link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", 
"label", "link", "score"))), e = 132.1153538536), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = 0.202685974077313, b = "x", 
c = "O", d = structure(list(id = 3L, label = "Delaware", 
    link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", 
"label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.396243444741009, 
b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", 
    link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 9L, label = "Nebraska", link = "America/Ojinaga", 
    score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", 
"label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", 
"b", "c", "d", "e")))

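I have a list of lists, courtesy of a JSON data download.

The list has 176 elements, each with 33 nested elements, some of which are themselves lists of varying length.

I am interested in analyzing the data in one particular nested list. For each of the 176 elements it has a length of about 150, and each of its entries has either 4 or 5 elements. I am trying to extract this nested list and convert it into a data.frame so that I can run some analysis.

In the representative sample data above, I am interested in the nested list d for each of the 5 elements of l. The desired data.frame would therefore look like: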

id           label            link       score  externalId
 5            Utah     Asia/Anadyr  -0.2110459          NA
 8  South Carolina  Pacific/Wallis   0.5265409   -6.743544
 .
 .

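I have been trying to use purrr, which appears to offer a sensible and consistent flow for working with data in lists, but I am running into errors whose cause I cannot fully work out. Quite possibly I do not properly understand the commands/logic of purrr or of lists (likely both). This is the code I have been trying, along with the error it throws: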

df <- map_df(l, "d", ~as.data.frame(.))
Error: incompatible sizes (5 != 4)

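I believe this has to do with the differing lengths of d for each component, or perhaps with the differing data they contain (sometimes 4 elements, sometimes 5), or perhaps the function I am using here is misspecified. To be honest, I am not entirely sure.

I have worked around this with a for loop, which I know is inefficient, hence my question here on SO.

This is the for loop I currently use: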

df <- data.frame(id = integer(), label = character(), score = numeric(), externalId = numeric())
for(i in seq_along(l)){
    df_temp <- l[[i]][[4]] %>% map_df(~as.data.frame(.))
    df <- rbind(df, df_temp)
}

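Some help using purrr (or some version of apply, as that would still be better than my for loop) would be greatly appreciated. Also, if there are resources covering the above, I would like to understand this rather than just find the right code.

【Comments】:

    Tags: r list dplyr purrr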


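    【Solution 1】:

    You can do this in three steps: first pull out d, then bind the rows within each element of d, and finally bind everything into a single object.

    I use bind_rows from dplyr for the within-list row binding. map_df does the final row binding.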

    library(purrr)
    library(dplyr)
    
    l %>%
        map("d") %>%
        map_df(bind_rows)
    

    This is equivalent:

    map_df(l, ~bind_rows(.x[["d"]] ) )
    

    The result looks like this:

    # A tibble: 12 x 5
          id          label                 link      score externalId
       <int>          <chr>                <chr>      <dbl>      <dbl>
     1     5           Utah          Asia/Anadyr -0.2110459         NA
     2     8 South Carolina       Pacific/Wallis  0.5265409  -6.743544
     3     9       Nebraska America/Scoresbysund  0.2508955  16.425747
     4     8        Georgia         America/Nome  0.5264941   7.915836
     5     2     Washington     America/Shiprock -0.5551864  15.068666
     6     6   North Dakota            Universal  1.0316830         NA
     7     1  New Hampshire      America/Cordoba  1.2158206   9.727642
     8     1         Alaska        Asia/Istanbul -0.2318326         NA
     9     4   Pennsylvania Africa/Dar_es_Salaam  0.5902453         NA
    10     3       Delaware       Asia/Samarkand  0.6955771  15.236482
    11     4   North Dakota      America/Tortola  1.0306027  -7.216669
    12     9       Nebraska      America/Ojinaga -1.1139800  -8.451451
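
The same three steps can also be written without map_df, which recent purrr releases (1.0.0 and later) mark as superseded in favour of map() followed by list_rbind(). The sketch below is only an illustration under stated assumptions: l_small is a hypothetical, cut-down stand-in for the question's l, not the real data.

```r
library(purrr)
library(dplyr)

# Hypothetical miniature of `l`: `d` is either one record or a list of records.
l_small <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

res <- l_small %>%
  map("d") %>%        # step 1: pull out `d` from every element
  map(bind_rows) %>%  # step 2: turn each `d` into one tibble
  list_rbind()        # step 3: stack the tibbles (needs purrr >= 1.0.0)

res
```

Records that lack externalId should get NA in that column, as in the output above.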
    

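    【Comments】:

      【Solution 2】:

      For more on purrr, I recommend Grolemund and Wickham's "R for Data Science": http://r4ds.had.co.nz/

      I think one problem you are facing is that some of the d items in l are lists of variables with one observation each, which can be converted to a data frame, while others are lists of such lists.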
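
One way to see this distinction is to test whether each d has names at its top level: a single record is a named list, while a list of records is unnamed. The following is a minimal base R sketch on a hypothetical, cut-down stand-in for l (l_small is an illustrative name, not part of the question).

```r
# Hypothetical miniature of the question's `l`.
l_small <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

# TRUE where `d` is a single record (named list),
# FALSE where `d` is an unnamed list of records.
is_single_record <- vapply(
  l_small,
  function(x) !is.null(names(x$d)),
  logical(1)
)

is_single_record
```

Applied to the sample l from the question, the same check should flag the first and fourth elements as single records.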

      I'm not proficient with purrr myself, though. Here is how I would do it:

      library(dplyr)  # provides %>% and bind_rows()
      
      l <- lapply(l, function(x){x$d}) ## work with the data you need.
      
      list_of_observations <- Filter(function(x) {!is.null(names(x))},l)
      
      list_of_lists <- Filter(function(x) {is.null(names(x))}, l)
      
      another_list_of_observations <- unlist(list_of_lists, recursive=FALSE)
      
      df <- lapply(c(list_of_observations, another_list_of_observations),
                   as.data.frame) %>% bind_rows
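
A possible variant of the same idea, sketched here on hypothetical data, wraps every lone record in a list first so that each d becomes a list of records; everything can then be flattened in one pass. The names l_small and as_records are illustrative assumptions, not part of the answer above.

```r
library(dplyr)

# Hypothetical miniature of the question's `l`.
l_small <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

# Illustrative helper: wrap a lone record (named list) in a list,
# leave an unnamed list of records untouched.
as_records <- function(d) if (is.null(names(d))) d else list(d)

# Flatten one level so that every entry is a single record.
records <- unlist(
  lapply(l_small, function(x) as_records(x$d)),
  recursive = FALSE
)

# bind_rows() fills fields missing from a record with NA.
df_small <- bind_rows(records)
df_small
```

Unlike splitting the data with two Filter calls, this keeps the rows in their original order, because single records are not moved ahead of the nested ones.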
      

      【Comments】:
