lapply data.table setDT in nested lists does not work or is not idempotent? Answers

【Question title】: lapply data.table setDT in nested lists does not work or is not idempotent?
【Posted】: 2020-05-03 07:34:30
【Question description】:

I have a nested list of data frames, and I only want to operate on the data frame items named "train_set".

library(data.table)

train_set <- data.frame(
  x = c(rep(2, 10)), 
  y = c(0:9), 
  z = c(rep("Factor1", 10)))

test_set <- data.frame(
  x = c(rep(1, 12)), 
  y = c(0:11), 
  z = c(rep("Factor2", 12)))

row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
row.names(test_set) <- c(paste("Observation", c(1:nrow(test_set)), sep = "_"))

top_list <- list(
  aa = list(train_set = train_set, test_set = test_set), 
  bb = list(train_set = train_set, test_set = test_set), 
  cc = list(train_set = train_set, test_set = test_set)
)

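The goal is to duplicate the rows in train_set, add a little noise, and name the copies accordingly. In the end I want to return a list with the same structure as the input list, but containing the modified train_set tables instead of the originals. Because the dplyr code I used for these operations was very slow, I got help here to improve performance by using data.table: "Speeding up dplyr pipe including checks with mutate_if and if_else on larger tables".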

However, to use data.table I have to convert these particular data.frames to data.tables. It is important that I keep the row.names as a column called "Sample", because I need the names.

# does not work on all elements, not run
# top_list <- lapply(top_list, function(next_level) lapply(next_level, setDT, keep.rownames = "Sample"))

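I tried nested lapply calls and for loops to convert only train_set, the whole list, or both train_set and test_set to data.tables. As mentioned in the answer linked above, the list has to be updated afterwards. But I cannot get this to work for the nested list. The code seems to work for the first iteration but not for the ones after it. Does anyone know how I can convert all of these data frames to data.tables and get the code below to run?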

result_list <- list()
counter <- 0
for (split_table in top_list) {
  counter <- counter +1
  current_name <- names(top_list)[counter]
  train_tmp <- split_table$train_set
  test_tmp <- split_table$test_set
  print(current_name)
  print(train_tmp)

  # either here or earlier turn DF into DT, but keep row.names
  setDT(train_tmp, keep.rownames = "Sample")
  print(train_tmp)  # gets ignored in the first iteration?
  # The row names are still present for the first iteration with item "aa"
  cols <- names(train_tmp)[sapply(train_tmp, is.numeric)]
  # this is the function to copy each row two times, add 10 % noise to each numeric column 
  # and append the Sample name with the copy number
  noised_copies <- lapply(c(1,2), function(n) {
    copy(train_tmp)[,
      # here we get the error as we need the column "Sample" to adjust the names of the replicated rows
      c("Sample", cols) := c(.(paste(Sample, n, sep=".")), 
        .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
      .SDcols=cols]
  })
  # combine original table and table with replicates
  train_noised <- rbindlist(c(noised_copies, list(train_tmp)), use.names = FALSE)
  # turn back into DF and add to result list
  setDF(train_noised, rownames = train_noised$Sample)
  train_noised$Sample <- NULL
  result_list[[current_name]] <- list(train_set = train_noised, test_set = test_tmp)
}
result_list
# it is important to have a clean workspace after each try
rm(top_list)

【Comments】:

I suggest using one large data.table instead of a list of data.tables: rbindlist(unlist(top_list, recursive=FALSE), use.names=FALSE, idcol="Table")
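To illustrate the single-table suggestion in the comment above, here is a minimal sketch on cut-down stand-ins for the question's data. The column names `Split` and `Set` and the use of `tstrsplit` to split the id column are my own additions, not part of the comment, and the split assumes the list names contain no ".".

```r
library(data.table)

# Small stand-ins for the question's data (same column layout, fewer rows).
train_set <- data.frame(x = rep(2, 3), y = 0:2, z = "Factor1")
test_set  <- data.frame(x = rep(1, 4), y = 0:3, z = "Factor2")
top_list <- list(
  aa = list(train_set = train_set, test_set = test_set),
  bb = list(train_set = train_set, test_set = test_set)
)

# Flatten one level: the names become "aa.train_set", "aa.test_set", ...
flat <- unlist(top_list, recursive = FALSE)

# Stack everything; idcol records which list entry each row came from.
big <- rbindlist(flat, use.names = FALSE, idcol = "Table")

# Split the id into its two parts (hypothetical column names).
big[, c("Split", "Set") := tstrsplit(Table, ".", fixed = TRUE)]
```

Note that data.tables have no row names, so if the row names matter (as they do in the question), they would need to be turned into a column before stacking.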

Tags: r list data.table

【Solution 1】:

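I also struggled to get lapply to work. I could get it to turn the data frames into data tables, but it refused to keep the row names.

I found that a simple double loop works. It may make a copy of each data frame before overwriting it, so I don't know whether it is efficient enough for your needs. With your data it takes about 6 milliseconds on my machine.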

for(i in 1:3) 
  for(j in 1:2) 
    top_list[[i]][[j]] <- as.data.table(top_list[[i]][[j]], keep.rownames = "Sample")

This gives

top_list
#> $`aa`
#> $`aa`$`train_set`
#>             Sample x y       z
#>  1:  Observation_1 2 0 Factor1
#>  2:  Observation_2 2 1 Factor1
#>  3:  Observation_3 2 2 Factor1
#>  4:  Observation_4 2 3 Factor1
#>  5:  Observation_5 2 4 Factor1
#>  6:  Observation_6 2 5 Factor1
#>  7:  Observation_7 2 6 Factor1
#>  8:  Observation_8 2 7 Factor1
#>  9:  Observation_9 2 8 Factor1
#> 10: Observation_10 2 9 Factor1
#> 
#> $`aa`$test_set
#>             Sample x  y       z
#>  1:  Observation_1 1  0 Factor2
#>  2:  Observation_2 1  1 Factor2
#>  3:  Observation_3 1  2 Factor2
#>  4:  Observation_4 1  3 Factor2
#>  5:  Observation_5 1  4 Factor2
#>  6:  Observation_6 1  5 Factor2
#>  7:  Observation_7 1  6 Factor2
#>  8:  Observation_8 1  7 Factor2
#>  9:  Observation_9 1  8 Factor2
#> 10: Observation_10 1  9 Factor2
#> 11: Observation_11 1 10 Factor2
#> 12: Observation_12 1 11 Factor2
#> 
#> 
#> $bb
#> $bb$`train_set`
#>             Sample x y       z
#>  1:  Observation_1 2 0 Factor1
#>  2:  Observation_2 2 1 Factor1
#>  3:  Observation_3 2 2 Factor1
#>  4:  Observation_4 2 3 Factor1
#>  5:  Observation_5 2 4 Factor1
#>  6:  Observation_6 2 5 Factor1
#>  7:  Observation_7 2 6 Factor1
#>  8:  Observation_8 2 7 Factor1
#>  9:  Observation_9 2 8 Factor1
#> 10: Observation_10 2 9 Factor1
#> 
#> $bb$test_set
#>             Sample x  y       z
#>  1:  Observation_1 1  0 Factor2
#>  2:  Observation_2 1  1 Factor2
#>  3:  Observation_3 1  2 Factor2
#>  4:  Observation_4 1  3 Factor2
#>  5:  Observation_5 1  4 Factor2
#>  6:  Observation_6 1  5 Factor2
#>  7:  Observation_7 1  6 Factor2
#>  8:  Observation_8 1  7 Factor2
#>  9:  Observation_9 1  8 Factor2
#> 10: Observation_10 1  9 Factor2
#> 11: Observation_11 1 10 Factor2
#> 12: Observation_12 1 11 Factor2
#> 
#> 
#> $cc
#> $cc$`train_set`
#>             Sample x y       z
#>  1:  Observation_1 2 0 Factor1
#>  2:  Observation_2 2 1 Factor1
#>  3:  Observation_3 2 2 Factor1
#>  4:  Observation_4 2 3 Factor1
#>  5:  Observation_5 2 4 Factor1
#>  6:  Observation_6 2 5 Factor1
#>  7:  Observation_7 2 6 Factor1
#>  8:  Observation_8 2 7 Factor1
#>  9:  Observation_9 2 8 Factor1
#> 10: Observation_10 2 9 Factor1
#> 
#> $cc$test_set
#>             Sample x  y       z
#>  1:  Observation_1 1  0 Factor2
#>  2:  Observation_2 1  1 Factor2
#>  3:  Observation_3 1  2 Factor2
#>  4:  Observation_4 1  3 Factor2
#>  5:  Observation_5 1  4 Factor2
#>  6:  Observation_6 1  5 Factor2
#>  7:  Observation_7 1  6 Factor2
#>  8:  Observation_8 1  7 Factor2
#>  9:  Observation_9 1  8 Factor2
#> 10: Observation_10 1  9 Factor2
#> 11: Observation_11 1 10 Factor2
#> 12: Observation_12 1 11 Factor2
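The loop above hard-codes the index ranges 1:3 and 1:2. As a sketch of how it might be generalised, the version below iterates with `seq_along` and converts only the element named "train_set", leaving test_set as a plain data frame. It runs on cut-down stand-ins for the question's data and is only one possible variant, not something tested on the real data.

```r
library(data.table)

# Cut-down stand-ins for the question's data, with row names set.
train_set <- data.frame(x = rep(2, 3), y = 0:2, z = "Factor1",
                        row.names = paste("Observation", 1:3, sep = "_"))
test_set <- data.frame(x = rep(1, 4), y = 0:3, z = "Factor2",
                       row.names = paste("Observation", 1:4, sep = "_"))
top_list <- list(
  aa = list(train_set = train_set, test_set = test_set),
  bb = list(train_set = train_set, test_set = test_set)
)

# Convert only train_set in each sub-list; as.data.table returns a new
# object, and the assignment puts it back into the list.
for (i in seq_along(top_list)) {
  top_list[[i]][["train_set"]] <- as.data.table(
    top_list[[i]][["train_set"]], keep.rownames = "Sample"
  )
}
```

Indexing by name rather than by position means the loop does not depend on train_set being the first element of each sub-list.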

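【Discussion】:

I will try this on my real data. I had the same idea, but could not assign the output of the for loop in a way that keeps the list structure.
I think this really works. There may be faster implementations, but compared with my previous dplyr version this is fast enough for my needs. I adapted the code to for (i in c(1:length(top_list))) for (j in c(1)) to generalise the outer loop and convert only the first table (train_set) in the inner loop. Still, I wonder what is wrong with the lapply call?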
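I didn't try reassigning inside lapply; you could try that. I use lapply a lot, but I think the loops are cleaner than a nested lapply here: because the call is a single line, you don't need braces.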
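For completeness, here is a sketch of what reassigning inside a nested lapply could look like. It swaps `setDT` for `as.data.table`: `setDT` converts its argument by reference, whereas `as.data.table` returns a new object, which sidesteps any by-reference side effects on list elements that were all built from the same two data frames. Whether that sharing is what breaks the original lapply call is not established in this thread, so treat it as a guess. The data are cut-down stand-ins for the question's tables.

```r
library(data.table)

# Cut-down stand-ins for the question's data, with row names set.
train_set <- data.frame(x = rep(2, 3), y = 0:2, z = "Factor1",
                        row.names = paste("Observation", 1:3, sep = "_"))
test_set <- data.frame(x = rep(1, 4), y = 0:3, z = "Factor2",
                       row.names = paste("Observation", 1:4, sep = "_"))
top_list <- list(
  aa = list(train_set = train_set, test_set = test_set),
  bb = list(train_set = train_set, test_set = test_set)
)

# The inner lapply converts every table of one sub-list, the outer lapply
# walks the sub-lists; the result is assigned back to top_list.
top_list <- lapply(top_list, function(next_level) {
  lapply(next_level, as.data.table, keep.rownames = "Sample")
})
```

lapply keeps the names of the list it iterates over, so the aa/bb and train_set/test_set structure survives the conversion.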