[Posted]: 2020-05-03 07:34:30
[Question]:
I have a nested list of data frames, and I want to perform an operation only on the data frame elements named "train_set".
library(data.table)
train_set <- data.frame(
  x = c(rep(2, 10)),
  y = c(0:9),
  z = c(rep("Factor1", 10)))
test_set <- data.frame(
  x = c(rep(1, 12)),
  y = c(0:11),
  z = c(rep("Factor2", 12)))
row.names(train_set) <- c(paste("Observation", c(1:nrow(train_set)), sep = "_"))
row.names(test_set) <- c(paste("Observation", c(1:nrow(test_set)), sep = "_"))
top_list <- list(
  aa = list(train_set = train_set, test_set = test_set),
  bb = list(train_set = train_set, test_set = test_set),
  cc = list(train_set = train_set, test_set = test_set)
)
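The goal is to replicate the rows in train_set, add a little noise, and name the copies accordingly. In the end I want to return a list with the same structure as the input list, but containing the modified train_set tables instead of the originals. Because the dplyr code I was using for these operations was very slow, I got help here to improve performance with data.table: Speeding up dplyr pipe including checks with mutate_if and if_else on larger tables
However, to use data.table I have to convert these specific data.frames to data.tables. It is important that I keep the row.names as a "Sample" column, because I need the names.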
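For reference, the row-name requirement on a single table can be sketched as follows. This is a minimal illustration on a hypothetical toy data frame (df and dt are not from the code in this question); it uses as.data.table(), which returns a new object rather than converting in place.

```r
library(data.table)

# Hypothetical toy data frame with meaningful row names
df <- data.frame(x = c(2, 2, 2), y = 0:2)
row.names(df) <- paste("Observation", seq_len(nrow(df)), sep = "_")

# keep.rownames = "Sample" stores the row names in a column called "Sample";
# as.data.table() returns a new object, so df itself stays a plain data.frame
dt <- as.data.table(df, keep.rownames = "Sample")

print(names(dt))
print(dt$Sample)
```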
# does not work on all elements, not run
# top_list <- lapply(top_list, function(next_level) lapply(next_level, setDT, keep.rownames = "Sample"))
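I tried nested lapply calls and for loops to convert only train_set, the whole list, or both train_set and test_set to data.tables; as mentioned in the answer linked above, the list has to be updated. But I cannot get it to work for this nested list. The code seems to work for the first iteration but not for the later ones. Does anyone know how I can convert all of these data.frames to data.tables and get the code below to run?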
result_list <- list()
counter <- 0
for (split_table in top_list) {
  counter <- counter + 1
  current_name <- names(top_list)[counter]
  train_tmp <- split_table$train_set
  test_tmp <- split_table$test_set
  print(current_name)
  print(train_tmp)
  # either here or earlier turn DF into DT, but keep row.names
  setDT(train_tmp, keep.rownames = "Sample")
  print(train_tmp) # gets ignored in the first iteration?
  # The row names are still present for the first iteration with item "aa"
  cols <- names(train_tmp)[sapply(train_tmp, is.numeric)]
  # this is the function to copy each row two times, add 10 % noise to each numeric column
  # and append the Sample name with the copy number
  noised_copies <- lapply(c(1,2), function(n) {
    copy(train_tmp)[,
      # here we get the error as we need the column "Sample" to adjust the names of the replicated rows
      c("Sample", cols) := c(.(paste(Sample, n, sep=".")),
                             .SD * sample(c(-1.01, 1.01), .N*ncol(.SD), TRUE)),
      .SDcols=cols]
  })
  # combine original table and table with replicates
  train_noised <- rbindlist(c(noised_copies, list(train_tmp)), use.names = FALSE)
  # turn back into DF and add to result list
  setDF(train_noised, rownames = train_noised$Sample)
  train_noised$Sample <- NULL
  result_list[[current_name]] <- list(train_set = train_noised, test_set = test_tmp)
}
result_list
# it is important to have a clean workspace after each try
rm(top_list)
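A likely reason the loop above behaves differently after the first iteration is that setDT() converts its argument by reference, and aa, bb and cc were all built from the same train_set object, so converting one can affect the others. A minimal sketch that sidesteps this, assuming one copy per table is acceptable, uses as.data.table() inside lapply and touches only the train_set element. The helper make_df and the shortened data are hypothetical stand-ins for the tables in the question.

```r
library(data.table)

# Hypothetical helper: builds a small data frame with "Observation_i" row names
make_df <- function(value, n, label) {
  df <- data.frame(x = rep(value, n), y = seq_len(n) - 1L, z = rep(label, n))
  row.names(df) <- paste("Observation", seq_len(n), sep = "_")
  df
}
train_set <- make_df(2, 3, "Factor1")
test_set <- make_df(1, 4, "Factor2")

top_list <- list(
  aa = list(train_set = train_set, test_set = test_set),
  bb = list(train_set = train_set, test_set = test_set)
)

# Replace only train_set in each inner list; as.data.table() returns a new
# object, so the shared original data frame is not modified by reference
converted <- lapply(top_list, function(split_table) {
  split_table$train_set <- as.data.table(split_table$train_set,
                                         keep.rownames = "Sample")
  split_table
})

print(names(converted$bb$train_set))
```

The result has the same outer and inner names as top_list, with test_set left as a plain data.frame, so the noise step can then run on each converted$...$train_set.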
[Comments]:
-
I would suggest using one large data.table instead of a list of data.tables.
rbindlist(unlist(top_list, recursive=FALSE), use.names=FALSE, idcol="Table")
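A rough sketch of how that single-table idea could be combined with the row-name requirement is given below. It assumes the row names are first moved into a "Sample" column, since data.tables do not carry row names through rbindlist(). The helper make_df and the shortened data are hypothetical; use.names = TRUE is used here instead of the FALSE in the comment, which gives the same result when all tables share the same column order.

```r
library(data.table)

# Hypothetical helper: builds a small data frame with "Observation_i" row names
make_df <- function(value, n, label) {
  df <- data.frame(x = rep(value, n), y = seq_len(n) - 1L, z = rep(label, n))
  row.names(df) <- paste("Observation", seq_len(n), sep = "_")
  df
}
top_list <- list(
  aa = list(train_set = make_df(2, 3, "Factor1"), test_set = make_df(1, 4, "Factor2")),
  bb = list(train_set = make_df(2, 3, "Factor1"), test_set = make_df(1, 4, "Factor2"))
)

# Flatten one level: element names become "aa.train_set", "aa.test_set", ...
flat <- unlist(top_list, recursive = FALSE)

# Move the row names into a "Sample" column before stacking
flat <- lapply(flat, as.data.table, keep.rownames = "Sample")

# One long table; the "Table" column records which list element a row came from
big <- rbindlist(flat, use.names = TRUE, idcol = "Table")

# Rows belonging to any train_set
train_rows <- big[endsWith(Table, "train_set")]
print(unique(train_rows$Table))
```

With everything in one table, the noise step can be applied once to the training rows instead of once per list element.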
Tags: r list data.table