Calculate a new column per factor level based on two data.frames/data.tables

【Question title】: Calculate new column on factor level based on two data.frames/data.tables
【Posted】: 2018-06-14 13:26:39
【Question】:

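I am trying to calculate the values of a new column in the data.table dt. Part of the calculation comes from the data.frame df (it could also be a data.table; so far I have not needed that).

How can I use values from two different objects to calculate the new column wherever the factor level (here: sample) matches? I used to merge the two objects and work row by row, but that produces a lot of redundant data.

This is the data.frame, with only 10 rows: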

df

    sample scaling_factor
A1      A1      111956565
A2      A2       89869320
A3      A3      120925219
A4      A4      111757559
A5      A5       77319341
A6      A6       89403194
A7      A7      150214981
B8      B8      133885925
B9      B9       86536587
B10    B10      123574939


df <- structure(list(sample = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
9L, 10L, 8L), .Label = c("A1", "A2", "A3", "A4", "A5", "A6", 
"A7", "B10", "B8", "B9"), class = "factor"), scaling_factor = c(111956565.427018, 
89869319.9348599, 120925219.4453, 111757558.886234, 77319340.5841949, 
89403194.1170576, 150214980.784589, 133885925.080984, 86536586.7136393, 
123574939.026597)), .Names = c("sample", "scaling_factor"), class = "data.frame", row.names = c("A1", 
"A2", "A3", "A4", "A5", "A6", "A7", "B8", "B9", "B10"))

This is the data.table, with hundreds of thousands of rows per sample (dput had problems with < in the output, so it is not provided here):

setDT(dt)
    sample     contig_id product_reads_rpk
 1:     A1     contig_10        2000.00000
 2:     A1    contig_100          24.27184
 3:     A1   contig_1000        1713.90374
 4:     A1  contig_10000        2900.66225
 5:     A1 contig_100003        1713.94231
 6:     A1 contig_100004        8575.23511
 7:     A1 contig_100004       11059.32203
 8:     A2 contig_100009        6923.67400
 9:     A2 contig_100010        1285.30259
10:     A2 contig_100015          84.74576

dt[,product_rpm := product_reads_rpk/(df$scaling_factor/1000000), by = sample]

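I am trying to create a new column product_rpm in dt, based on the corresponding value in df for each sample. How do I do that? I get "longer object length is not a multiple of shorter object length", but the shorter object has length 1, e.g. A1 in df, right?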

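【Comments】:

I'm not sure why merge wouldn't work here. After merging, you can create the new column by dividing product_reads_rpk by scaling_factor.
@Noah Well, merging works perfectly, but it creates a lot of redundant data, e.g. 100,000 rows with the same scaling_factor. I was hoping to find a more elegant solution (and to understand in general how to use two different objects together with data.table).
Try dt[setDT(df), product_rpm := product_reads_rpk / (scaling_factor / 1e6), on = .(sample)].
@crazysantaclaus You can't accept comments, only answers.
True, but you could also turn your comment into an answer. I'm just not sure which solution is closer to the original task ("without merging"), since I don't know whether your code contains a hidden merge step?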

Tags: r data.table

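【Solution 1】:

I don't know of a way to do this without actually merging the two datasets, but if you merge them the data.table way, you can avoid creating redundant columns.

So, in your case, it is just: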

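df <- data.table(df)
# Update join on sample. The parentheses around scaling_factor/1000000 matter;
# the answer as first posted omitted them (see the comments below).
dt[df, product_rpm := product_reads_rpk / (scaling_factor / 1000000), on = "sample"]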
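To make the idea concrete, here is a minimal sketch on a cut-down version of the question's data. The values are rounded from the post and only two samples are kept, so treat the objects as stand-ins rather than the real dt and df. It also hints at why the original attempt fails: df$scaling_factor is the whole column, not the one value belonging to the current group.

```r
library(data.table)

# Stand-ins for the question's objects (values rounded from the post)
df <- data.frame(sample = factor(c("A1", "A2")),
                 scaling_factor = c(111956565, 89869320))
dt <- data.table(sample = factor(c("A1", "A1", "A2")),
                 contig_id = c("contig_10", "contig_100", "contig_100009"),
                 product_reads_rpk = c(2000, 24.27184, 6923.674))

# Inside dt[, ..., by = sample], df$scaling_factor is still the full column
# (one value per sample in df), so its length need not match the group size.
length(df$scaling_factor)

# Update join: every row of dt picks up the scaling_factor of its own sample,
# and only product_rpm is added to dt.
setDT(df)
dt[df, product_rpm := product_reads_rpk / (scaling_factor / 1e6), on = "sample"]
print(dt)
```

Note that scaling_factor is only read from df during the join; it is never stored as a column of dt, which is what avoids the redundant data.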

A simple example:

library(data.table)

dt1 <- data.table(id = sample(1000:9999, size = 100),
                  size = sample(10000:99999, size = 100))

dt2 <- data.table(id = rep(dt1$id, 10), 
                  group = rep(LETTERS[1:5], 10),
                  value = sample(1000:9999, size = 100 * 10, replace = TRUE))

# := inside the join adds metric to dt2 by reference; dt3 is the same object as dt2
dt3 <- dt2[dt1, metric := (value / size), on = "id"]
head(dt3)
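As a further check on the "no redundant columns" point, the sketch below uses two tiny made-up tables (scales and reads are hypothetical names, not objects from the question). It shows that the update join adds only the computed column, and that rows with no match in the lookup table end up as NA.

```r
library(data.table)

# Hypothetical lookup table: one scaling factor per sample
scales <- data.table(sample = c("A1", "A2"),
                     scaling_factor = c(2e6, 4e6))

# Hypothetical long table; sample "C3" has no entry in scales
reads <- data.table(sample = c("A1", "A1", "A2", "C3"),
                    rpk = c(10, 20, 30, 40))

# Update join: rpm is added to reads by reference, scaling_factor is not
reads[scales, rpm := rpk / (scaling_factor / 1e6), on = "sample"]

names(reads)  # "sample" "rpk" "rpm"
reads$rpm     # 5.0 10.0 7.5 NA
```

If unmatched samples should be treated as an error rather than silently left as NA, checking anyNA(reads$rpm) after the join is one simple option.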

【Comments】:

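@David Arenburg: Well, both solutions work fine and give the same result (apart from the missing parentheses around scaling_factor/1000000 in royr2's answer). Can you tell me which one should be the accepted answer, e.g. do both of you include "some kind of merge" step in your code?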