Concatenate varying number of columns by partial match (R)

【Question title】: Concatenate varying number of columns by partial match (R)
【Posted】: 2016-12-15 19:25:21
【Question description】:

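First question on SO, though I have been lurking for a while! I have tried to do my due diligence and am getting closer to an answer.

I have a data frame with 300 columns that I want to combine into about 10 columns, based on matching patterns in the variable names. The raw data output gives me columns named with the main variable name (in the example, "before" and "after") plus a number. In my "real" data there are about 30 copies of each variable.

I want to combine every column whose name contains "before", or "after", and so on. Using data.table's syntax for this kind of "computed" column, I successfully created the variable "new".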

myTable2[, new := paste(before1, before2, sep = "")]

> myTable2
 herenow     before1 before2 before3  after1 after2 after3         new
1: 0.3399679      if     and   where     not   here  blank       ifand
2: 0.8181909     for      in      by through  blank  blank       forin
3: 0.2237681     and   where            mine  yours   ours    andwhere
4: 0.6161998     and   where              ha    hey    hon    andwhere
5: 0.7606252   fifth  eighth     and   where    not   beet fiftheighth
6: 0.5525105     and   where     not    fill           are    andwhere

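But as you can see, this names the columns to combine explicitly. I want to combine flexibly, so that if I have 31 copies of one variable and 86 copies of another, I don't a) have to know that or b) have to type them out. I just want to match on the base variable name (e.g. "before") and combine the columns.

I tried using grep to get to the next level...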

> newvar2 <- paste(grep("before", colnames(myTable2), value = TRUE), collapse = "")
> newvar2
[1] "before1before2before3"

This confirmed to me that I can combine a variable number of values with grep pattern matching.

Next step: how do I combine these two steps so that

new := paste(etc....)

takes the grep step as its argument and combines all columns whose names match the pattern? This is what I want:

 herenow        before_Final    after_Final
1: 0.339967856  ifandwhere      nothereblank
2: 0.818190875  forinby         throughblankblank
3: 0.223768051  andwhere        mineyoursours
4: 0.616199835  andwhere        haheyhon
5: 0.760625218  fiftheighthand  wherenotbeet
6: 0.552510532  andwherenot     fillare

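I'm trying to learn more about vectorization, but if I could even list the variable types I want to combine (e.g. before, after, between) and then perhaps run through them in a loop, that would be great! So something like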

finalVarNames <- c("Before_final", "After_final", "Between_final")
whatToMatch <- c("before", "after", "between")

(loop goes here...)

myTable2[, finalVarNames[i] := paste(grep(whatToMatch[i], myTable2, value = TRUE), collapse = "")]

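I know the syntax isn't right, probably in the second "myTable2" reference before the value argument. This code does successfully create the new variable, but it is blank. How do I get the concatenated group of grep-matched variables into it?

Thanks for any help you can offer!

【Comments】: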

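As a starting point, see do.call(paste, c(sep = "", myTable2[startsWith(names(myTable2), whatToMatch[i])]))
c() doesn't take a sep= argument. The documentation says the only option is "recursive". c() is supposed to build a list, right? I split out startsWith(names(myTable2), whatToMatch[1]) to test it, and it gave me a logical vector saying whether each column name starts with "before" in this case. Then, when I put it in brackets after myTable2, it gave me only the first 3 rows of data, with all variables still intact.
Thanks for the introduction to startsWith. More intuitive than grep, IMO. Correction to the comment above: when I put it in brackets after myTable2, it gave me only rows 2:4 of the data, with all variables still intact. My guess is that it uses the "TRUE" outputs as the index for subsetting.
sep = is a named argument to c, passed through ...; for example, c(sep = "", a = 2, '1 != 2' = TRUE, fac = factor(1)) returns a named "character" vector whose values are the ... arguments and whose "names" are the tags of .... I guess the subsetting you observed happens because you are subsetting a "data.table" rather than a "data.frame" with a "logical" vector. Just to see what c(sep = "", a subset of myTable2), passed to do.call, is doing, try converting your myTable2 to a "data.frame". If you need a "data.table"-specific approach, you could also add the "data.table" tag.
Thank you, @alexis_laz. Trying it as a data.frame did concatenate the correct groups of columns! Now it looks like I have to choose between the data.frame and data.table approaches to assign that column. (I could go either way; I had just heard that fread might be faster for larger files.)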
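
A minimal base R sketch of what this comment thread arrives at, assuming the table is a plain data.frame. The name myDF is hypothetical, the values are copied from the example in the question, and whatToMatch is shortened to the two base names that actually occur there:

```r
# Example data from the question, rebuilt as a plain data.frame (myDF is a hypothetical name)
myDF <- data.frame(
  herenow = c(0.3399679, 0.8181909, 0.2237681, 0.6161998, 0.7606252, 0.5525105),
  before1 = c("if", "for", "and", "and", "fifth", "and"),
  before2 = c("and", "in", "where", "where", "eighth", "where"),
  before3 = c("where", "by", "", "", "and", "not"),
  after1  = c("not", "through", "mine", "ha", "where", "fill"),
  after2  = c("here", "blank", "yours", "hey", "not", ""),
  after3  = c("blank", "blank", "ours", "hon", "beet", "are"),
  stringsAsFactors = FALSE
)

whatToMatch   <- c("before", "after")
finalVarNames <- c("Before_final", "After_final")

for (i in seq_along(whatToMatch)) {
  # Logical vector: TRUE for every column whose name starts with the base name
  keep <- startsWith(names(myDF), whatToMatch[i])
  # On a data.frame, myDF[keep] selects columns. c(sep = "", <columns>) is a list,
  # so do.call effectively runs paste(sep = "", before1, before2, ...)
  myDF[[finalVarNames[i]]] <- do.call(paste, c(sep = "", myDF[keep]))
}

myDF[, finalVarNames]
```

The new column names start with a capital letter, so a case-sensitive startsWith() does not pick them up if the loop is run a second time.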

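Tags: r data.table concatenation multiple-columns

【Solution 1】:

You can use the Reduce function to paste the selected columns together, picking the columns with grep inside the .SD syntax. Here is an example that gets the result using the data.table package: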

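library(stringi); library(data.table)
# whatToMatch is the character vector of base names defined in the question
# For each base name, select the matching columns and paste them together row by row
myTable2[, paste(stri_trans_totitle(whatToMatch), "final", sep = "_") := 
           lapply(whatToMatch, function(wtm) Reduce(function(x,y) paste(x, y, sep = ""), 
                                             .SD[, grep(wtm, names(myTable2)), with = FALSE]))]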

myTable2
#      herenow before1 before2 before3  after1 after2 after3   Before_final       After_final
# 1: 0.3399679      if     and   where     not   here  blank     ifandwhere      nothereblank
# 2: 0.8181909     for      in      by through  blank  blank        forinby throughblankblank
# 3: 0.2237681     and   where            mine  yours   ours       andwhere     mineyoursours
# 4: 0.6161998     and   where              ha    hey    hon       andwhere          haheyhon
# 5: 0.7606252   fifth  eighth     and   where    not   beet fiftheighthand      wherenotbeet
# 6: 0.5525105     and   where     not    fill           are    andwherenot           fillare
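
For comparison, the loop idea from the question can also be written in data.table. This is a sketch under stated assumptions, not the answerer's code: DT and cols are hypothetical names, the data are copied from the question, and only the two base names present in the example are used. grep runs on names(DT) to pick the columns, and do.call(paste, ...) pastes their contents row by row:

```r
library(data.table)

# Example data from the question (DT is a hypothetical name)
DT <- data.table(
  herenow = c(0.3399679, 0.8181909, 0.2237681, 0.6161998, 0.7606252, 0.5525105),
  before1 = c("if", "for", "and", "and", "fifth", "and"),
  before2 = c("and", "in", "where", "where", "eighth", "where"),
  before3 = c("where", "by", "", "", "and", "not"),
  after1  = c("not", "through", "mine", "ha", "where", "fill"),
  after2  = c("here", "blank", "yours", "hey", "not", ""),
  after3  = c("blank", "blank", "ours", "hon", "beet", "are")
)

whatToMatch   <- c("before", "after")
finalVarNames <- c("Before_final", "After_final")

for (i in seq_along(whatToMatch)) {
  # Names of the columns that match the current base name
  cols <- grep(whatToMatch[i], names(DT), value = TRUE)
  # .SD holds only the columns listed in .SDcols; paste them row by row
  DT[, (finalVarNames[i]) := do.call(paste, c(.SD, sep = "")), .SDcols = cols]
}

DT[, ..finalVarNames]
```

The parentheses around finalVarNames[i] make it explicit that the new column's name comes from the variable rather than being taken literally.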

Some benchmarks of do.call versus Reduce:

dim(myTable2)
# [1] 1572864       9

reduce <- function() myTable2[, paste(stri_trans_totitle(whatToMatch[1:2]), "final", sep = "_") := lapply(whatToMatch[1:2], function(wtm) Reduce(function(x,y) paste(x, y, sep = ""), .SD[, grep(wtm, names(myTable2)), with = F]))]    
docall <- function() myTable2[, paste(stri_trans_totitle(whatToMatch[1:2]), "final", sep = "_") := lapply(whatToMatch[1:2], function(wtm) do.call(paste, c(sep = "", .SD[, grep(wtm, names(myTable2)), with = F])))]

microbenchmark::microbenchmark(docall(), reduce(), times = 10)
# Unit: milliseconds
#     expr      min        lq      mean    median        uq       max neval
# docall() 707.7818  722.6037  767.8923  737.6272  852.4909  868.8202    10
# reduce() 999.4925 1009.5146 1026.6200 1020.4637 1046.7073 1067.7479    10

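【Comments】:

I think Reduce(paste, ) is unnecessarily inefficient compared with its equivalent do.call(paste, ), because with Reduce all the intermediate "character" vectors are scanned and copied again and again until the final "character" vector is built.
Reduce works like paste(paste(paste(x, y), z), ...), whereas do.call builds and evaluates a single paste(x, y, z) call. The former has to (1) cache, (2) scan, and (3) copy all the intermediate "character" results, whereas do.call allocates an appropriate buffer once and then concatenates all the elements. Also, given the number of columns mentioned in the Q, the difference is more pronounced with a benchmark like x = rep_len(list(rep_len(letters, 1e5)), 50); identical(Reduce(paste, x), do.call(paste, x)); microbenchmark::microbenchmark(Reduce(paste, x), do.call(paste, x), times = 25)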
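
The equivalence described in these comments can be checked on a small input. The sketch below uses only base R; x, big, and the sizes are illustrative assumptions, and timings vary by machine, so none are shown:

```r
# Three short character vectors standing in for three matched columns
x <- list(c("if", "for"), c("and", "in"), c("where", "by"))

# Reduce folds pairwise, i.e. paste(paste(x1, x2), x3)
nested     <- paste(paste(x[[1]], x[[2]], sep = ""), x[[3]], sep = "")
via_reduce <- Reduce(function(a, b) paste(a, b, sep = ""), x)

# do.call builds a single call, i.e. paste(x1, x2, x3, sep = "")
via_docall <- do.call(paste, c(x, sep = ""))

stopifnot(identical(nested, via_reduce), identical(via_reduce, via_docall))
via_docall
# [1] "ifandwhere" "forinby"

# Larger input, same shape as in the comment above: 50 columns of 1e5 strings
big <- rep_len(list(rep_len(letters, 1e5)), 50)
system.time(r1 <- Reduce(paste, big))   # 49 successive pairwise pastes
system.time(r2 <- do.call(paste, big))  # one paste call over all 50 vectors
stopifnot(identical(r1, r2))
```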
For benchmark input, it's usually best to show the code that generates it rather than just its dimensions.
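
One way such input could be generated reproducibly is sketched below. This is not the answerer's actual data: bench, n, words, and the seed are all hypothetical choices, and n is kept smaller than the table in the benchmark above:

```r
library(data.table)

set.seed(1)                      # fixed seed so the input can be regenerated
n     <- 1e5                     # hypothetical row count
words <- c("if", "and", "where", "not", "here", "blank", "")

# One numeric column plus three "before" and three "after" character columns
bench <- data.table(herenow = runif(n))
for (nm in c(paste0("before", 1:3), paste0("after", 1:3))) {
  # set() adds the column by reference
  set(bench, j = nm, value = sample(words, n, replace = TRUE))
}

dim(bench)
# [1] 100000      7
```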