Check whether the first three columns of every row contain a substring

【Question title】: Check if the first three columns of every row has a substring
【Posted】: 2016-12-20 13:01:46
【Question description】:

I have a data set like the one below.

Proc1   Proc2   Proc3  Count
AAZ      BLA     C       5
D        AAZ     E       7
A        G       F       1
T        X       Y       10

I also have another vector, shown below.

Procs <- c("A", "B")

I want to keep only the rows whose first 3 columns contain both A and B. The output I want is shown below.

Proc1   Proc2   Proc3   Count
AAZ     BLA     C       5

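Please let me know if there is a good way to do this. I tried using %like% inside an apply function but could not get the desired result.

【Question discussion】:

Tags: r substring apply

【Solution 1】: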

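Here is one approach using sapply together with rowSums and grepl. Two separate calls to grepl check whether "A" and "B" are present. sapply runs each check over the first three columns of the data.frame and returns a logical matrix. rowSums sums each of these logical matrices by row. The two results are multiplied, so the product is zero whenever "A" or "B" is missing from a row. Finally, the product is tested for being greater than 0.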

keepers <- rowSums(sapply(df[1:3], function(x) grepl("A", x))) * 
           rowSums(sapply(df[1:3], function(x) grepl("B", x))) > 0

df[keepers,]
  Proc1 Proc2 Proc3 Count
1   AAZ   BLA     C     5

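It is possible to make this more dynamic, although it gets messy. You can wrap the rowSums call in sapply and give that sapply the vector of patterns. This returns a matrix of row sums with one column per pattern. You can then use apply to take the prod of each row and check which results are positive.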

keepers <- apply(sapply(c("A", "B"),
                        function(i) rowSums(sapply(df[1:3], function(x) grepl(i, x)))),
                 1, prod) > 0

keepers
[1]  TRUE FALSE FALSE FALSE
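
If the nested sapply calls are hard to read, the same idea can be sketched as a small helper that takes the pattern vector as an argument. This is only an illustration: the name rows_with_all and its arguments are made up here, the data frame is rebuilt from the table in the question, and fixed = TRUE assumes the patterns are literal substrings rather than regular expressions.

```r
# Data frame rebuilt from the table in the question (assumed to be character columns)
df <- data.frame(
  Proc1 = c("AAZ", "D", "A", "T"),
  Proc2 = c("BLA", "AAZ", "G", "X"),
  Proc3 = c("C", "E", "F", "Y"),
  Count = c(5, 7, 1, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for rows in which every pattern occurs
# in at least one of the selected columns
rows_with_all <- function(data, cols, patterns) {
  keep <- rep(TRUE, nrow(data))
  for (p in patterns) {
    # TRUE where any selected column of the row contains the literal string p
    found <- Reduce(`|`, lapply(data[cols], function(x) grepl(p, x, fixed = TRUE)))
    keep <- keep & found
  }
  keep
}

Procs <- c("A", "B")  # could equally come from user input
keepers <- rows_with_all(df, 1:3, Procs)
df[keepers, ]
```

Because the patterns are looped over, the vector can have any length, and each pattern may be found in a different column of the row.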

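【Discussion】:

Also, in grepl I need to pass a vector rather than one specific pattern, because the patterns will be supplied by the user. This does not work in that case.

【Solution 2】: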

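We loop over the 'Proc' columns and check whether each element contains both 'A' and 'B', which returns a list of logical vectors. Reduce then combines these vectors element by element with |, so a row is TRUE if any of its elements matches, and the result is used to subset the rows of the data set.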

pat <- paste(paste(Procs, collapse=".*"), paste(rev(Procs), collapse=".*"), sep="|")
df1[Reduce(`|`, lapply(df1[grep("Proc", names(df1))], grepl, pattern = pat)),]
#  Proc1 Proc2 Proc3 Count
#1   AAZ   BLA     C     5
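
To see what the lapply call is actually matching against, it may help to look at the pattern itself. The sketch below only re-runs the paste call from above on the two-element Procs vector and tries it on the three cells of the first row; the name cell_hits is introduced here for illustration.

```r
Procs <- c("A", "B")

# Same construction as in the answer: forward order, then reversed order
pat <- paste(paste(Procs, collapse = ".*"),
             paste(rev(Procs), collapse = ".*"),
             sep = "|")
pat
# [1] "A.*B|B.*A"

# Each cell is tested on its own, so both letters must occur in the same cell
cell_hits <- grepl(pat, c("AAZ", "BLA", "C"))
cell_hits
# [1] FALSE  TRUE FALSE
```

Note that only the forward and the reversed order are generated. For two patterns that covers every ordering, but for a longer vector such as c("A", "B", "C") the result is "A.*B.*C|C.*B.*A", which misses orderings like "B.*A.*C".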

Another option is to paste the elements of each row together and run a single grep.

pat <- paste(paste(Procs, collapse="[^,]*"), paste(rev(Procs), collapse="[^,]*"), sep="|")
df1[grep(pat, do.call(paste, c(df1[grep("Proc", names(df1))], sep=","))),]
#  Proc1 Proc2 Proc3 Count
#1   AAZ   BLA     C     5

Data

Procs <- c("A", "B")
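
The definition of df1 is not shown in this answer. A minimal reconstruction, assuming it is simply the table from the question with character columns, could look like this:

```r
# Assumed reconstruction of df1 from the table in the question;
# the answer's own definition is not included above
df1 <- data.frame(
  Proc1 = c("AAZ", "D", "A", "T"),
  Proc2 = c("BLA", "AAZ", "G", "X"),
  Proc3 = c("C", "E", "F", "Y"),
  Count = c(5, 7, 1, 10),
  stringsAsFactors = FALSE
)

# The columns that grep("Proc", names(df1)) picks up
proc_cols <- grep("Proc", names(df1))
```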

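【Discussion】:

Thanks for your input. But the patterns will not be fixed. They will be taken from the user as a vector, for example: Procs
@Jishu Updated the post to make it more dynamic.
@Jishu In fact, lapply is faster and more efficient than converting to a logical matrix (which consumes memory).

【Solution 3】: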

Procs <- c("A", "B")

# unite all the columns you are interested to search in. Thanks to @DavidArenburg for the improvements
xxx = do.call(paste0, df[1:3])
#> xxx
#[1] "AAZBLAC" "DAAZE"   "AGF"     "TXY"   

# now iterate through the above vector and apply grepl, if the totalSum matches the 
# length of Procs - it means all characters in the Procs were present in the value of xxx

ind <- which(rowSums(sapply(Procs, grepl, xxx, fixed = TRUE)) == length(Procs))
df[ind,]
#   Proc1 Proc2 Proc3 Count
#1:   AAZ   BLA     C     5
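
One thing to keep in mind with paste0: the columns are joined without a separator, so a pattern longer than one character could match across a column boundary. The sketch below uses a made-up one-row data frame (toy, not from the question) to show the difference a separator makes; it assumes the separator, here a comma, never occurs in the data or in the patterns.

```r
# Hypothetical example, not taken from the question's data
toy <- data.frame(Proc1 = "XA", Proc2 = "BY", stringsAsFactors = FALSE)

joined_plain <- do.call(paste0, toy)               # "XABY"
joined_sep   <- do.call(paste, c(toy, sep = ","))  # "XA,BY"

# "AB" is in neither column, but appears once the columns are glued together
hit_plain <- grepl("AB", joined_plain, fixed = TRUE)  # TRUE
hit_sep   <- grepl("AB", joined_sep, fixed = TRUE)    # FALSE
```

For single-character patterns such as "A" and "B" this makes no difference, so the answer above is fine for the data in the question.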

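【Discussion】:

@DavidArenburg Is this a similar approach?
Not sure why you are asking me, but I don't think the lapply/data.frame combination is necessary; just use sapply instead. You could also add fixed = TRUE for a bit of a performance gain. Something like ind <- rowSums(sapply(Procs, grepl, xxx, fixed = TRUE)) == length(Procs). Also, xxx could simply be xxx <- do.call(paste0, df[1:3]) (no packages needed).
That is exactly why I asked you: you always suggest valid improvements to my answers. Thanks again.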