Filter using dplyr based on condition as a loop: answers

【Question title】: Filter using dplyr based on condition as a loop
【Posted】: 2018-01-27 11:52:09
【Question】:

I have two data frames. The first is df1:

    col1
 1  apples
 2 oranges
 3  apples
 4  banana

and the second is df2:

   setID    col1
1      1  apples
2      1 oranges
3      1  apples
4      1  banana
5      2  apples
6      2 oranges
7      2 oranges
8      2  apples
9      3 oranges
10     3  grapes
11     3  banana
12     3  banana
13     4  apples
21     4  oranges
31     4  apples
41     4 oranges

I use filter() from the dplyr package to narrow df2 down to the setID groups that contain df1$col1[1], and store the result in tempdf:

> tempdf <- df2 %>% group_by(setID) %>% filter(any(col1==df1$col1[1]))
> tempdf
# A tibble: 12 x 2
# Groups:   setID [3]
   setID    col1
   <dbl>   <chr>
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana
 5     2  apples
 6     2 banana
 7     2 oranges
 8     2  apples
 9     4  apples
10     4  oranges
11     4  grapes
12     4 oranges

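It has to be grouped by setID because I am trying to see which setID best matches df1. Since setID = 3 contains no apples, it is dropped in the first iteration. I would like this to be a loop, so that tempdf is filtered further on each pass.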

For the next element it should use df1$col1[1:2], so that it checks for "apples" followed by "oranges", in that order. After the second iteration tempdf should look like this:

   setID    col1
   <dbl>   <chr>
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana
 5     4  apples
 6     4  oranges
 7     4  grapes
 8     4 oranges

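The next iteration should check df1$col1[2:3], so it always checks the current element together with the previous one. This time that is "oranges" followed by "apples".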

   setID    col1
   <dbl>   <chr>
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana

As soon as only one setID remains, the loop should stop. The result would be that df1 matches setID = 1 in df2.

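【Comments】:

This question is very unclear. Please first state what you are trying to accomplish and show the desired final result, then let the forum work out how to get you there.

Tags: r dataframe dplyr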

【Solution 1】:

One solution is to select the setID whose col1 has the most position-wise matches with df1$col1:

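library(dplyr)
# Per setID, count positions where col1 equals df1$col1 (assumes each group has nrow(df1) rows)
df2 %>%
  group_by(setID) %>%
  mutate(maxMatch = sum(col1 == df1$col1)) %>%
  ungroup() %>%
  filter(maxMatch == max(maxMatch)) %>%
  select(-maxMatch)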

#Result
# A tibble: 4 x 2
  setID col1   
  <int> <chr>  
1     1 apples 
2     1 oranges
3     1 apples 
4     1 banana

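【Comments】:

I really like this answer. But is there a way to filter df2 by going through the elements of df1 one at a time? Suppose df1 arrives not as a whole but element by element, so you are trying to narrow df2 down to the possible matches in real time. A bit like autocomplete in a search engine.
@Nix The iteration happens over df2, and it is a single pass. Ultimately, in any solution, every element of df2 has to be compared with df1.
@Nix Please feel free to accept an answer so that it helps future users. Have a look here
The reason I asked whether the comparison can be done element by element is that, ideally, only df2 would be available as a complete data frame. df1 would be a subset of df2, mostly an approximate match and rarely an exact one. When I start comparing it with df2, df1 is not yet complete. The more elements are added to df1, the more complete it becomes and the better the match I can find in df2. Sorry if this is too confusing to follow, and thanks for the suggestions.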
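For what it's worth, the element-by-element narrowing described in the question could be sketched roughly as below. This is only an illustration, not part of the answer above: contains_run() and narrow_candidates() are made-up helper names, and the assumption is that "checking the previous element" means the current and previous elements must appear next to each other, in that order, somewhere inside a setID.

```r
library(dplyr)

# Sample data as listed under "Data" in this thread
df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("apples", "oranges", "apples", "banana",
           "apples", "oranges", "oranges", "apples",
           "oranges", "grapes", "banana", "banana",
           "apples", "oranges", "apples", "oranges"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if `pattern` occurs as a consecutive run inside `x`
contains_run <- function(x, pattern) {
  n <- length(pattern)
  if (length(x) < n) return(FALSE)
  starts <- seq_len(length(x) - n + 1)
  any(vapply(starts, function(s) all(x[s:(s + n - 1)] == pattern), logical(1)))
}

# Hypothetical helper: feed the query in one element at a time and drop
# every setID that does not contain the (previous, current) pair
narrow_candidates <- function(df2, query) {
  candidates <- df2
  for (i in seq_along(query)) {
    recent <- query[max(1, i - 1):i]   # current element plus the one before it
    candidates <- candidates %>%
      group_by(setID) %>%
      filter(contains_run(col1, recent)) %>%
      ungroup()
    if (n_distinct(candidates$setID) <= 1) break   # stop once a single setID is left
  }
  candidates
}

tempdf <- narrow_candidates(df2, df1$col1)
unique(tempdf$setID)
```

On the data as listed under "Data", setIDs 2 and 4 both contain "apples, oranges" and "oranges, apples", so the loop only gets down to setID 1 once it reaches the fourth element ("apples" followed by "banana").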

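【Solution 2】:

Here is one way to filter on setID. In this case it returns setID 1 directly, without a loop. The drawback is that it returns nothing unless every item in col1 matches exactly, in the same order. However, it should be much faster than a for-loop approach.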

library(dplyr)

df3 <- df1 %>% summarise(col1 = toString(col1))

df4 <- df2 %>%
  group_by(setID) %>%
  summarise(col1 = toString(col1)) %>%
  semi_join(df3, by = "col1")
df4
# # A tibble: 1 x 2
#   setID col1                           
#   <int> <chr>                          
# 1     1 apples, oranges, apples, banana
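If df1 is only partly known, one possible variation on the same idea is to test whether each collapsed string starts with the collapsed query, instead of requiring equality via semi_join(). This is a sketch, not part of the original answer; `partial` is a made-up stand-in for "the elements seen so far".

```r
library(dplyr)

# Sample data as listed under "Data" in this thread
df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("apples", "oranges", "apples", "banana",
           "apples", "oranges", "oranges", "apples",
           "oranges", "grapes", "banana", "banana",
           "apples", "oranges", "apples", "oranges"),
  stringsAsFactors = FALSE
)

# Assumption for illustration: only the first three elements of df1 are known so far
partial <- head(df1$col1, 3)

prefix_hits <- df2 %>%
  group_by(setID) %>%
  summarise(col1 = toString(col1)) %>%
  # keep the sets whose collapsed string begins with the collapsed partial query
  filter(startsWith(col1, toString(partial)))

prefix_hits$setID
```

With three elements known, setIDs 1 and 4 are still in the running. One caveat: this is a plain string prefix test, so a query ending in "apple" would also match a set that continues with "apples".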

Data

df1 <- read.table(text = "    col1
 1  apples
                  2 oranges
                  3  apples
                  4  banana",
                  header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "   setID    col1
1      1  apples
                  2      1 oranges
                  3      1  apples
                  4      1  banana
                  5      2  apples
                  6      2 oranges
                  7      2 oranges
                  8      2  apples
                  9      3 oranges
                  10     3  grapes
                  11     3  banana
                  12     3  banana
                  13     4  apples
                  21     4  oranges
                  31     4  apples
                  41     4 oranges",
                  header = TRUE, stringsAsFactors = FALSE)

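【Comments】:

Nice idea, but it does not do what I am trying to achieve. I need to check the elements one by one. An exact match is not required; an approximate match is enough.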
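One way to read "approximate, one element at a time" is to re-score every setID each time the query grows and keep whichever sets are currently in the lead. The sketch below assumes that "approximate" means "most position-wise matches so far"; score_sets() and `leaders` are made-up names, and nobody in the thread proposed this exact approach.

```r
library(dplyr)

# Sample data as listed under "Data" in this thread
df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("apples", "oranges", "apples", "banana",
           "apples", "oranges", "oranges", "apples",
           "oranges", "grapes", "banana", "banana",
           "apples", "oranges", "apples", "oranges"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per setID, count how many of the first length(query)
# rows equal the query at the same position
score_sets <- function(df2, query) {
  df2 %>%
    group_by(setID) %>%
    summarise(matches = sum(head(col1, length(query)) == head(query, n())))
}

# Grow the query one element at a time; after each step keep the best-scoring setIDs
leaders <- lapply(seq_along(df1$col1), function(k) {
  scores <- score_sets(df2, df1$col1[seq_len(k)])
  scores$setID[scores$matches == max(scores$matches)]
})

leaders
```

On the listed data the leaders are setIDs 1, 2 and 4 after one and two elements, 1 and 4 after three, and only 1 after all four. Unlike a hard filter, a set is never thrown away for good, so a single mismatch early on does not rule out what later turns out to be the closest match.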