Using purrr to iterate over two lists and then pipe into dplyr::filter: Answers

【Question title】: Using purrr to iterate over two lists and then pipe into dplyr::filter
【Posted】: 2018-01-07 22:39:34
【Question description】:

library(tidyverse)
library(purrr)

Using the sample data below, I can create the following function:

Funs <- function(DF, One, Two){

    One <- enquo(One)
    Two <- enquo(Two)

    DF %>% filter(School == (!!One) & Code == (!!Two)) %>%
        group_by(Code, School) %>%
        summarise(Count = sum(Question1))
}

I can then use the function to filter on two variables, School and Code, like this:

Funs(DF, "School1", "B344")

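This works fine, but my actual data has many variables, so I don't want to keep typing the School and Code values into the function. Instead I would like to use the tidyverse and purrr packages to loop over two lists (one of schools, one of codes) and feed them into the filter. I would like the output to be a list of results.

For simplicity, each of the two lists fed into dplyr::filter has only two values: School2 is paired with S300, and School1 is paired with B344, as in the example above.

Some examples of what I have tried: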

map2(c(“School2”, ”School1”),
     c(“S300”, ”B344”),
     function(x,y) {
         DF %>% filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1))
     }

And...

map2(c("School2", "School1")),
     c("S300","B344"),
     ~filter(School == .x & Code == .y) %>%
         group_by(Code, School)%>%
         summarise(Count = sum(Question1))

And this...

list(c("School2", "School1"), c("S300", "B344")) %>%
    map2( ~ filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1)))

None of these seem to work, so any help would be appreciated!

Sample data:

Code <- c("B344","B555","S300","T220","B888","B888","B555","B344","B344","T220","B555","B555","S300","B555","S300","S300","S300","S300","B344","B344","B888","B888","B888")
School <- c("School1","School1","School2","School3","School4","School4","School1","School1","School3","School3","School4","School1","School1","School3","School2","School2","School4","School2","School3","School4","School3","School1","School2")
Question1 <- c(3,4,5,4,5,5,5,4,5,3,4,5,4,5,4,3,3,3,4,5,4,3,3)
Question2 <- c(5,4,3,4,3,5,4,3,2,3,4,5,4,5,4,3,4,4,5,4,3,3,4)
DF <- data_frame(Code, School, Question1, Question2)

【Comments】:

You could do something like map2(c("School2", "School1"), c("S300", "B344"), ~DF %>% filter(School == .x, Code == .y) %>% group_by(Code, School) %>% summarise(Count = sum(Question1))), but that seems rather pointless; something like DF %>% filter(paste(School, Code) %in% paste(c("School2", "School1"), c("S300", "B344"))) %>% group_by(Code, School) %>% summarise(Count = sum(Question1)) is easier.
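It looks like your first suggestion is what I wanted. I now realize that using map2_df would probably be better. Also, I would be better off creating two lists, e.g. list1 [...] %>% filter(School == .x, Code == .y) %>% group_by(Code, School) %>% summarise(Count = sum(Question1)))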
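If you post that as an official answer, I can accept your first reply. Also, I'm curious why you think the second suggestion is better? The output may be nicer, but I could use map2_df...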

Tags: r dplyr tidyverse purrr rlang

【Solution 1】:

Here are some options, ordered from most similar to your code to most optimized:

library(tidyverse)

DF <- data_frame(Code = c("B344", "B555", "S300", "T220", "B888", "B888", "B555", "B344", "B344", "T220", "B555", "B555", "S300", "B555", "S300", "S300", "S300", "S300", "B344", "B344", "B888", "B888", "B888"), 
                 School = c("School1", "School1", "School2", "School3", "School4", "School4", "School1", "School1", "School3", "School3", "School4", "School1", "School1", "School3", "School2", "School2", "School4", "School2", "School3", "School4", "School3", "School1", "School2"), 
                 Question1 = c(3, 4, 5, 4, 5, 5, 5, 4, 5, 3, 4, 5, 4, 5, 4, 3, 3, 3, 4, 5, 4, 3, 3), 
                 Question2 = c(5, 4, 3, 4, 3, 5, 4, 3, 2, 3, 4, 5, 4, 5, 4, 3, 4, 4, 5, 4, 3, 3, 4))

wanted <- data_frame(School = c("School2", "School1"),
                     Code = c("S300", "B344"))

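For map2 to work, if you use the tilde (formula) notation, the variables are named .x and .y; if you use regular function notation, you can call them whatever you like. Don't forget that the first argument of filter is the data frame, which must either be piped in or passed explicitly, so: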

map2_dfr(wanted$School, wanted$Code, ~filter(DF, School == .x, Code == .y)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
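
As a hedged illustration of the naming rule above, the sketch below runs the same map2 call in both notations. The small DF here is a made-up subset of the sample data, and the argument names school and code are arbitrary choices for illustration, not part of the original post. It also suggests why an attempt that declares function(x, y) but refers to .x and .y in the body cannot work: those names only exist in the formula notation.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy subset of the sample data (an assumption, kept small for readability)
DF <- tibble(
  Code      = c("B344", "B555", "S300", "B344", "S300", "B344"),
  School    = c("School1", "School1", "School2", "School1", "School2", "School3"),
  Question1 = c(3, 4, 5, 4, 4, 5)
)

schools <- c("School2", "School1")
codes   <- c("S300", "B344")

# Formula (tilde) notation: the two inputs are available as .x and .y
by_formula <- map2(schools, codes, ~ filter(DF, School == .x, Code == .y))

# Regular function notation: the names are your own, and the body must use them
by_function <- map2(schools, codes, function(school, code) {
  filter(DF, School == school, Code == code)
})

# Both calls return a list with one filtered data frame per (school, code) pair
all.equal(by_formula, by_function)
```

Plain map2 returns a list, which matches the "list of results" asked for in the question; map2_dfr row-binds the pieces into a single data frame instead.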

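Since I set up wanted as a data frame (a plain list would work too), you can use pmap instead. With two variables, the parameter names in pmap can actually be the same as in map2, but it is really a function with ... parameters, so it often makes sense to handle them differently, e.g. with the ..1 notation: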

wanted %>% 
    pmap_dfr(~filter(DF, School == ..1, Code == ..2)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
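
Another way to handle the pmap arguments, sketched here under some assumptions, is to give the function named parameters: when the input list (or data frame) is named, pmap matches its elements to the function's arguments by name. Because the parameter names then collide with the column names of DF, the sketch uses the .env pronoun (this assumes dplyr 1.0.0 or later) to say which School and Code are meant. The toy DF is a made-up subset of the sample data.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy subset of the sample data (an assumption, kept small for readability)
DF <- tibble(
  Code      = c("B344", "B555", "S300", "B344", "S300", "B344"),
  School    = c("School1", "School1", "School2", "School1", "School2", "School3"),
  Question1 = c(3, 4, 5, 4, 4, 5)
)

wanted <- tibble(School = c("School2", "School1"),
                 Code   = c("S300", "B344"))

# pmap passes each row of `wanted` as named arguments School = ..., Code = ...
results <- pmap(wanted, function(School, Code) {
  DF %>%
    # Bare names refer to the columns of DF; .env$ refers to the function arguments
    filter(School == .env$School, Code == .env$Code) %>%
    summarise(Count = sum(Question1))
})

results  # a list with one single-row summary per row of `wanted`
```

Without .env (or an equivalent such as unquoting with !!), the condition School == School would compare the column with itself and keep every row.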

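The problem with both of the techniques above is that they will be slow at scale, because they run filter once for each row of wanted, which means every row of DF is re-tested many times. To keep the code similar, a slightly kludgy way to avoid the extra work is to combine the columns into one, e.g. with tidyr::unite: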

DF %>% 
    unite(school_code, School, Code) %>% 
    filter(school_code %in% paste(wanted$School, wanted$Code, sep = '_')) %>%    # or do.call(paste, c(wanted, sep = '_')); purrr::invoke() is deprecated as of purrr 1.0.0
    separate(school_code, c('School', 'Code')) %>%
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

...or just combine them within filter itself:

DF %>% 
    filter(paste(School, Code) %in% paste(wanted$School, wanted$Code)) %>%    # or do.call(paste, wanted)
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
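
One reason the pasted-key approach is a little kludgy, shown here as a small base R sketch with made-up values: if the separator can also occur inside the values, two different (School, Code) pairs can collapse into the same key. The strings and the "|" separator below are illustrative assumptions only.

```r
# Two different (School, Code) pairs ...
key_a <- paste("School 1", "A")   # default separator is a single space
key_b <- paste("School", "1 A")

# ... end up with the same combined key, so a filter on it could not tell them apart
collide <- identical(key_a, key_b)
collide

# A separator that is assumed never to occur in either column keeps the keys distinct
safe_a <- paste("School 1", "A", sep = "|")
safe_b <- paste("School", "1 A", sep = "|")
distinct <- !identical(safe_a, safe_b)
distinct
```

With the sample data in this thread the values contain no spaces or underscores, so both versions above are safe there.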

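Now that wanted is set up as a data frame, though, it may be more obvious that the best way to get the desired result is a join, which is designed to do exactly this job: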

DF %>% 
    inner_join(wanted) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> Joining, by = c("Code", "School")
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
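
A closely related variant, offered as a sketch rather than as part of the original answer, is semi_join, dplyr's filtering join: it keeps the rows of DF that have a match in wanted without adding columns or duplicating rows. Passing by explicitly also avoids the "Joining, by = ..." message. The toy DF is a made-up subset of the sample data, and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tibble)

# Toy subset of the sample data (an assumption, kept small for readability)
DF <- tibble(
  Code      = c("B344", "B555", "S300", "B344", "S300", "B344"),
  School    = c("School1", "School1", "School2", "School1", "School2", "School3"),
  Question1 = c(3, 4, 5, 4, 4, 5)
)

wanted <- tibble(School = c("School2", "School1"),
                 Code   = c("S300", "B344"))

out <- DF %>%
  # Keep only rows whose (School, Code) pair appears in `wanted`
  semi_join(wanted, by = c("School", "Code")) %>%
  group_by(School, Code) %>%
  summarise(Count = sum(Question1), .groups = "drop")

out
```

In this example inner_join and semi_join give the same rows, because wanted contains each pair once and has no extra columns; they would differ if wanted had duplicate pairs.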

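【Comments】:

I just copy/pasted this answer into an R Markdown notebook so I could run and understand each step (since I have a similar problem). When I got to the end and saw inner_join, I couldn't believe how simple and elegant the solution to my similar problem should be!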