Loop over left joins: answers

【Question title】: Loop over left joins
【Posted】: 2021-01-20 19:59:05
【Question description】:

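I have been trying to loop over left joins (using R). I need to create a table whose columns represent samples drawn from a larger table. Each column of the new table should correspond to one of those samples.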

library(dplyr)  # %>%, group_by(), sample_n(), count() and left_join() all come from dplyr

largetable <- data.frame(PlotCode=c(rep("Plot1",20),rep("Plot2",20)),
                         Category=c(rep("A",8),rep("B",8),rep("C",4),rep("A",12),rep("B",4),rep("C",4)))
                         
a <- data.frame(PlotCode=c("Plot1","Plot1","Plot2","Plot2"),
                Category=c("A","B","A","B"))

##example of code to loop over 100 left joins derived from samples of two elements from a large table. It fails to create the columns.
for (i in 1:100){
  count <- largetable %>% group_by(PlotCode) %>% sample_n(2, replace = TRUE)%>%
    count(PlotCode,Category)
  colnames(count)[3] <- paste0("n",i)
  b <- left_join(a, count, by = c("PlotCode","Category"))
}

##example of desired output table. Columns n1 to n100 should change depending on the samples.
b <- data.frame(PlotCode=c("Plot1","Plot1","Plot2","Plot2"),
                Category=c("A","B","A","B"),
                n1=c(2,1,0,1),
                n2=c(1,1,1,1),
                n3=c(2,0,1,2))

How can I loop over the left joins so that each column corresponds to a different sample?

【Comments】:

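Can you share a small example of largetable (similar to a) and show what the expected output should look like?
Thanks Ronak. I have modified the script to show an example of "largetable" and the expected output table "b". Columns n1 to n100 in the output table should change with each sampling from "largetable".
Please clarify via edits, not comments. What does that code have to do with your question? "Any ideas" is not a valid question here. See How to Ask.
Please, in code questions, give a minimal reproducible example: cut & paste & runnable code, including the smallest representative example input as code; desired and actual output (including verbatim error messages); tags and versions; a clear specification and explanation. Give the least code you can, that is, code that you show is OK extended by code that you show is not OK. (Debugging fundamental.) For SQL, that includes the DBMS and DDL (including constraints and indexes), and input as code formatted as a table. See How to Ask. Pause work on the overall goal, chop the code down to the first expression that does not give what you expect, and say what you expect and why.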

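Tags: r loops join left-join subsampling

【Solution 1】:

Instead of a for loop, we can use rerun/replicate to repeat a process n times.

In each iteration we randomly select 2 rows from each PlotCode and count their Category values. This gives a list of n data frames, which can be joined together with reduce. The count columns are then renamed as you prefer, and NA is replaced with 0.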

library(dplyr)
library(purrr)

n <- 10

rerun(n, largetable %>% 
  group_by(PlotCode) %>% 
  slice_sample(n = 2, replace = TRUE) %>%
  count(PlotCode,Category)) %>%
  reduce(full_join, by = c('PlotCode', 'Category')) %>%
  rename_with(~paste0('n', seq_along(.)), starts_with('n')) %>%
  mutate(across(starts_with('n'), tidyr::replace_na, 0))

#  PlotCode Category    n1    n2    n3    n4    n5    n6    n7    n8    n9   n10
#  <chr>    <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Plot1    A            1     0     2     2     0     1     0     1     2     2
#2 Plot1    B            1     0     0     0     1     1     2     1     0     0
#3 Plot2    B            1     0     0     0     1     0     0     0     0     0
#4 Plot2    C            1     2     0     0     0     0     1     1     0     0
#5 Plot1    C            0     2     0     0     1     0     0     0     0     0
#6 Plot2    A            0     0     2     2     1     2     1     1     2     2
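
In recent purrr releases rerun() is marked as deprecated, and newer dplyr versions warn when extra arguments are passed through across(). The following is a minimal sketch of the same idea written with map() instead, not the answerer's original code. The helper name sample_counts and the fixed seed are assumptions made for illustration; naming each count column through count(name = ...) removes the need for the rename step.

```r
library(dplyr)
library(purrr)

set.seed(42)  # assumption: fixed seed, only to make the sketch reproducible

largetable <- data.frame(
  PlotCode = c(rep("Plot1", 20), rep("Plot2", 20)),
  Category = c(rep("A", 8), rep("B", 8), rep("C", 4),
               rep("A", 12), rep("B", 4), rep("C", 4))
)

n <- 10

# Hypothetical helper: draw 2 rows per plot and count categories,
# storing the counts in a column called n<i>.
sample_counts <- function(i) {
  largetable %>%
    group_by(PlotCode) %>%
    slice_sample(n = 2, replace = TRUE) %>%
    ungroup() %>%
    count(PlotCode, Category, name = paste0("n", i))
}

result <- map(seq_len(n), sample_counts) %>%
  reduce(full_join, by = c("PlotCode", "Category")) %>%
  # Combinations that were not drawn in a given sample come back as NA
  mutate(across(starts_with("n"), ~ coalesce(.x, 0L)))

result
```

Because full_join is used, the result contains every PlotCode/Category combination that was drawn at least once. To restrict it to the combinations listed in a, a left_join from a onto result would be one option.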

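【Discussion】:

Thanks a lot, Ronak! This code solved my problem. To use the slice_sample function I had to install dplyr (1.0.0) on a recent R version (4.0.2). After that, the code ran without any problems.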
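
For readers who would rather keep the for loop from the question, the likely reason it never accumulates columns is that every iteration joins onto a again, so b only ever holds the most recent sample. The sketch below is one possible fix under that reading, not part of the accepted answer: it starts b as a copy of a and joins each new sample onto b. The names n_samples and counts, and the fixed seed, are assumptions made for illustration.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible

largetable <- data.frame(
  PlotCode = c(rep("Plot1", 20), rep("Plot2", 20)),
  Category = c(rep("A", 8), rep("B", 8), rep("C", 4),
               rep("A", 12), rep("B", 4), rep("C", 4))
)
a <- data.frame(PlotCode = c("Plot1", "Plot1", "Plot2", "Plot2"),
                Category = c("A", "B", "A", "B"))

n_samples <- 100
b <- a  # start from the key table so that columns accumulate across iterations

for (i in seq_len(n_samples)) {
  counts <- largetable %>%
    group_by(PlotCode) %>%
    sample_n(2, replace = TRUE) %>%
    ungroup() %>%
    count(PlotCode, Category, name = paste0("n", i))
  # Join onto b (not a), so earlier sample columns are kept
  b <- left_join(b, counts, by = c("PlotCode", "Category"))
}

# Combinations that were not drawn in a given sample are NA; set them to 0
b[is.na(b)] <- 0
```

Growing a data frame inside a loop is fine for 100 columns, but for many more samples the list-then-reduce approach in the answer tends to be tidier.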