R sample until a condition is met

[Question title]: R sample until a condition is met
[Posted]: 2019-12-17 16:58:10
[Question]:

I have the following data frame:

structure(list(V1 = c(45L, 17L, 28L, 26L, 18L, 41L, 26L, 20L, 
23L, 31L, 48L, 23L, 32L, 18L, 30L, 11L, 26L)), .Names = "V1", row.names = c("24410", 
"26526", "26527", "43264", "63594", "125630", "148318", "245516", 
"269500", "293171", "301217", "400294", "401765", "520084", "545501", 
"564914", "742654"), class = "data.frame")

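The row names identify parcels, and V1 gives the number of examples available for extraction from each parcel. I want to sample from each parcel in proportion to its number of available examples, so that I end up with 400 examples in total across all parcels. The idea is to avoid oversampling one parcel at the expense of the others.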
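The proportional split described above can be worked out before any rows are drawn. The following is a minimal sketch in base R, not part of the original post: it assumes the data frame above is stored under the hypothetical name `parcels`, and it uses largest-remainder rounding (an assumption, since the question does not say how fractions should be rounded) so that the per-parcel quotas add up to exactly 400.

```r
# Hypothetical name: `parcels` holds the data frame shown above
parcels <- data.frame(
  V1 = c(45L, 17L, 28L, 26L, 18L, 41L, 26L, 20L, 23L,
         31L, 48L, 23L, 32L, 18L, 30L, 11L, 26L),
  row.names = c("24410", "26526", "26527", "43264", "63594", "125630",
                "148318", "245516", "269500", "293171", "301217", "400294",
                "401765", "520084", "545501", "564914", "742654")
)
n <- 400  # total number of examples wanted across all parcels

# Exact (fractional) share of the n examples for each parcel
share <- n * parcels$V1 / sum(parcels$V1)

# Largest-remainder rounding (assumed rule): round every share down, then
# give the leftover examples to the parcels with the largest fractional parts
quota <- floor(share)
leftover <- n - sum(quota)
bump <- order(share - quota, decreasing = TRUE)[seq_len(leftover)]
quota[bump] <- quota[bump] + 1

parcels$quota <- quota
print(parcels)
```

Because 400 is smaller than the total number of available examples, no quota in this sketch exceeds what its parcel can supply, so each parcel could then be sampled without replacement.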

The dataset being sampled from is here.

So far the code looks like this:

df <- read.csv('/data/samplefrom.csv')
df.training <- data.frame()
n <- 400  # number of examples wanted per crop + bbch_stage class

for (crop in sort(unique(df$code_surveyed))) {
  for (bbch_stage in sort(unique(df$bbch))) {
    # Rows of this class, keeping only those that have a name
    df.int <- df[df$bbch == bbch_stage & df$code_surveyed == crop, ]
    df.int <- df.int[!is.na(df.int$name), ]
    # Only classes with at least n usable rows are sampled
    if (nrow(df.int) >= n) {
      df.int.sampled <- df.int[sample(nrow(df.int), n), ]
      df.training <- rbind(df.training, df.int.sampled)
    }
  }
}

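This draws 400 random samples for each crop + bbch_stage combination (think of the combination as a single compound variable). That works fine, but I would like to be able to control which parcel (variable objectid) the examples come from. Essentially, I need an extra filtering step during sampling.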

I have made several attempts with while and repeat statements, and with the stratified function from devtools, but none of them seems to produce what I am after.

[Comments]:

400 examples drawn from all the parcels for each class. The class in question is the crop + bbch_stage combination. It becomes clearer if you run the snippet with samplefrom.csv.

Tags: r sample

[Solution 1]:

After a few twists and turns, this is where I ended up:

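df.training <- data.frame()
for (crop in unique(df$code)) {
  df.crop.selected <- df[df$code == crop, ]
  df.crop.selected.sampled <- data.frame()
  # Cap the target at the number of distinct examples so the loop always ends
  target <- min(400, length(unique(df.crop.selected$name)))
  while (nrow(df.crop.selected.sampled) < target) {
    # One pass: try to draw one new example from each parcel in turn
    for (parcel in unique(df.crop.selected$objectid)) {
      # Stop mid-pass once the target is reached, to avoid overshooting it
      if (nrow(df.crop.selected.sampled) >= target) break
      df.crop.selected.parcel <- df.crop.selected[df.crop.selected$objectid == parcel, ]
      df.crop.selected.parcel <- df.crop.selected.parcel[sample(nrow(df.crop.selected.parcel), 1), ]
      # Keep the draw only if this example has not been picked already
      if (!df.crop.selected.parcel$name %in% df.crop.selected.sampled$name) {
        df.crop.selected.sampled <- rbind(df.crop.selected.sampled, df.crop.selected.parcel)
      }
    }
  }
  df.training <- rbind(df.training, df.crop.selected.sampled)
}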

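It is certainly not the most elegant solution, but it does the job. If anyone can point me to a stratified sampling function that achieves this more simply, I would still be very grateful.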
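The round-robin idea in the loop above (take one example from each parcel in turn until enough are collected) can probably also be written without looping over parcels. The following is only a sketch in base R, not the stratified function asked about: `sample_round_robin` and the toy data are made up for illustration, and the column name `objectid` is taken from the code above. It assumes `objectid` has no missing values. Unlike the loop, it never redraws an already picked row, so it is similar in spirit rather than identical.

```r
# Hypothetical helper: returns up to n rows of d, cycling over parcels so that
# every parcel that still has rows left contributes one row per pass
sample_round_robin <- function(d, n, parcel_col = "objectid") {
  # Random position of each row within its own parcel (1 = drawn in pass 1)
  pass <- ave(seq_len(nrow(d)), d[[parcel_col]],
              FUN = function(i) sample(length(i)))
  # Sort rows by pass number; rows within the same pass are ordered at random
  ord <- order(pass, runif(nrow(d)))
  d[head(ord, n), , drop = FALSE]
}

# Toy data (made up): three parcels with 50, 10 and 5 available examples
set.seed(1)
toy <- data.frame(
  objectid = rep(c("a", "b", "c"), times = c(50, 10, 5)),
  name = paste0("ex", 1:65)
)

picked <- sample_round_robin(toy, 30)
print(table(picked$objectid))
```

With this toy data the small parcels are exhausted first and the largest parcel fills the remainder, so no parcel is oversampled while others still have unused examples. Applied per crop, something like `do.call(rbind, lapply(split(df, df$code), sample_round_robin, n = 400))` should give the same shape of result as `df.training` above, assuming `df` has the columns used in the loop.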

[Comments]: