【Question title】: variable length df subsampling function r
【Posted】: 2020-07-04 21:12:02
【Question】:

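I need to write a function that subsets a df into a variable number n of bins. For example, if n is 2, subsample the df repeatedly in two bins (one draw from the first half, one from the second half). If n is 3, subsample in 3 bins (first third, second third, last third). So far I have been doing this by hand for each value of n, and I know there must be a better way. I would like to write it as a function that takes n as input, but so far I have not been able to get it to work. The code is below.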

library(tidyverse) # dplyr (%>%, mutate, filter) and ggplot2 are used below

# create df
df <- data.frame(year = c(1:46), 
                 sample = seq(from=10,to=30,length.out = 46) + rnorm(46,mean=0,sd=2) )
# real df has some NAs, so we'll add some here
df[c(20,32),2] <- NA

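This df represents 46 years of sampling. I want to pretend that instead of 46 samples I only took 2: one in a random year from the first half (1:23) and one in a random year from the second half (24:46).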

# to subset in 2 groups, say, 200 times
# I'll make a df of elements to sample
samplelist <- data.frame(firstsample = sample(1:(nrow(df)/2),200,replace = T), # first sample in first half of vector (rows 1:23)
                         secondsample = sample((nrow(df)/2 + 1):nrow(df),200, replace = T) )# second sample in second half of vector (rows 24:46)
samplelist <- as.matrix(samplelist)


# start a df to add to
plot_df <- df %>% mutate(first='all',
                               second = 'all',
                               group='full')

# fill the df using the row indices stored in samplelist
for(i in 1:nrow(samplelist)){

  plot_df <- rbind(plot_df,
                          df[samplelist[i,] , ]   %>% 
                            mutate(
                              first = samplelist[i,1],
                              second = samplelist[i,2],
                              group = i
                            )) 
  print(i)
}

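(It would be great if it could skip samples that land on an NA sample year.)

So, if I want to do this for three points instead of two, I repeat the process like this: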

# to subset in 3 groups 200 times
# I'll make a df of elements to sample
samplelist <- data.frame(firstsample = sample(1:round(nrow(df)/3),200,replace = T), # first sample in first 1/3
                         secondsample = sample((round(nrow(df)/3) + 1):round(nrow(df)*(2/3)),200, replace = T),  # second sample in second 1/3
                         thirdsample = sample((round(nrow(df)*(2/3)) + 1):nrow(df), 200, replace=T) # third sample in last 1/3
                         )
samplelist <- as.matrix(samplelist)

# start a df to add to
plot_df <- df %>% mutate(first='all',
                         second = 'all',
                         third = 'all',
                         group='full')

# fill the df using the row indices stored in samplelist
for(i in 1:nrow(samplelist)){

  plot_df <- rbind(plot_df,
                    df[samplelist[i,] , ]   %>% 
                      mutate(
                        first = samplelist[i,1],
                        second = samplelist[i,2],
                        third = samplelist[i,3],
                        group = i
                      )) 
  print(i)
}

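However, I want to do this many times, with up to 20 samples (so 20 bins), so this manual approach is not sustainable. Can you help me write a function that says "pick one sample from each of n bins, x times"?

By the way, this is the plot I make with the full df: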

plot_df %>%
  ggplot(aes(x=year,y=sample)) +

  geom_point(color="grey40") +

  stat_smooth(geom="line",
              method = "lm",
              alpha=.3,
              aes(color=group,
                  group=group),
              se=F,
              show.legend = F) +
  geom_line(color="grey40") +


  geom_smooth(data = plot_df %>% filter(group %in% c("full")),
              method = "lm",
              alpha=.7,
              color="black",
              size=2,
              #se=F,
              # fill="grey40
              show.legend = F
  ) +
  theme_classic()

【Comments】:

    Tags: r function dplyr subset subsampling


    【Solution 1】:

    Here is a function that uses loops, which is closer to what you started with:

    library(dplyr) # for %>% and mutate()
    
    df <- data.frame(year = c(1:46), 
                     sample = seq(from=10, to=30, length.out = 46) +
                       rnorm(46,mean=0,sd=2))
    
    df[c(20,32), 2] <- NA
    
    my_function <- function(n, sample_size, data = df) {
    
      plot_df <- data %>% mutate(group = 'full')
    
      sample_matrix <- matrix(data = NA, nrow = sample_size, ncol = n)
    
      first_row <- 1 # First subset has 1 as first row, no matter how many subsets
    
      for (i in 1:n) {
    
        last_row <- round(i * nrow(data) / n) # Last row of i-th subset; equals nrow(data) when i == n
        sample_matrix[, i] <- sample(first_row:last_row, sample_size, replace = TRUE) # Store sample directly in matrix
        first_row <- last_row + 1 # Next subset starts right after this one
    
        group_name <- paste("group", i, sep = "_") # Column name for i-th group
        plot_df[[group_name]] <- "all" # Column for i-th group
    
      }
    
      for (j in 1:sample_size) {
    
        # Creating a new data frame for new observations
        new_obs <- data[sample_matrix[j,], ]
        new_obs[["group"]] <- j
        for (group_n in 1:n) {
          new_obs[[paste0("group_", group_n)]] <- sample_matrix[j, group_n]
        }
        plot_df <- rbind(plot_df, new_obs) 
    
      }
    
      plot_df # Return the result instead of assigning it into the global environment
    }
    
    plot_df <- my_function(2, 200, data = df)
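
    The question also asks whether draws that land on an NA year can be skipped. One possible way to do that, sketched here in base R only, is to restrict each bin's candidates to rows with a non-NA value before sampling. The helper name `sample_bins`, its `value_col` argument, and the `ceiling()`-based bin assignment are assumptions made for this illustration and are not part of the answer above.

```r
# Hypothetical helper, not part of the answer above: draw one row index per bin,
# skipping rows whose value is NA.
sample_bins <- function(data, n, times, value_col = "sample") {
  rows <- seq_len(nrow(data))
  # Assign each row to one of n contiguous bins of (nearly) equal size
  bin <- ceiling(rows * n / nrow(data))
  valid <- !is.na(data[[value_col]])
  picks <- lapply(seq_len(n), function(b) {
    candidates <- rows[bin == b & valid]
    if (length(candidates) == 0) stop("bin ", b, " has no non-NA rows")
    # Index through sample.int() so a single candidate k is not expanded to 1:k
    candidates[sample.int(length(candidates), times, replace = TRUE)]
  })
  # One row per repetition, one column per bin
  matrix(unlist(picks), nrow = times, ncol = n)
}

# Example data shaped like the question's df
set.seed(1)
df <- data.frame(year = 1:46,
                 sample = seq(10, 30, length.out = 46) + rnorm(46, sd = 2))
df[c(20, 32), 2] <- NA

idx <- sample_bins(df, n = 3, times = 200)
dim(idx)              # 200 rows, 3 columns
anyNA(df$sample[idx]) # FALSE: rows 20 and 32 are never drawn
```

    Each row of `idx` could then stand in for a row of `sample_matrix` in the second loop of the function above. Note that dropping NA rows makes the remaining years in an affected bin slightly more likely to be drawn.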
    

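    【Comments】:

      【Solution 2】:

      If I understood correctly, the following function splits your df into n bins, draws x samples from each bin, and puts the results back into the columns of a df: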

      library(tidyverse)
      
      set.seed(42)
      
      df <- data.frame(year = c(1:46), 
                       sample = seq(from=10,to=30,length.out = 46) + rnorm(46,mean=0,sd=2) )
      
      get_df_sample <- function(df, n, x) {
        df %>% 
          # bin df in n bins of (approx.) equal length
          mutate(bin = ggplot2::cut_number(seq_len(nrow(.)), n, labels = seq_len(n))) %>% 
          # split by bin
          split(.$bin) %>%
          # sample x times from each bin
          map(~ .x[sample(seq_len(nrow(.x)), x, replace = TRUE),]) %>% 
          # keep only column "sample"
          map(~ select(.x, sample)) %>% 
          # Rename: Add number of df-bin from which sample is drawn
          imap(~ rename(.x, !!sym(paste0("sample_", .y)) := sample)) %>%
          # bind
          bind_cols() %>% 
          # Add group = rownames
          rownames_to_column(var = "group")
      }
      get_df_sample(df, 3, 200) %>% 
        head()
      #>   sample_1 sample_2 sample_3 group
      #> 1 12.58631 18.27561 24.74263     1
      #> 2 19.46218 24.24423 23.44881     2
      #> 3 12.92179 18.47367 27.40558     3
      #> 4 15.22020 18.47367 26.29243     4
      #> 5 12.58631 24.24423 24.43108     5
      #> 6 19.46218 23.36464 27.40558     6
      

      Created on 2020-03-24 by the reprex package (v0.3.0)
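
      The wide result above keeps only the sampled values, so the year of each draw is lost. If the goal is to fit one regression line per repetition, as in the question's plot, a long layout that keeps `year` may be easier to work with. The following is a rough base R sketch under that assumption; `sample_long` and its `ceiling()`-based binning are hypothetical and do not reproduce `cut_number()` exactly.

```r
# Hypothetical long-format variant (an assumption, not the answer's code)
sample_long <- function(data, n, x) {
  # Contiguous bins of (nearly) equal size, numbered 1..n
  data$bin <- ceiling(seq_len(nrow(data)) * n / nrow(data))
  pieces <- lapply(split(data, data$bin), function(part) {
    picked <- part[sample.int(nrow(part), x, replace = TRUE), , drop = FALSE]
    picked$group <- seq_len(x) # repetition number, shared across bins
    picked
  })
  out <- do.call(rbind, pieces)
  out <- out[order(out$group, out$bin), ]
  rownames(out) <- NULL
  out
}

set.seed(42)
df <- data.frame(year = 1:46,
                 sample = seq(10, 30, length.out = 46) + rnorm(46, sd = 2))
long <- sample_long(df, n = 3, x = 200)
head(long, 3) # the three rows of group 1, one per bin
```

      Every `group` in `long` then contains exactly one row per bin, so it can be passed to ggplot with `aes(x = year, y = sample, group = group)` in the same way as `plot_df` in the question.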

      【Comments】:
