【Question title】: Create all possible combinations of my dataframe columns in R
【Posted】: 2021-02-17 11:28:24
【Question description】:

I am working with a data frame in R that has the following columns:

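I want to create a new data frame that contains all possible combinations of my columns (the data should be multiplied together) and is filled as in the table below. The tricky part is that if a higher-order combination of variables has the value 1, all lower-order combinations should be set to zero. For example, in the third row, a=1, b=1, c=1. Here abc = 1, and all other combinations should be set to NA because they are subsets of abc. My original data frame has 6 columns, which makes things even more complicated.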

【Question comments】:

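  • Can you update your question with the steps you have tried? expand.grid creates a data frame of all combinations, but that does not match the result you want.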
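To illustrate the comment above, here is a minimal sketch of the difference between enumerating combinations of values and combinations of column names. The column names A, B, C are assumed from the example data used in the answers; `combn()` is one possible way to list the name combinations, not necessarily what the commenter had in mind.

```r
# expand.grid() enumerates every combination of the supplied *values*,
# one row per combination; it does not multiply existing columns together.
grid <- expand.grid(A = c(0, 1), B = c(0, 1), C = c(0, 1))
nrow(grid)

# To enumerate combinations of column *names* instead, combn() can be used
# for every combination size k = 1, ..., number of columns.
cols <- c("A", "B", "C")
combos <- unlist(lapply(seq_along(cols), function(k) {
  combn(cols, k, FUN = paste, collapse = ":")
}))
combos
```

The grid has one row per value combination (8 rows here), while `combos` lists the 7 column combinations that the desired result needs as columns.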

Tags: python r dataframe data-science data-analysis


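【Solution 1】:

First of all, thank you very much for your answers. I really appreciate your time. I went with a greedier approach because it involves less coding. Here is what I did.

# Shout-out to @oliver for the first part of the code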

df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))

# Start by creating a wide data.frame using model.matrix
na.act <- getOption('na.action')
options('na.action' = na.pass)
df_wide <- as.data.frame(model.matrix(~ (.- RESPID)^3 - 1,  #replace 3 with the number of columns in your data
                                      data = df))
options('na.action' = na.act)
df_wide$RESPID <- df$RESPID
df_wide

#Below is the logic I used
#So, for each respondent, we calculate how many of the initial A, B, C were selected.

df_wide$count_selected<- rowSums(df_wide[, c("A","B","C")], na.rm = TRUE)

 df_wide
   A  B C A:B A:C B:C A:B:C RESPID count_selected
1  1 NA 1  NA   1  NA    NA      1              2
2 NA  1 1  NA  NA   1    NA      2              2
3  1  1 1   1   1   1     1      3              3


#Now, we can make use of count_selected to get rid of other columns.
#For instance, for count_selected = 3, we can get rid of all one- and two-column combinations.
#This could be coded in a better way, but I did it manually.

initial_columns = c("A","B","C")
two_combinations = c("A:B","A:C","B:C")
three_combinations = "A:B:C"

df_wide[df_wide$count_selected == 3,c(initial_columns, two_combinations)]<- NA
df_wide[df_wide$count_selected == 2,initial_columns]<- NA


df_wide
   A  B  C A:B A:C B:C A:B:C RESPID count_selected
1 NA NA NA  NA   1  NA    NA      1              2
2 NA NA NA  NA  NA   1    NA      2              2
3 NA NA NA  NA  NA  NA     1      3              3
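The manual column lists above could be derived from the data instead. The following is only a sketch under assumptions: it builds the product columns with `combn()` and `Reduce()` rather than `model.matrix()`, and the names `base_cols`, `combos`, `combo_size`, and `result` are introduced here for illustration. It keeps the same idea as the answer: count the selected columns per row and blank every combination that is smaller than that count.

```r
df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))
base_cols <- setdiff(names(df), "RESPID")

# All combinations of the base columns, as a flat list of character vectors
combos <- unlist(
  lapply(seq_along(base_cols), function(k) combn(base_cols, k, simplify = FALSE)),
  recursive = FALSE
)

# One product column per combination; NA propagates through the multiplication
wide <- sapply(combos, function(cols) Reduce(`*`, df[cols]))
colnames(wide) <- vapply(combos, paste, character(1), collapse = ":")

# Number of selected base columns per respondent, as in the answer
count_selected <- rowSums(df[base_cols], na.rm = TRUE)

# Blank every combination that has fewer columns than count_selected
combo_size <- lengths(combos)
wide[outer(count_selected, combo_size, ">")] <- NA

result <- data.frame(wide, RESPID = df$RESPID, check.names = FALSE)
result
```

Because the mask is computed from `combo_size`, the same code should work unchanged for 6 base columns, without writing out the one-, two-, ... column lists by hand.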


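【Comments】:

    【Solution 2】:

    Here is a fairly concise tidyverse approach. See my inline comments. Note that this approach does not create columns that contain only NA, but adding them is easy if needed.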

    library(tidyverse)
    
    df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))
    
    res_df <- df %>%
      # create new columns A:C where `1` is replaced with column name
      mutate(across(c(A:C), 
                    list(`2` = ~ ifelse(!is.na(.x),
                             cur_column(),
                             NA_character_))
                    ),
      # check if old columns should be set to NA
             across(c(A:C),
                    ~ case_when(
                      .x == 1 & rowSums(across(c(A:C))) == 1 ~ .x,
                      TRUE ~ NA_real_)
                    )
             ) %>% 
      rowwise() %>% 
      # create new column which contains new column name to be created
      mutate(res = paste(na.omit(c_across(A_2:C_2)), collapse = ""),
      # we want to pass this value to our new columns
             val = 1) %>% 
      # now lets create the columns with pivot_wider
      pivot_wider(id_cols = c(RESPID:C),
                  names_from = res,
                  values_from = val)
    
    res_df
    #> # A tibble: 3 x 7
    #>   RESPID     A     B     C    AC    BC   ABC
    #>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1      1    NA    NA    NA     1    NA    NA
    #> 2      2    NA    NA    NA    NA     1    NA
    #> 3      3    NA    NA    NA    NA    NA     1
    

    Created on 2020-11-04 by the reprex package (v0.3.0)
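The answer mentions that the all-NA columns are easy to add. One possible way, sketched in base R: the data frame below is a hand-written stand-in for `res_df` (an assumption, so that the example runs without the tidyverse), and `base_cols`, `all_names`, and `missing_cols` are names introduced for illustration.

```r
# Stand-in for res_df from the answer above (typed in by hand)
res_df <- data.frame(RESPID = 1:3,
                     A = NA_real_, B = NA_real_, C = NA_real_,
                     AC = c(1, NA, NA), BC = c(NA, 1, NA), ABC = c(NA, NA, 1))

base_cols <- c("A", "B", "C")

# Every combination name, concatenated without a separator as in the answer
all_names <- unlist(lapply(seq_along(base_cols), function(k) {
  combn(base_cols, k, FUN = paste, collapse = "")
}))

# Add the combinations that never occurred as all-NA columns
missing_cols <- setdiff(all_names, names(res_df))
res_df[missing_cols] <- NA_real_

# Put the columns into a predictable order
res_df <- res_df[c("RESPID", all_names)]
res_df
```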

    【Comments】:

      【Solution 3】:

      Here is a solution that is not very well optimized, but I hope it gets the job done:

      ff <- function(df=data.frame(A=c(1,NA,1),B=c(NA,1,1),C=c(1,1,1))) {
        if(NCOL(df)==1) return(df)
        #create all possible combinations of names
        combinations <- unlist(lapply(2:NCOL(df), function(k) {
          kCombs <- utils::combn(names(df),k)
          apply(kCombs,2,paste0,collapse=",")
        }))
        #go through all combinations and calculate the product of the associated columns
        allColumnProducts <- lapply(combinations, function(combi) {
          colNamesInCombi <- strsplit(combi,",")[[1]]
          # multiply the selected columns element-wise, starting from a vector of ones
          productOfCols <- Reduce(f = "*",
                                 x = as.list(df[,colNamesInCombi]),
                                 init = numeric(NROW(df))+1)
          setNames(data.frame(productOfCols),combi)
        })
        #put everything in one dataframe including the initial dataframe
        allColumnProducts <- do.call("cbind",allColumnProducts)
        allColumnProducts <- cbind(df,allColumnProducts)
        #clear all the subsets and make them NA
        for(columnIndex in NCOL(allColumnProducts):(NCOL(df)+1)) {
          rowsWithOne <- allColumnProducts[,columnIndex]==1
          rowsWithOne[is.na(rowsWithOne)] <- FALSE
          if(any(rowsWithOne)) {
            #we found a 1 in the column, so we have to make all columns whose
            #names are part of the name of the current column NA at these rows with 1
            nameParts <- strsplit(names(allColumnProducts)[columnIndex],",")[[1]]
            namesToFillWithNa <- unlist(lapply(1:(length(nameParts)-1), function(k) {
              kCombs <- utils::combn(nameParts,k)
              apply(kCombs,2,paste0,collapse=",")
            }))
            allColumnProducts[rowsWithOne,namesToFillWithNa] <- NA 
          }
        }
        allColumnProducts
      }
      

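      You can call it with a data frame. If the names in your data frame also contain ",", you will need another unique character for pasting and splitting. I hope the comments and the chosen variable names describe well what it does.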
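      As a small illustration of the separator caveat, here is one way to pick a character that does not occur in any column name. The column names and the candidate list are hypothetical; the answer itself leaves the choice to the caller.

```r
# Hypothetical column names, one of which already contains a comma
col_names <- c("A,1", "B", "C")

# Hypothetical list of candidate separators, tried in order
candidates <- c(",", ":", "|", ";")

# A candidate is usable if it appears in none of the column names
is_free <- !vapply(candidates, function(s) {
  any(grepl(s, col_names, fixed = TRUE))
}, logical(1))
sep <- candidates[is_free][1]

# Round trip: pasting and splitting with the chosen separator recovers the names
joined <- paste(col_names[1:2], collapse = sep)
parts <- strsplit(joined, sep, fixed = TRUE)[[1]]
```

      Using `fixed = TRUE` keeps characters such as "|" from being interpreted as regular-expression syntax.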

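      Update:

      I thought about this task a bit more and optimized the part that builds the large data frame, before the columns are cleaned up with NA. I also added the parameters uniqueString and removeUniqueStringInResult. uniqueString is a string that must not occur in the names of the data frame; otherwise some strsplit calls will fail. removeUniqueStringInResult removes uniqueString from the names of the resulting data frame. I think this is nice because you now get exactly the result you wanted. If the column names are single characters, concatenating the names without a separator causes no confusion.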
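      The remark about single-character names can be checked with a short sketch. The multi-character names below are hypothetical; they show that dropping the separator can make two different combinations collapse to the same column name.

```r
# Hypothetical column names that are not all single characters
col_names <- c("A", "AA", "AB", "B")

# Names of all two-column combinations, with and without a separator
pairs_sep   <- combn(col_names, 2, FUN = paste, collapse = ",")
pairs_nosep <- gsub(",", "", pairs_sep, fixed = TRUE)

# With the separator every name is unique; without it, the pairs
# ("A", "AB") and ("AA", "B") both become "AAB"
anyDuplicated(pairs_sep)
anyDuplicated(pairs_nosep)
```

      With names like these it is probably safer to call the function with removeUniqueStringInResult = FALSE.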

      ff2 <- function(df, 
                      uniqueString = ",",
                      removeUniqueStringInResult = TRUE) {
        if(NCOL(df)==1) return(df)
        #go through all combinations of size k, k=2,...,NCOL(df), and calculate the product of the associated columns
        allColumnProducts <- lapply(2:NCOL(df), function(k) {
          kCombs <- utils::combn(names(df),k)
          #the columns are all possible combinations of size k of the names
          kComdDataframe <- lapply(1:NCOL(kCombs), function(i) {
            colNamesInCombi <- kCombs[,i] #columns in the origianl dataframe df with these names have to be multiplied together
            Reduce(f = "*",
                   x = as.list(df[,colNamesInCombi]),
                   init = numeric(NROW(df))+1)
          })
          kComdDataframe <- data.frame(do.call("cbind",kComdDataframe))
          kCombNames <- apply(kCombs,2,paste0,collapse=uniqueString)
          names(kComdDataframe) <- kCombNames
          kComdDataframe
        })
        #put everything in one dataframe including the initial dataframe
        allColumnProducts <- do.call("cbind",allColumnProducts)
        allColumnProducts <- cbind(df,allColumnProducts)
        #clear all the subsets and make them NA
        for(columnIndex in NCOL(allColumnProducts):(NCOL(df)+1)) {
          rowsWithOne <- allColumnProducts[,columnIndex]==1
          rowsWithOne[is.na(rowsWithOne)] <- FALSE
          if(any(rowsWithOne)) {
            #we found a 1 in the column, so we have to make all columns whose
            #names are part of the name of the current column NA at these rows with 1
            nameParts <- strsplit(names(allColumnProducts)[columnIndex],uniqueString,fixed=TRUE)[[1]]
            namesToFillWithNa <- unlist(lapply(1:(length(nameParts)-1), function(k) {
              kCombs <- utils::combn(nameParts,k)
              apply(kCombs,2,paste0,collapse=uniqueString)
            }))
            allColumnProducts[rowsWithOne,namesToFillWithNa] <- NA 
          }
        }
        if(removeUniqueStringInResult) {
          names(allColumnProducts) <- gsub(uniqueString,"",names(allColumnProducts),fixed=TRUE)
        } 
        allColumnProducts
      }
      

      The execution time improved considerably; see the following benchmark:

      testdf <- data.frame(A=c(1,NA,1),B=c(NA,1,1),C=c(1,1,1),D=c(1,1,1),E=c(1,NA,1))
      microbenchmark::microbenchmark(ff(testdf),ff2(testdf))
      #Unit: milliseconds
      #      expr    min      lq      mean  median       uq     max neval
      #ff(testdf) 8.6415 8.87095 10.238998 9.00815 11.38315 23.0477   100
      #ff2(testdf) 3.7638 3.86935  4.905192 4.00970  5.36295 14.2669   100
      

      【Comments】:

        【Solution 4】:

        I would suggest using

        na.act <- getOption('na.action')
        options('na.action' = na.pass)
        
        # Create wide data.
        df_wide <- as.data.frame(model.matrix(~ (.)^3 - RESPID,  #replace 3 with the number of columns in your data
                                data = df))
        options('na.action' = na.act)
        df_wide$RESPID <- df$RESPID
        

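        Edit:

        After spending a bit of time, I came up with a solution. It requires a lot of data reshaping, grouping, and nesting, and is probably less efficient than the answer provided by @Jonas. But it gets the job done step by step. It is... somewhat readable, and it is commented. My main idea was to do it in 3 basic steps. They are described in the code, although the code itself is not well documented.

        Initialization: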

        df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))
        
        # Start by creating a wide data.frame using model.matrix
        na.act <- getOption('na.action')
        options('na.action' = na.pass)
        df_wide <- as.data.frame(model.matrix(~ (.- RESPID)^3 - 1,  #replace 3 with the number of columns in your data
                                              data = df))
        options('na.action' = na.act)
        df_wide$RESPID <- df$RESPID
        df_wide
        

        Step 1:

        # Next lets get the groups that should actually be filled.
        # To do this, we'll 
        # first) make a "long" format, with all the values that are currently filled.
        # second) Find the active columns (A, B, C in this case) for the values that are filled, and count the number of active columns
        # third) Find the maximum number of active columns for each active column (A, B, C)
        # It is not very readable.
        library(tidyverse)
        library(stringr)
        # Create a long version of the data, and split columns into multiple names.
        df_longer <- pivot_longer(df_wide, cols = 1:7) %>% 
          mutate(name_split = str_split(name, ':')) %>% 
          mutate(col_count = lengths(name_split)) %>%
          unnest_wider(name_split) %>%
          # Sort by the number of letters used.
          arrange(col_count) %>% 
          rename('First_active' = '...1',
                 'Second_Active' = '...2',
                 'Third_Active' = '...3') %>% 
          pivot_longer(cols = 4:6, 
                       names_to = 'second_name',
                       values_to = 'second_value') 
        

        Step 2 (and step 3):

        
        # Find the columns with maximum number of letters used:
        max_indx <- 
          df_longer %>% group_by(second_name) %>%
            drop_na(second_value, value) %>% 
            ungroup() %>%
            select(RESPID, second_value, value, col_count) %>%
            group_by(second_value, RESPID) %>%
            summarize(indx_max = max(col_count), .groups = 'drop') %>% 
            group_by(RESPID) %>%
            nest(data = second_value) %>% 
            select(data) %>%
            mutate(colname = paste0(data[[1]][[1]], collapse = ':')) %>% 
            ungroup() %>% 
            select(-data)
        
        # Print for visualization
        max_indx
        # A tibble: 3 x 2
          RESPID colname
           <int> <chr>  
        1      1 A:C    
        2      3 A:B:C  
        3      2 B:C  
        

        Step 3 (or step 4?):

        
        # Now that we have the max indices, lets overwrite the values in df_wide
        for(i in seq_len(nrow(max_indx))){
          name <- max_indx$colname[max_indx$RESPID == i]
          df_wide[df_wide$RESPID == i, name] <- 1
          df_wide[df_wide$RESPID == i, colnames(df_wide)[!colnames(df_wide) %in% c(name, 'RESPID')]] <- 0
        }
        df_wide
          A B C A:B A:C B:C A:B:C RESPID
        1 0 0 0   0   1   0     0      1
        2 0 0 0   0   0   1     0      2
        3 0 0 0   0   0   0     1      3
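        The steps above boil down to finding, for each respondent, the combination made of all active columns and marking only that one. A compact base R sketch of the same idea follows; it is not the answerer's code, and the names `base_cols`, `all_names`, `active`, `top`, and `result` are introduced here for illustration. Like the output above, it fills the other cells with 0 rather than NA.

```r
df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))
base_cols <- setdiff(names(df), "RESPID")

# Names of all combinations, in the same "A:B" style that model.matrix produces
all_names <- unlist(lapply(seq_along(base_cols), function(k) {
  combn(base_cols, k, FUN = paste, collapse = ":")
}))

# TRUE where a base column is filled with 1
m <- as.matrix(df[base_cols])
active <- !is.na(m) & m == 1

# For each row, the name of the combination made of all active columns
top <- apply(active, 1, function(r) paste(base_cols[r], collapse = ":"))

# Start from all zeros and mark only the largest combination per row
out <- matrix(0, nrow = nrow(df), ncol = length(all_names),
              dimnames = list(NULL, all_names))
hit <- which(top %in% all_names)   # skips rows with no active column
out[cbind(hit, match(top[hit], all_names))] <- 1

result <- data.frame(out, RESPID = df$RESPID, check.names = FALSE)
result
```

        Replacing the 0 in `matrix(0, ...)` with NA would give the NA-filled variant requested in the question.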
        

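        【Comments】:

        • Oliver, thank you for the suggestion. This works for creating a data frame with the combinations, but as mentioned, the data in each row should be mutually exclusive; i.e. if a=1, b=1, c=1, only the abc column should be filled with 1, and all other columns should return NA.
        • Hi @VenkatVala. I have updated the answer. What you are asking for is not easy to find, and in these cases it helps to break the problem into smaller sub-problems. My thought process is described in the answer. There may well be a simpler way to do this, but it illustrates how I would hack my way to a solution if I ran into a similar problem. If any (or all) of the steps seem hard to follow, I suggest executing parts of the pipeline to see the intermediate results (which is what I did while creating the answer). If it seems to work, remember to upvote Jonas's answer.
        • Thank you very much. I will keep trying to understand and implement your code.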