【Question title】: Create all possible combinations of my dataframe columns in R
【Posted】: 2021-02-17 11:28:24
【Question description】:

I am working with a data frame in R that has the following columns:

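I want to create a new data frame that contains all possible combinations of my columns (the data should be multiplied) and is filled in as in the table below. The tricky part is that if a higher-order combination of variables has the value 1, then all lower-order combinations should be set to zero. For example, in the third row, a=1, b=1, c=1. Here abc = 1 and all other combinations should be set to NA, because they are subsets of abc. My original data frame has 6 columns, which makes things more complicated.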

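【Question comments】:

Can you update your question with the steps you have tried? expand.grid is the function for creating a data frame that reflects all combinations, but that does not match your desired result.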
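To make the distinction in this comment concrete, here is a minimal sketch. The toy column names A, B and C are an assumption borrowed from the answers further down, not from the question itself. `expand.grid` crosses values, whereas `combn` enumerates subsets of column names, which is closer to what the question asks for.

```r
# expand.grid crosses the supplied *values*: 2 x 2 x 2 = 8 rows
grid <- expand.grid(A = c(0, 1), B = c(0, 1), C = c(0, 1))
nrow(grid)

# combn enumerates subsets of the column *names*, here all pairs
pairs <- combn(c("A", "B", "C"), 2, FUN = paste, collapse = "")
pairs
```

Neither call applies the "higher combination wins" rule from the question; that part still has to be coded separately, as the answers do.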

Tags: python r dataframe data-science data-analysis

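【Solution 1】:

First of all, thank you very much for your answers. I really appreciate your time. I ended up taking a greedier approach, since it involves less coding. Here is what I did.

# Shout-out to @oliver for the first part of the code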

df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))

# Start by creating a wide data.frame using model.matrix
na.act <- getOption('na.action')
options('na.action' = na.pass)
df_wide <- as.data.frame(model.matrix(~ (.- RESPID)^3 - 1,  #replace 3 with the number of columns in your data
                                      data = df))
options('na.action' = na.act)
df_wide$RESPID <- df$RESPID
df_wide

# Below is the logic I used.
# For each respondent, we count how many of the initial columns A, B, C were selected.

df_wide$count_selected<- rowSums(df_wide[, c("A","B","C")], na.rm = TRUE)

 df_wide
   A  B C A:B A:C B:C A:B:C RESPID count_selected
1  1 NA 1  NA   1  NA    NA      1              2
2 NA  1 1  NA  NA   1    NA      2              2
3  1  1 1   1   1   1     1      3              3


# Now we can use count_selected to get rid of the other columns.
# For instance, for count_selected = 3, we can drop all one- and two-column combinations.
# This could be coded in a better way, but I did it manually.

initial_columns = c("A","B","C")
two_combinations = c("A:B","A:C","B:C")
three_combinations = "A:B:C"  # must match the column name created by model.matrix

df_wide[df_wide$count_selected == 3,c(initial_columns, two_combinations)]<- NA
df_wide[df_wide$count_selected == 2,initial_columns]<- NA


df_wide
   A  B  C A:B A:C B:C A:B:C RESPID count_selected
1 NA NA NA  NA   1  NA    NA      1              2
2 NA NA NA  NA  NA   1    NA      2              2
3 NA NA NA  NA  NA  NA     1      3              3
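The manual lists of one-, two- and three-column combinations become unwieldy with the 6 columns mentioned in the question. A possible generalization, offered only as a sketch and not as the answerer's own code, derives the order of each combination from the number of `:`-separated parts in its column name and blanks every entry whose order is below the row's `count_selected`. The names `combo_order` and `is_lower` are hypothetical helpers introduced for this illustration.

```r
df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))
initial_columns <- c("A", "B", "C")

# Same model.matrix step as above, keeping rows that contain NA
old_opts <- options(na.action = "na.pass")
df_wide <- as.data.frame(model.matrix(~ (. - RESPID)^3 - 1, data = df))
options(old_opts)

# Order of each column = number of original columns in its name ("A:B" -> 2)
combo_order <- lengths(strsplit(names(df_wide), ":", fixed = TRUE))
count_selected <- rowSums(df_wide[, initial_columns], na.rm = TRUE)

# TRUE where a column's order is lower than the number of selected columns in that row
is_lower <- outer(count_selected, combo_order, ">")
df_wide[is_lower] <- NA
df_wide$RESPID <- df$RESPID
df_wide
```

For more columns, only the exponent in the formula and `initial_columns` would need to change.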

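【Discussion】:

【Solution 2】:

Here is a fairly concise tidyverse approach. See my inline comments. Note that this approach does not create columns that would contain only NA, but they are easy to add if needed.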

library(tidyverse)

df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))

res_df <- df %>%
  # create new columns A:C where `1` is replaced with column name
  mutate(across(c(A:C), 
                list(`2` = ~ ifelse(!is.na(.x),
                         cur_column(),
                         NA_character_))
                ),
  # check if old columns should be set to NA
         across(c(A:C),
                ~ case_when(
                  .x == 1 & rowSums(across(c(A:C))) == 1 ~ .x,
                  TRUE ~ NA_real_)
                )
         ) %>% 
  rowwise() %>% 
  # create new column which contains new column name to be created
  mutate(res = paste(na.omit(c_across(A_2:C_2)), collapse = ""),
  # we want to pass this value to our new columns
         val = 1) %>% 
  # now lets create the columns with pivot_wider
  pivot_wider(id_cols = c(RESPID:C),
              names_from = res,
              values_from = val)

res_df
#> # A tibble: 3 x 7
#>   RESPID     A     B     C    AC    BC   ABC
#>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1      1    NA    NA    NA     1    NA    NA
#> 2      2    NA    NA    NA    NA     1    NA
#> 3      3    NA    NA    NA    NA    NA     1

Created on 2020-11-04 by the reprex package (v0.3.0)
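The answer notes that all-NA columns (here `AB`) are not created but are easy to add. One way this might be done in base R is sketched below. The `res_df` in this block is a hand-built stand-in that mirrors the printed output above, and `base_cols` and `all_names` are hypothetical names introduced for the illustration.

```r
# Stand-in for the tibble printed above (values copied from that output)
res_df <- data.frame(RESPID = 1:3,
                     A = NA_real_, B = NA_real_, C = NA_real_,
                     AC = c(1, NA, NA), BC = c(NA, 1, NA), ABC = c(NA, NA, 1))

base_cols <- c("A", "B", "C")

# Every combination name of size 1..3, concatenated without a separator
all_names <- unlist(lapply(seq_along(base_cols), function(k) {
  combn(base_cols, k, FUN = paste, collapse = "")
}))

# Add the combinations that never occurred as all-NA columns, then reorder
missing_cols <- setdiff(all_names, names(res_df))
res_df[missing_cols] <- NA_real_
res_df <- res_df[c("RESPID", all_names)]
names(res_df)
```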

【Discussion】:

【Solution 3】:

Here is a solution that is not well optimized, but I hope it gets the job done:

ff <- function(df=data.frame(A=c(1,NA,1),B=c(NA,1,1),C=c(1,1,1))) {
  if(NCOL(df)==1) return(df)
  #create all possible combinations of names
  combinations <- unlist(lapply(2:NCOL(df), function(k) {
    kCombs <- utils::combn(names(df),k)
    apply(kCombs,2,paste0,collapse=",")
  }))
  #go through all combinations and calculate the product of the associated columns
  allColumnProducts <- lapply(combinations, function(combi) {
    colNamesInCombi <- strsplit(combi,",")[[1]]
    #multiply the selected columns element-wise, starting from a vector of ones
    productOfCols <- Reduce(f = "*",
                            x = as.list(df[,colNamesInCombi]),
                            init = numeric(NROW(df))+1)
    setNames(data.frame(productOfCols),combi)
  })
  #put everything in one dataframe including the initial dataframe
  allColumnProducts <- do.call("cbind",allColumnProducts)
  allColumnProducts <- cbind(df,allColumnProducts)
  #clear all the subsets and make them NA
  for(columnIndex in NCOL(allColumnProducts):(NCOL(df)+1)) {
    rowsWithOne <- allColumnProducts[,columnIndex]==1
    rowsWithOne[is.na(rowsWithOne)] <- FALSE
    if(any(rowsWithOne)) {
      #we found a 1 in the column, so we have to make all columns whose
      #names are part of the name of the current column NA at these rows with 1
      nameParts <- strsplit(names(allColumnProducts)[columnIndex],",")[[1]]
      namesToFillWithNa <- unlist(lapply(1:(length(nameParts)-1), function(k) {
        kCombs <- utils::combn(nameParts,k)
        apply(kCombs,2,paste0,collapse=",")
      }))
      allColumnProducts[rowsWithOne,namesToFillWithNa] <- NA 
    }
  }
  allColumnProducts
}
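The core step inside `ff` is the element-wise product of the columns in one combination, computed with `Reduce`. The short sketch below isolates that step on the example data to show how NA propagates: a product is 1 only in rows where every column of the combination is 1. The variable names `prod_ac` and `prod_abc` are made up for this illustration.

```r
df <- data.frame(A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))

# Start from a vector of ones and multiply the chosen columns in turn
ones <- numeric(nrow(df)) + 1
prod_ac  <- Reduce(`*`, as.list(df[, c("A", "C")]), ones)
prod_abc <- Reduce(`*`, as.list(df[, c("A", "B", "C")]), ones)

prod_ac   # NA in row 2, because A is NA there
prod_abc  # non-NA only in row 3, where all three columns are 1
```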

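You can call it with a data frame. If the names in your data frame also contain ",", you will need a different unique character for the paste and split steps. I hope the comments and the chosen variable names describe well enough what it does.

Update:

I gave this task some more thought and made some optimizations to the part that builds the large data frame, before the columns are cleaned up with NA. I also added the parameters uniqueString and removeUniqueStringInResult. uniqueString is a string that must not occur in the names of the data frame; otherwise some of the strsplit calls will fail. removeUniqueStringInResult removes uniqueString from the names of the resulting data frame. I think this is nice because you now get exactly the result you wanted. If the column names are single characters, concatenating them without a separator causes no ambiguity.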
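A small illustration of why the choice of separator matters, using made-up names rather than anything from the question. A column name that already contains the separator does not survive a paste/strsplit round trip. If the separator happens to be a regular-expression metacharacter, `gsub` and `strsplit` treat it as a pattern unless `fixed = TRUE` is passed.

```r
# A name containing the separator is split into too many pieces
cols <- c("a,b", "c")
joined <- paste0(cols, collapse = ",")
pieces <- strsplit(joined, ",", fixed = TRUE)[[1]]
length(pieces)  # 3 pieces instead of the 2 original names

# "." as a regex matches every character, so everything is removed
as_regex <- gsub(".", "", "A.B")
# with fixed = TRUE only the literal dot is removed
as_fixed <- gsub(".", "", "A.B", fixed = TRUE)
as_regex
as_fixed
```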

ff2 <- function(df, 
                uniqueString = ",",
                removeUniqueStringInResult = TRUE) {
  if(NCOL(df)==1) return(df)
  #go through all combinations of size k, k=2,...,NCOL(df), and calculate the product of the associated columns
  allColumnProducts <- lapply(2:NCOL(df), function(k) {
    kCombs <- utils::combn(names(df),k)
    #the columns are all possible combinations of size k of the names
    kComdDataframe <- lapply(1:NCOL(kCombs), function(i) {
      colNamesInCombi <- kCombs[,i] #columns in the original dataframe df with these names have to be multiplied together
      Reduce(f = "*",
             x = as.list(df[,colNamesInCombi]),
             init = numeric(NROW(df))+1)
    })
    kComdDataframe <- data.frame(do.call("cbind",kComdDataframe))
    kCombNames <- apply(kCombs,2,paste0,collapse=uniqueString)
    names(kComdDataframe) <- kCombNames
    kComdDataframe
  })
  #put everything in one dataframe including the initial dataframe
  allColumnProducts <- do.call("cbind",allColumnProducts)
  allColumnProducts <- cbind(df,allColumnProducts)
  #clear all the subsets and make them NA
  for(columnIndex in NCOL(allColumnProducts):(NCOL(df)+1)) {
    rowsWithOne <- allColumnProducts[,columnIndex]==1
    rowsWithOne[is.na(rowsWithOne)] <- FALSE
    if(any(rowsWithOne)) {
      #we found a 1 in the column, so we have to make all columns whose
      #names are part of the name of the current column NA at these rows with 1
      nameParts <- strsplit(names(allColumnProducts)[columnIndex],uniqueString,fixed=TRUE)[[1]] #split on uniqueString, not a hard-coded ","
      namesToFillWithNa <- unlist(lapply(1:(length(nameParts)-1), function(k) {
        kCombs <- utils::combn(nameParts,k)
        apply(kCombs,2,paste0,collapse=uniqueString)
      }))
      allColumnProducts[rowsWithOne,namesToFillWithNa] <- NA 
    }
  }
  if(removeUniqueStringInResult) {
    names(allColumnProducts) <- gsub(uniqueString,"",names(allColumnProducts),fixed=TRUE)
  } 
  allColumnProducts
}

The execution time improved considerably; see the following benchmark:

testdf <- data.frame(A=c(1,NA,1),B=c(NA,1,1),C=c(1,1,1),D=c(1,1,1),E=c(1,NA,1))
microbenchmark::microbenchmark(ff(testdf),ff2(testdf))
#Unit: milliseconds
#      expr    min      lq      mean  median       uq     max neval
#ff(testdf) 8.6415 8.87095 10.238998 9.00815 11.38315 23.0477   100
#ff2(testdf) 3.7638 3.86935  4.905192 4.00970  5.36295 14.2669   100

【Discussion】:

【Solution 4】:

I would suggest using

na.act <- getOption('na.action')
options('na.action' = na.pass)

# Create wide data.
df_wide <- as.data.frame(model.matrix(~ (.)^3 - RESPID,  #replace 3 with the number of columns in your data
                        data = df))
options('na.action' = na.act)
df_wide$RESPID <- df$RESPID

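Edit:

After spending a bit of time, I came up with a solution. It requires a lot of reshaping, grouping, and nesting of the data, and is probably less efficient than the answer provided by @Jonas. But it gets the job done step by step. It is... somewhat readable, and commented besides. My main idea was to do it in 3 basic steps. They are described in the code, although the code itself is not very well documented.

Initialization: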

df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))

# Start by creating a wide data.frame using model.matrix
na.act <- getOption('na.action')
options('na.action' = na.pass)
df_wide <- as.data.frame(model.matrix(~ (.- RESPID)^3 - 1,  #replace 3 with the number of columns in your data
                                      data = df))
options('na.action' = na.act)
df_wide$RESPID <- df$RESPID
df_wide
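For readers unfamiliar with the formula used here, a minimal sketch on a toy data frame (the `toy` data and its values are assumptions for illustration). In `~ (. - RESPID)^3 - 1`, the dot stands for all columns, `- RESPID` drops the id column, `^3` expands to all interactions up to order three, and `- 1` removes the intercept. Each interaction column is the product of its component columns.

```r
toy <- data.frame(RESPID = 1:2, A = c(1, 1), B = c(1, 0), C = c(1, 1))

mm <- model.matrix(~ (. - RESPID)^3 - 1, data = toy)
colnames(mm)

# An interaction column is the product of its parts: 1 * 0 = 0 in row 2
ab_row2 <- unname(mm[2, "A:B"])
ab_row2

# Number of combination columns for the 6 columns mentioned in the question
n_combos_six <- sum(choose(6, 1:6))
n_combos_six
```

With 6 columns the exponent would be 6 and the wide frame would have 63 combination columns, so the manual steps below are best replaced by something programmatic.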

Step 1:

# Next lets get the groups that should actually be filled.
# To do this, we'll 
# first) make a "long" format, with all the values that are currently filled.
# second) Find the active columns (A, B, C in this case) for the values that are filled, and count the number of active columns
# third) Find the maximum number of active columns for each active column (A, B, C)
# It is not very readable.
library(tidyverse)
library(stringr)
# Create a long version of the data, and split columns into multiple names.
df_longer <- pivot_longer(df_wide, cols = 1:7) %>% 
  mutate(name_split = str_split(name, ':')) %>% 
  mutate(col_count = lengths(name_split)) %>%
  unnest_wider(name_split) %>%
  # Sort by the number of letters used.
  arrange(col_count) %>% 
  rename('First_active' = '...1',
         'Second_Active' = '...2',
         'Third_Active' = '...3') %>% 
  pivot_longer(cols = 4:6, 
               names_to = 'second_name',
               values_to = 'second_value')

Step 2 (and step 3):


# Find the columns with maximum number of letters used:
max_indx <- 
  df_longer %>% group_by(second_name) %>%
    drop_na(second_value, value) %>% 
    ungroup() %>%
    select(RESPID, second_value, value, col_count) %>%
    group_by(second_value, RESPID) %>%
    summarize(indx_max = max(col_count), .groups = 'drop') %>% 
    group_by(RESPID) %>%
    nest(data = second_value) %>% 
    select(data) %>%
    mutate(colname = paste0(data[[1]][[1]], collapse = ':')) %>% 
    ungroup() %>% 
    select(-data)

# Print for visualization
max_indx
# A tibble: 3 x 2
  RESPID colname
   <int> <chr>  
1      1 A:C    
2      3 A:B:C  
3      2 B:C
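As a cross-check of the `max_indx` table, the same per-respondent column name can be derived directly from the original data in base R. This is a sketch of an alternative route, not the pipeline used in this answer: for each row, paste together the names of the columns that equal 1. The names `base_cols`, `selected` and `target` are hypothetical.

```r
df <- data.frame(RESPID = 1:3, A = c(1, NA, 1), B = c(NA, 1, 1), C = c(1, 1, 1))
base_cols <- c("A", "B", "C")

# TRUE where a column is present and equal to 1 (NA counts as not selected)
selected <- !is.na(df[base_cols]) & df[base_cols] == 1

# Join the selected column names per row with the same ":" separator
target <- unname(apply(selected, 1, function(z) paste(base_cols[z], collapse = ":")))
data.frame(RESPID = df$RESPID, colname = target)
```

The result lists the same RESPID and colname pairs as `max_indx`, ordered by RESPID.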

Step 3 (or step 4?):


# Now that we have the max indices, lets overwrite the values in df_wide
for(i in seq_len(nrow(max_indx))){
  name <- max_indx$colname[max_indx$RESPID == i]
  df_wide[df_wide$RESPID == i, name] <- 1
  df_wide[df_wide$RESPID == i, colnames(df_wide)[!colnames(df_wide) %in% c(name, 'RESPID')]] <- 0
}
df_wide
  A B C A:B A:C B:C A:B:C RESPID
1 0 0 0   0   1   0     0      1
2 0 0 0   0   0   1     0      2
3 0 0 0   0   0   0     1      3

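【Discussion】:

Oliver, thank you for the suggestion. This works for creating a data frame with the combinations, but as mentioned, the data in each row should be mutually exclusive; i.e. if a=1, b=1, c=1, only the abc column should be filled with 1 and all other columns should return NA.
Hi @VenkatVala. I have updated the answer. What you are asking for is not easy to find, and in cases like this it helps to break the problem into smaller sub-problems. My thought process is described in the answer. There may well be a simpler way to do this, but this illustrates how I would hack my way to a solution if I ran into a similar problem. If any (or all) of the steps seem hard to follow, I suggest running parts of the pipeline to see the results (which is what I did while writing the answer). If it seems to work, remember to upvote Jonas's answer.
Thank you very much. I will keep trying to understand and implement your code.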