Removing rows in data.frame having columns subsumed in others: answers

【Question title】: Removing rows in data.frame having columns subsumed in others
【Posted】: 2017-08-02 01:16:12
【Question】:

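I am trying to implement something similar to unique on a data.frame in which each element of one column is a vector. If the vector in that column for one row is a subset of, or equal to, the vector in another row, I want to remove the row with fewer elements. I can achieve this with nested for loops, but since the data contains 400,000 rows, the program is very inefficient.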

Sample data

# Set the seed for reproducibility 
set.seed(42)

# Create a random data frame
mydf <- data.frame(items = rep(letters[1:4], length.out = 20), 
                   grps = sample(1:5, 20, replace = TRUE),
                   supergrp =  sample(LETTERS[1:4], replace = TRUE))


# Aggregate items into a single column
temp <- aggregate(items ~ grps + supergrp, mydf, unique)

# Arrange by number of items for each grp and supergroup 
indx <- order(lengths(temp$items), decreasing = TRUE)
temp <- temp[indx, ,drop=FALSE]

temp looks like

       grps supergrp   items
    1     4        D a, c, d
    2     3        D    c, d
    3     5        D    a, d
    4     1        A       b
    5     2        A       b
    6     3        A       b
    7     4        A       b
    8     5        A       b
    9     1        D       d
   10     2        D       c

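Now you can see that the combinations of supergrp and items in the second and third rows are contained in the first row. So I want to delete the second and third rows from the result. Similarly, rows 5 to 8 are contained in row 4. Finally, rows 9 and 10 are contained in the first row, so I want to delete rows 9 and 10 as well. My result would therefore look like this: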

      grps supergrp   items
    1    4        D a, c, d
    4    1        A       b
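
The containment rule described above can be written as a small predicate. This is only a sketch: `is_subsumed` is a hypothetical helper name, and the three rows are copied by hand from the table rather than taken from `temp`.

```r
# Item sets copied by hand from rows 1, 2 and 4 of the table above
row1 <- list(supergrp = "D", items = c("a", "c", "d"))
row2 <- list(supergrp = "D", items = c("c", "d"))
row4 <- list(supergrp = "A", items = "b")

# Hypothetical helper: TRUE when row x has the same supergrp as row y
# and every item of x also appears in y
is_subsumed <- function(x, y) {
  x$supergrp == y$supergrp && all(x$items %in% y$items)
}

is_subsumed(row2, row1)  # TRUE: "c, d" is a subset of "a, c, d" within D
is_subsumed(row4, row1)  # FALSE: different supergrp
```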

My implementation is as follows:

# Initialise the result data frame with the first row of the sorted data
newdf <- temp[1, ]

# For all rows in the original data
for(i in seq_len(nrow(temp)))
{
  # Index to check if all the items are found 
  indx <- TRUE

  # Check if item in the original data appears in the new data
  for(j in 1:nrow(newdf))
  {
    if(all(c(temp$supergrp[[i]], temp$items[[i]]) %in% 
           c(newdf$supergrp[[j]], newdf$items[[j]]))){
      # Set indx to FALSE if a row in the new data already contains
      # the supergroup and all items of this row
      indx <- FALSE
    }
  }

  # If none of the rows in new data contain items and supergroup in old data append that
  if(indx){
    newdf <- rbind(newdf, temp[i, ])
  }
}
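
One possible way to cut down the work of the loop above, offered as an untested-at-scale sketch rather than a benchmarked answer: compare each row only against already-kept rows of the same supergrp, and record the decision in a logical vector instead of growing a data frame with rbind. The name `keep_maximal` and the hand-built `temp2` (a copy of the ten rows shown earlier) are assumptions for illustration. The sketch relies on the rows already being sorted by decreasing number of items.

```r
# Hand-built copy of the ten rows of `temp` shown earlier (an assumption;
# not regenerated from mydf)
temp2 <- data.frame(
  grps     = c(4, 3, 5, 1, 2, 3, 4, 5, 1, 2),
  supergrp = c("D", "D", "D", "A", "A", "A", "A", "A", "D", "D"),
  stringsAsFactors = FALSE
)
temp2$items <- list(c("a", "c", "d"), c("c", "d"), c("a", "d"),
                    "b", "b", "b", "b", "b", "d", "c")

# Hypothetical helper: returns TRUE for rows whose item set is not contained
# in any previously kept row with the same supergrp.
# Assumes rows are ordered by decreasing number of items.
keep_maximal <- function(supergrp, items) {
  keep <- logical(length(items))
  for (i in seq_along(items)) {
    # Only rows already kept in the same supergrp can subsume row i
    prior <- which(keep & supergrp == supergrp[i])
    covered <- any(vapply(items[prior],
                          function(s) all(items[[i]] %in% s),
                          logical(1)))
    keep[i] <- !covered
  }
  keep
}

result <- temp2[keep_maximal(temp2$supergrp, temp2$items), ]
result
```

On this hand-built data the sketch keeps rows 1 and 4, which matches the desired result stated in the question. Whether it is fast enough for 400,000 rows depends on how many rows survive within each supergrp.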

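I am sure there is an efficient way to implement this in R, perhaps using the tidy framework and dplyr chains, but I am missing the trick. Apologies for the lengthy question. Any input would be highly appreciated.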

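【Comments】:

Sounds like a network/graph problem. You could use the igraph package to build the graph linking each grp/supergrp pairing to its items, and then assign a "cluster" to each group to identify which items are shared.
Not a complete answer, but you could assign these cluster identifiers like this: library(igraph); int <- interaction(mydf[c("grps","supergrp")]); g <- graph.data.frame(cbind(mydf["items"],int)); clg <- clusters(g); mydf$clusters <- clg$membership[match(int, names(clg$membership))].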

Tags: r dplyr tidyverse

【Solution 1】:

I would try to get the items out of the list column and store them in a longer data frame. Here is my somewhat hacky solution:

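library(stringr)
library(purrr)   # map(), map_df()
library(dplyr)   # %>%, bind_cols(), select(), filter()
library(tidyr)   # gather()

# Split each entry of the list column on commas and bind the pieces
# into one wide data frame (one row per row of temp)
items <- temp$items %>% 
    map(~str_split(., ",")) %>% 
    map_df(~data.frame(.))

# Put grps and supergrp back next to the item columns
out <- bind_cols(temp[, c("grps", "supergrp")], items)

# Reshape to long form and keep the unique supergrp/item pairs
out %>% 
    gather(item_name, item, -grps, -supergrp) %>% 
    select(-item_name, -grps) %>% 
    unique() %>% 
    filter(!is.na(item))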

【Comments】:

Is there a way to sneak the grps field into the result?
I think you could use distinct(supergrp, item) instead of unique().
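
A small sketch of the difference between the two comments above, using dplyr. The data frame `long_items` is made up for illustration and is not the output of the answer's code. `distinct()` drops the other columns by default, while `.keep_all = TRUE` retains them, taking the values from the first matching row, which is one way grps could be carried along.

```r
library(dplyr)

# Made-up long-format data (an assumption, for illustration only)
long_items <- data.frame(
  grps     = c(4, 3, 1),
  supergrp = c("D", "D", "A"),
  item     = c("a", "a", "b"),
  stringsAsFactors = FALSE
)

# Only supergrp and item are returned; duplicate pairs collapse to one row
pairs_only <- long_items %>% distinct(supergrp, item)

# Same de-duplication, but grps from the first occurrence is kept
pairs_with_grps <- long_items %>% distinct(supergrp, item, .keep_all = TRUE)
```

Note that with `.keep_all = TRUE` the retained grps value is simply the one from the first row of each supergrp/item pair, so the row order of the input matters.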