R: Remove duplicates from a dataframe based on categories in a column (answers)

【Question title】: R: Remove duplicates from a dataframe based on categories in a column
【Posted】: 2017-11-28 19:54:31
【Question】:

Here is my sample dataset:

      Name Course Cateory
 1: Jason     ML      PT
 2: Jason     ML      DI
 3: Jason     ML      GT
 4: Jason     ML      SY
 5: Jason     DS      SY
 6: Jason     DS      DI
 7: Nancy     ML      PT
 8: Nancy     ML      SY
 9: Nancy     DS      DI
10: Nancy     DS      GT
11: James     ML      SY
12:  John     DS      GT

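I want to remove duplicate rows so that the data frame contains only unique rows. Which of the duplicate rows is kept depends on the value in the category column. The preference among category values is given in this order: {'PT','DI','GT','SY'}.

My output data frame should look like this:

    Name Course Cateory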
1: Jason     ML      PT
2: Jason     DS      DI
3: Nancy     ML      PT
4: Nancy     DS      DI
5: James     ML      SY
6:  John     DS      GT

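Currently I am using a combination of for loops and if conditions. Since the input data frame is large (10 million rows), this takes a very long time. Is there a better, more efficient way to do the same thing?

【Question comments】:

It looks like you are removing duplicates based on the Name and Course columns. Check again.
You are right to some extent. But the removal still depends on the category column and the specific order.
The question doesn't make that clear. Sort by the Category column first, then remove duplicates based on Name and Course.

Tags: r

【Solution 1】:

Here is a snippet that does what you asked: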

df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))

df <- df[order(df$Category),]

df[!duplicated(df[,c('Name', 'Course')]),]

Output:

Name  Course Category
Jason     ML       PT
Nancy     ML       PT
Jason     DS       DI
Nancy     DS       DI
John      DS       GT
James     ML       SY

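The idea is to sort by the priority order first. We then drop duplicates on Name and Course, which keeps the first match in each group. Because of the sort, that first match is the row with the most preferred category, which is what we want.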
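As a minimal sketch of that sort-then-dedupe idea, the following base R snippet rebuilds the question's sample data and uses match() against a priority vector instead of the factor conversion. The names `priority`, `prio_rank`, `sorted`, and `result` are illustrative and not part of the answer; the column is spelled `Category`, as in the answers.

```r
# Sample data from the question (column spelled "Category", as in the answers)
df <- data.frame(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

# Preference order given in the question (first = most preferred)
priority <- c("PT", "DI", "GT", "SY")

# match() returns each row's position in the priority vector (1 = most preferred)
prio_rank <- match(df$Category, priority)

# Sort by that rank; order() leaves ties in their original order
sorted <- df[order(prio_rank), ]

# Keep the first row seen for each (Name, Course) pair
result <- sorted[!duplicated(sorted[, c("Name", "Course")]), ]
print(result)
```

`result` should contain one row per Name/Course pair, listed in priority order rather than in the original row order.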

【Comments】:

【Solution 2】:

Since you mentioned that you have 10 million rows, here is a data.table solution:

library(data.table)

setDT(df)[, .SD[which.min(factor(Category, levels = c("PT","DI","GT","SY")))], by=.(Name, Course)]

Result:

    Name Course Category
1: Jason     ML       PT
2: Jason     DS       DI
3: Nancy     ML       PT
4: Nancy     DS       DI
5: James     ML       SY
6:  John     DS       GT

Benchmark:

# Random resampling of `df` to generate 10 million rows
set.seed(123)
df_large = data.frame(lapply(df, sample, 1e7, replace = TRUE))

# Data prep Base R  
df1 <- df_large

df1$Category <- factor(df1$Category, levels = c("PT", "DI", "GT", "SY"))

df1 <- df1[order(df1$Category), ]

# Data prep data.table
df2 <- df_large

df2$Category <- factor(df2$Category, levels = c("PT", "DI", "GT", "SY"))

setDT(df2)

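Results:

library(microbenchmark)
# Note: inside `[` with `by`, a bare `Category` refers to the current group's values,
# whereas `df2$Category` is the full column, so the second expression below does not
# pick the per-group minimum. It is kept as written because the timings that follow
# were recorded with it.
microbenchmark(df1[!duplicated(df1[,c('Name', 'Course')]), ], 
               df2[, .SD[which.min(df2$Category)], by=.(Name, Course)])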

Unit: milliseconds
                                                      expr       min        lq      mean    median        uq       max neval
            df1[!duplicated(df1[, c("Name", "Course")]), ] 1696.7585 1719.4932 1788.5821 1774.3131 1803.7565 2085.9722   100
 df2[, .SD[which.min(df2$Category)], by = .(Name, Course)]  387.8435  409.9365  436.4381  427.6739  451.1776  558.2749   100

Data:

df = structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 4L, 
4L, 4L, 4L, 1L, 3L), .Label = c("James", "Jason", "John", "Nancy"
), class = "factor"), Course = structure(c(2L, 2L, 2L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 2L, 1L), .Label = c("DS", "ML"), class = "factor"), 
    Category = structure(c(3L, 1L, 2L, 4L, 4L, 1L, 3L, 4L, 1L, 
    2L, 4L, 2L), .Label = c("DI", "GT", "PT", "SY"), class = "factor")), .Names = c("Name", 
"Course", "Category"), class = "data.frame", row.names = c("1:", 
"2:", "3:", "4:", "5:", "6:", "7:", "8:", "9:", "10:", "11:", 
"12:"))

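【Comments】:

【Solution 3】:

You are not removing rows based on category; you are actually trying to remove fully duplicated rows from the data frame.

You can remove fully duplicated rows by subsetting the data frame: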

base R:
df_without_dupes <- df[!duplicated(df),]
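
To see how whole-row de-duplication behaves on the question's sample data, here is a small check, assuming the data is rebuilt as below (the name `df_check` and the two counters are hypothetical). Every row in the sample differs from the others in at least one column, so duplicated() on full rows flags nothing, while restricting it to Name and Course flags the extra rows in each group.

```r
# Sample data from the question (column spelled "Category", as in the answers)
df_check <- data.frame(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

# Rows that repeat an earlier row in all three columns
full_row_dupes <- sum(duplicated(df_check))

# Rows that repeat an earlier (Name, Course) pair, whatever their Category
key_dupes <- sum(duplicated(df_check[, c("Name", "Course")]))

print(c(full_row = full_row_dupes, by_key = key_dupes))
```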

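【Comments】:

I usually use unique to remove duplicates, but how does your solution answer my question?
It subsets your original data frame to remove duplicates; unique would work too. I was about to add the factor ordering, but Cihan has already added it.

【Solution 4】:

I suggest using the dplyr package for this.

See below: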

require(dplyr)

data %>% 
  mutate(
    Category_factored=as.numeric(factor(Category,levels=c('PT','DI','GT','SY'),labels=1:4))
  ) %>% 
  group_by(Name,Course) %>% 
  filter(
    Category_factored == min(Category_factored)
  )

If you are new to R, install dplyr with install.packages('dplyr').
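
For comparison, the same "keep the rows whose category has the lowest rank within each group" logic can be sketched in base R with ave(). This is an illustrative equivalent rather than part of the answer above; `df_demo`, `prio`, `group_min`, and `keep` are assumed names. Like the filter() call, it keeps every row that ties for the group minimum, so exact duplicate rows within a group would all survive.

```r
# Sample data from the question (column spelled "Category", as in the answers)
df_demo <- data.frame(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

# Numeric priority of each row's category (1 = most preferred)
prio <- match(df_demo$Category, c("PT", "DI", "GT", "SY"))

# ave() applies min() within each (Name, Course) group and returns a vector
# aligned with the original rows
group_min <- ave(prio, df_demo$Name, df_demo$Course, FUN = min)

# Keep the rows whose priority equals their group's minimum
keep <- prio == group_min
print(df_demo[keep, ])
```

Because no sorting is involved, the surviving rows stay in their original order.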

【Comments】:

【Solution 5】:

You need to create an index that represents the order of the categories. Then sort by that priority and de-duplicate by Name and Course.

library(tidyverse)

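# Create a lookup table that maps each category to its priority
# (the column is spelled "Cateory" here to match the question's data)
index.df <- data.frame("Cateory" = c('PT',"DI","GT","SY"), "Index" = c(1,2,3,4),
                       stringsAsFactors = FALSE)

# Join the priority onto the original dataset
data <- left_join(data, index.df, by = "Cateory")

# Sort by priority, keep the first row per Name/Course, then drop the helper column
data %>% arrange(Index) %>% group_by(Name,Course) %>% 
  distinct(Name,Course, .keep_all = TRUE) %>% select(-Index)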

【Comments】:

【Solution 6】:

A quick benchmark of the given solutions:

library(microbenchmark)
library(tidyverse)
library(data.table)

# 1. Data set
df_raw <- data.frame(
  name = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason", "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  course = c("ML", "ML", "ML", "ML", "DS", "DS", "ML", "ML", "DS", "DS", "ML", "DS"),
  category = c("PT", "DI", "GT", "SY", "SY", "DI", "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE)

# 3. Solution 'base R'
f1 <- function(){

  # 1. Create data set
  df <- df_raw

  # 2. Convert 'category' to a factor
  df$category <- factor(df$category, levels = c("PT", "DI", "GT", "SY"))

  # 3. Sort by 'category'
  df <- df[order(df$category), ]

  # 4. Select rows without duplicates by 'name' and 'course'
  df[!duplicated(df[,c('name', 'course')]), ]

}

# 4. Solution 'dplyr'
f2 <- function(){
  # 1. Create data set
  df <- df_raw

  # 2. Solution
  df %>% 
    mutate(category_factored = as.numeric(factor(category, levels = c('PT','DI','GT','SY'), labels = 1:4))) %>% 
    group_by(name, course) %>% 
    filter(category_factored == min(category_factored))
}

# 5. Solution 'data.table'
f3 <- function(){
  # 1. Create data set
  df <- df_raw

  # 2. Solution
  setDT(df)[, .SD[which.min(factor(category, levels = c("PT","DI","GT","SY")))], by=.(name, course)]
}

# 6. Solution 'dplyr'
f4 <- function(){

  # 1. Create data set
  df <- df_raw

  # 2. Create 'index' to sort by
  df_index <- data.frame("category" = c('PT',"DI","GT","SY"), "index" = c(1, 2, 3, 4))

  # 3. Join to original dataset
  df <- left_join(df, df_index, by = "category")

  # 4. Sort by 'index', dedup with 'name' and 'course'
  df %>% 
    arrange(index) %>% 
    group_by(name, course) %>% 
    distinct(name, course, .keep_all = TRUE) %>% 
    select(-index)
}

# Test for solutions
microbenchmark(f1(), f2(), f3(), f4())

Unit: milliseconds
expr       min        lq      mean    median        uq       max neval  cld
f1()  1.350875  1.468044  1.682641  1.603816  1.687203  5.006231   100 a   
f2() 12.547863 12.864521 13.766343 13.543806 14.227795 18.350335   100   c 
f3()  2.517014  2.634612  2.944483  2.792619  2.873013  9.355626   100  b  
f4() 21.073892 21.608212 23.246332 22.338600 23.934932 41.883938   100    d

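As you can see, the fastest solutions on this dataset are f1() and f3().

【Comments】:

Note that these benchmarks are based on a small dataset. Tested on a dataset with 10 million rows, my data.table solution is more than 4 times faster than the base R solution.
Wow! So in terms of both performance and code style, that is the best solution. I absolutely love it.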
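
Since timings depend on data size, here is a rough sketch of how one might time the base R approach on a larger synthetic dataset. Everything in it is an assumption for illustration: the size `n`, the seed, the helper name `dedup_base`, and the uniformly random sampling, which yields only 8 Name/Course groups and so may not resemble real data with many groups. No timing is quoted because it will vary by machine.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n <- 1e5       # hypothetical size; raise toward 1e7 to approximate the question

# Synthetic data drawn from the values that appear in the question
big <- data.frame(
  Name     = sample(c("Jason", "Nancy", "James", "John"), n, replace = TRUE),
  Course   = sample(c("ML", "DS"), n, replace = TRUE),
  Category = sample(c("PT", "DI", "GT", "SY"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Sort by category priority, then keep the first row per (Name, Course)
dedup_base <- function(d) {
  d <- d[order(match(d$Category, c("PT", "DI", "GT", "SY"))), ]
  d[!duplicated(d[, c("Name", "Course")]), ]
}

# system.time() reports user, system, and elapsed seconds for the expression
timing <- system.time(out <- dedup_base(big))
print(timing["elapsed"])
```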

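【Solution 7】:

I may be late, but I believe this is the simplest solution. Since you mentioned 10 million rows, I suggest a data.table implementation built on the unique function, which is very easy to understand.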

library(data.table)
df <- data.table(
  Name = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason", "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course = c("ML", "ML", "ML", "ML", "DS", "DS", "ML", "ML", "DS", "DS", "ML", "DS"),
  category = c("PT", "DI", "GT", "SY", "SY", "DI", "PT", "SY", "DI", "GT", "SY", "GT"))

# Convert category to an ordered-level factor, sort by it, then keep the first row per Name/Course
unique(df[, category := factor(category, levels = c("PT","DI","GT","SY"))][order(category)], by = c("Name", "Course"))

    Name Course category
1: Jason     ML       PT
2: Nancy     ML       PT
3: Jason     DS       DI
4: Nancy     DS       DI
5:  John     DS       GT
6: James     ML       SY

【Comments】: