【Question title】: R: Remove duplicates from a dataframe based on categories in a column
【Posted】: 2017-11-28 19:54:31
【Question】:

Here is my sample dataset:

     Name Course Category
 1: Jason     ML       PT
 2: Jason     ML       DI
 3: Jason     ML       GT
 4: Jason     ML       SY
 5: Jason     DS       SY
 6: Jason     DS       DI
 7: Nancy     ML       PT
 8: Nancy     ML       SY
 9: Nancy     DS       DI
10: Nancy     DS       GT
11: James     ML       SY
12:  John     DS       GT

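I want to remove duplicate rows so that every row in the data frame is unique. Which of the duplicate rows gets dropped depends on the value in the Category column. The preference among Category values is given by this order: {'PT','DI','GT','SY'}.

My output data frame should look like this: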

    Name Course Category
1: Jason     ML       PT
2: Jason     DS       DI
3: Nancy     ML       PT
4: Nancy     DS       DI
5: James     ML       SY
6:  John     DS       GT

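Currently I am using a combination of a for loop and if conditions. Because the input data frame is large (10 million rows), this takes a very long time. Is there a better, more efficient way to do the same thing?

【Discussion】:

  • It looks like you are removing rows based on the Name and Course columns. Check again.
  • You are right to some extent. But the removal still depends on the Category column and its specific order.
  • The question is not clear. Sort by the Category column first, then remove duplicates based on Name and Course.

Tags: r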


【Solution 1】:

Here is a snippet that does what you asked:

df$Category <- factor(df$Category, levels = c("PT", "DI", "GT", "SY"))

df <- df[order(df$Category),]

df[!duplicated(df[,c('Name', 'Course')]),]

Output:

 Name Course Category
Jason     ML       PT
Nancy     ML       PT
Jason     DS       DI
Nancy     DS       DI
 John     DS       GT
James     ML       SY

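The idea is to sort the rows by the priority order first. duplicated() then flags every row after the first one for each Name and Course pair, so dropping the flagged rows keeps the highest-priority row per pair, which is the result we want.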
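
The same sort-then-deduplicate idea can be sketched without converting the column to a factor, by ranking each category with match(). This is only an illustration in base R: the names `priority`, `pref_rank`, `sorted` and `result` are made up for the sketch, and the data frame is retyped from the question with the column spelled Category.

```r
# Sample data retyped from the question (column spelled "Category" here).
df <- data.frame(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

# Preference order from the question: earlier entries win.
priority <- c("PT", "DI", "GT", "SY")

# Position of each row's category in the preference order (1 = most preferred).
pref_rank <- match(df$Category, priority)

# Sort by that position; rows with equal rank keep their original relative order.
sorted <- df[order(pref_rank), ]

# duplicated() flags every occurrence after the first, so negating it keeps
# the best-ranked row for each (Name, Course) pair.
result <- sorted[!duplicated(sorted[, c("Name", "Course")]), ]
print(result)
```

A category that is missing from `priority` would get a rank of NA, and order() places NA last by default, so such rows would only survive if their pair has no ranked row.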

【Discussion】:

    【Solution 2】:

    Since you mentioned that you have 10 million rows, here is a data.table solution:

    library(data.table)
    
    setDT(df)[, .SD[which.min(factor(Category, levels = c("PT","DI","GT","SY")))], by=.(Name, Course)]
    

    Result:

        Name Course Category
    1: Jason     ML       PT
    2: Jason     DS       DI
    3: Nancy     ML       PT
    4: Nancy     DS       DI
    5: James     ML       SY
    6:  John     DS       GT
    

    Benchmark:

    # Random resampling of `df` to generate 10 million rows
    set.seed(123)
    df_large = data.frame(lapply(df, sample, 1e7, replace = TRUE))
    
    # Data prep Base R  
    df1 <- df_large
    
    df1$Category <- factor(df1$Category, levels = c("PT", "DI", "GT", "SY"))
    
    df1 <- df1[order(df1$Category), ]
    
    # Data prep data.table
    df2 <- df_large
    
    df2$Category <- factor(df2$Category, levels = c("PT", "DI", "GT", "SY"))
    
    setDT(df2)
    

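    Results:

    library(microbenchmark)
    # Note: `df2$Category` refers to the whole column, not the current group's values;
    # a per-group selection would use the bare column name, `which.min(Category)`.
    microbenchmark(df1[!duplicated(df1[,c('Name', 'Course')]), ], 
                   df2[, .SD[which.min(df2$Category)], by=.(Name, Course)])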
    
    Unit: milliseconds
                                                          expr       min        lq      mean    median        uq       max neval
                df1[!duplicated(df1[, c("Name", "Course")]), ] 1696.7585 1719.4932 1788.5821 1774.3131 1803.7565 2085.9722   100
     df2[, .SD[which.min(df2$Category)], by = .(Name, Course)]  387.8435  409.9365  436.4381  427.6739  451.1776  558.2749   100
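
    The data.table expression in the benchmark call indexes with `which.min(df2$Category)`, which looks at the full column rather than the rows of the current group. A minimal sketch of the per-group form on the sample data follows; it was not timed, so it says nothing about the figures above. The names `dt` and `best` are made up for the sketch, and as.integer() is used only to make the comparison on factor codes explicit.

```r
library(data.table)

# Sample data retyped from the question, with Category as an ordered-by-levels factor.
dt <- data.table(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = factor(c("PT", "DI", "GT", "SY", "SY", "DI",
                      "PT", "SY", "DI", "GT", "SY", "GT"),
                    levels = c("PT", "DI", "GT", "SY"))
)

# Inside j, the bare name `Category` holds only the current group's values,
# so which.min() picks the most preferred row within each (Name, Course) group.
best <- dt[, .SD[which.min(as.integer(Category))], by = .(Name, Course)]
print(best)
```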
    

    Data:

    df = structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 4L, 
    4L, 4L, 4L, 1L, 3L), .Label = c("James", "Jason", "John", "Nancy"
    ), class = "factor"), Course = structure(c(2L, 2L, 2L, 2L, 1L, 
    1L, 2L, 2L, 1L, 1L, 2L, 1L), .Label = c("DS", "ML"), class = "factor"), 
        Category = structure(c(3L, 1L, 2L, 4L, 4L, 1L, 3L, 4L, 1L, 
        2L, 4L, 2L), .Label = c("DI", "GT", "PT", "SY"), class = "factor")), .Names = c("Name", 
    "Course", "Category"), class = "data.frame", row.names = c("1:", 
    "2:", "3:", "4:", "5:", "6:", "7:", "8:", "9:", "10:", "11:", 
    "12:"))
    

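    【Discussion】:

      【Solution 3】:

      You are not removing rows based on category; you are actually trying to remove fully duplicated rows from the data frame.

      You can remove fully duplicated rows by subsetting the data frame:

      base R: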
      df_without_dupes <- df[!duplicated(df),]
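
      As the comment below this answer suggests, full-row deduplication does not address the question. A quick check on the sample data, retyped here with the column spelled Category, shows that no row is a complete duplicate of another, so the reduction has to be keyed on Name and Course. The variable names in this sketch are illustrative only.

```r
# Sample data retyped from the question.
df <- data.frame(
  Name     = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason",
               "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
  Course   = c("ML", "ML", "ML", "ML", "DS", "DS",
               "ML", "ML", "DS", "DS", "ML", "DS"),
  Category = c("PT", "DI", "GT", "SY", "SY", "DI",
               "PT", "SY", "DI", "GT", "SY", "GT"),
  stringsAsFactors = FALSE
)

# Number of rows that repeat an earlier row in every column.
n_full_dupes <- sum(duplicated(df))

# Rows left after full-row deduplication.
n_after_full <- nrow(df[!duplicated(df), ])

# Rows left when only Name and Course define a duplicate.
n_after_keyed <- nrow(df[!duplicated(df[, c("Name", "Course")]), ])

print(c(full_dupes = n_full_dupes, after_full = n_after_full, after_keyed = n_after_keyed))
```

      Keying on Name and Course alone keeps whichever row comes first, so it still needs the priority sort from the other answers to pick the preferred Category.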
      

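      【Discussion】:

      • I usually use unique to remove duplicates, but how does your solution answer my question?
      • It subsets your original data frame to remove duplicates; unique would work too. I was about to add the factor ordering, but Cihan has already added it.
      【Solution 4】:

      I suggest using dplyr for this.

      See below: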

      require(dplyr)
      
      data %>% 
        mutate(
          Category_factored=as.numeric(factor(Category,levels=c('PT','DI','GT','SY'),labels=1:4))
        ) %>% 
        group_by(Name,Course) %>% 
        filter(
          Category_factored == min(Category_factored)
        )
      

      If you are new to R, install dplyr with install.packages('dplyr').
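
      One thing to be aware of: filter() with `== min(...)` keeps every row tied for the minimum, so if the real data contained the same Name, Course and Category more than once, more than one row per group would survive. The question does not say whether that can happen. The sketch below uses a small made-up data frame with such a repeat and contrasts the filter form with slice_min(..., with_ties = FALSE), assuming dplyr 1.0.0 or later; `priority`, `with_filter` and `one_per_group` are hypothetical names.

```r
library(dplyr)

# Hypothetical data (not from the question): the PT row appears twice.
data <- data.frame(
  Name     = c("Jason", "Jason", "Jason"),
  Course   = c("ML", "ML", "ML"),
  Category = c("PT", "PT", "DI"),
  stringsAsFactors = FALSE
)

priority <- c("PT", "DI", "GT", "SY")

# Keeps all rows tied for the best rank within each group.
with_filter <- data %>%
  group_by(Name, Course) %>%
  filter(match(Category, priority) == min(match(Category, priority))) %>%
  ungroup()

# Keeps exactly one row per group, even when ranks tie.
one_per_group <- data %>%
  group_by(Name, Course) %>%
  slice_min(match(Category, priority), n = 1, with_ties = FALSE) %>%
  ungroup()

print(nrow(with_filter))
print(nrow(one_per_group))
```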

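      【Discussion】:

        【Solution 5】:

        You need to create an index that represents the order of the categories. Then sort by the priority of your categories and deduplicate by Name and Course.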

        library(tidyverse)
        
        #create index to sort by
        index.df <- data.frame("Category" = c('PT',"DI","GT","SY"), "Index" = c(1,2,3,4))
        
        #join to orig dataset
        data <- left_join(data, index.df, by = "Category")
        
        #sort by index, dedup with Name and Course
        data %>% arrange(Index) %>% group_by(Name,Course) %>% 
        distinct(Name,Course, .keep_all = TRUE) %>% select(-Index)
        

        【Discussion】:

          【Solution 6】:

          A quick benchmark of the given solutions:

          library(microbenchmark)
          library(tidyverse)
          library(data.table)
          
          # 1. Data set
          df_raw <- data.frame(
            name = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason", "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"),
            course = c("ML", "ML", "ML", "ML", "DS", "DS", "ML", "ML", "DS", "DS", "ML", "DS"),
            category = c("PT", "DI", "GT", "SY", "SY", "DI", "PT", "SY", "DI", "GT", "SY", "GT"),
            stringsAsFactors = FALSE)
          
          # 3. Solution 'base R'
          f1 <- function(){
            # 1. Create data set
            df <- df_raw
          
            # 2. Convert 'category' to a factor
            df$category <- factor(df$category, levels = c("PT", "DI", "GT", "SY"))
          
            # 3. Sort by 'category'
            df <- df[order(df$category), ]
          
            # 4. Select rows without duplicates by 'name' and 'course'
            df[!duplicated(df[,c('name', 'course')]), ]
          }
          
          # 4. Solution 'dplyr'
          f2 <- function(){
            # 1. Create data set
            df <- df_raw
          
            # 2. Solution
            df_raw %>% 
              mutate(category_factored = as.numeric(factor(category, levels = c('PT','DI','GT','SY'), labels = 1:4))) %>% 
              group_by(name, course) %>% 
              filter(category_factored == min(category_factored))
          }
          
          # 5. Solution 'data.table'
          f3 <- function(){
            # 1. Create data set
            df <- df_raw
          
            # 2. Solution
            setDT(df)[, .SD[which.min(factor(category, levels = c("PT","DI","GT","SY")))], by=.(name, course)]
          }
          
          # 6. Solution 'dplyr' with an index join
          f4 <- function(){
          
            # 1. Create data set
            df <- df_raw
          
            # 2. Create 'index' to sort by
            df_index <- data.frame("category" = c('PT',"DI","GT","SY"), "index" = c(1, 2, 3, 4))
          
            # 3. Join to original dataset
            df <- left_join(df, df_index, by = "category")
          
            # 4. Sort by 'index', dedup with 'name' and 'course'
            df %>% 
              arrange(index) %>% 
              group_by(name, course) %>% 
              distinct(name, course, .keep_all = TRUE) %>% 
              select(-index)
          }
          
          # Test for solutions
          microbenchmark(f1(), f2(), f3(), f4())
          
          Unit: milliseconds
          expr       min        lq      mean    median        uq       max neval  cld
          f1()  1.350875  1.468044  1.682641  1.603816  1.687203  5.006231   100 a   
          f2() 12.547863 12.864521 13.766343 13.543806 14.227795 18.350335   100   c 
          f3()  2.517014  2.634612  2.944483  2.792619  2.873013  9.355626   100  b  
          f4() 21.073892 21.608212 23.246332 22.338600 23.934932 41.883938   100    d
          

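          As you can see, the fastest solutions are f1() and f3().

          【Discussion】:

          • Note that these benchmarks are based on a small dataset. Tested on a dataset with 10 million rows, my data.table solution is more than 4 times faster than the base R solution.
          • Wow! So in terms of both performance and code style, this is the best solution. I absolutely love it.
          【Solution 7】:

          I may be late, but I believe this is the simplest solution. Since you mentioned 10 million rows, I suggest data.table's implementation of the unique function, which is very easy to understand.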

          require("data.table")
          df <- data.table("Name" = c("Jason", "Jason", "Jason", "Jason", "Jason", "Jason", "Nancy", "Nancy", "Nancy", "Nancy", "James", "John"), "Course" = c("ML", "ML", "ML", "ML", "DS", "DS", "ML", "ML", "DS", "DS", "ML", "DS"), "category" = c("PT", "DI", "GT", "SY", "SY", "DI", "PT", "SY", "DI", "GT", "SY", "GT"))
          
          unique(df[, category := factor(category, levels = c("PT","DI","GT","SY"))][order(df$"category")], by = c("Name", "Course"))
          
              Name Course category
          1: Jason     ML       PT
          2: Nancy     ML       PT
          3: Jason     DS       DI
          4: Nancy     DS       DI
          5:  John     DS       GT
          6: James     ML       SY
          

          【Discussion】:
