【Question title】: Removing rows of data based on multiple conditions
【Posted】: 2021-05-17 05:53:46
【Question description】:

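Below are my current and desired datasets. When Date, Priority, ID and Revenue are the same but Code differs, I want to keep only the row with the "highest" code.

The hierarchy of codes is B > A > C. If a code contains a B, no matter where in the string, it is assigned to hierarchy level 1.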

large_df_have
   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3


large_df_want
   ID      Date Priority Revenue Code  V1  V2  V3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3

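【Comments on the question】:

  • I understand B>A>C, but how does AB fit into that hierarchy?
  • I will update the question to make it clearer.
  • What happens if you have both 'AB' and 'B'? Which one takes priority?

Tags: r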


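【Solution 1】:

This will do it:

  • Create a dummy column that encodes the hierarchy of the codes according to the given conditions
  • Then keep only the rows with the highest priority within each group
  • Drop the dummy column if it is not needed, using select(-..)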
large_df_have <- read.table(text = '   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3', header = TRUE)

library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
  mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
                                   str_detect(Code, 'A') ~ 2,
                                   str_detect(Code, 'C') ~ 3,
                                   TRUE ~ 4)) %>%
  filter(priority_code == min(priority_code))
#> # A tibble: 1 x 9
#> # Groups:   ID, Date, Priority, Revenue [1]
#>      ID Date      Priority Revenue Code  V1    V2    V3    priority_code
#>   <int> <chr>        <int>   <int> <chr> <chr> <chr> <chr>         <dbl>
#> 1   418 1/01/2020        1    -866 AB    XX2   XX2   XX3               1

Checking a more complex case:

large_df_have <- read.table(text = '   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3
4 419 1/01/2020        1    -866    C XX3 XX1 XX3
5 420 1/01/2020        1    -866    A XX3 XX1 XX3
6 420 1/01/2020        1    -866    C XX3 XX1 XX3', header = TRUE)

library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
  mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
                                   str_detect(Code, 'A') ~ 2,
                                   str_detect(Code, 'C') ~ 3,
                                   TRUE ~ 4)) %>%
  filter(priority_code == min(priority_code))
#> # A tibble: 3 x 9
#> # Groups:   ID, Date, Priority, Revenue [3]
#>      ID Date      Priority Revenue Code  V1    V2    V3    priority_code
#>   <int> <chr>        <int>   <int> <chr> <chr> <chr> <chr>         <dbl>
#> 1   418 1/01/2020        1    -866 AB    XX2   XX2   XX3               1
#> 2   419 1/01/2020        1    -866 C     XX3   XX1   XX3               3
#> 3   420 1/01/2020        1    -866 A     XX3   XX1   XX3               2

Created on 2021-05-17 by the reprex package (v2.0.0)
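
The third step in the list above (dropping the helper column) does not appear in the printed output. A minimal sketch of the complete pipeline, assuming the same three-row `large_df_have` as in the first example and that `priority_code` is not needed afterwards:

```r
library(dplyr)
library(stringr)

large_df_have <- read.table(text = '   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3', header = TRUE)

large_df_want <- large_df_have %>%
  group_by(ID, Date, Priority, Revenue) %>%
  # lower number = higher rank; any code containing B wins
  mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
                                   str_detect(Code, 'A') ~ 2,
                                   str_detect(Code, 'C') ~ 3,
                                   TRUE ~ 4)) %>%
  filter(priority_code == min(priority_code)) %>%
  ungroup() %>%            # remove the grouping before dropping the helper
  select(-priority_code)   # drop the helper column

large_df_want
```

Note that ties are kept: if two rows in a group share the best rank, both survive the filter.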

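【Comments】:

    【Solution 2】:

    Depending on how many Code values you have, you could create an extra numeric column that reflects the hierarchy (in case the non-alphabetical ordering causes confusion). If you are not familiar with it, check ?ifelse; the syntax is ifelse(test, yes, no): where test is TRUE the returned value is the one given by yes, otherwise it is the one given by no.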

    library(dplyr)

    large_df_have %>%
       mutate(
          Code2 = ifelse(Code == 'B', 1, NA), # for the first time we make the Code2 column, 'no' values need to be NA 
          Code2 = ifelse(Code == 'A', 2, Code2), # rather than the 'no' result being NA as above, we keep any pre-existing values, eg the ones we just made in the line above
          Code2 = ifelse(Code == 'C', 3, Code2), 
    
          # and for your AB values (or others)
          Code2 = ifelse(Code == 'AB', 1.5, Code2)
          ) %>%
       
       # we create a group, within which we want the highest-ranking Code
       # (column names are case-sensitive: Date, Priority, ID, Revenue)
       group_by(Date, Priority, ID, Revenue) %>%
    
       # then we use filter() to keep the highest ranking rows for each group
       # (the lowest Code2 value is the highest rank)
       filter(
          Code2 == min(Code2)
       )
    

    Without your data I can't test it, but this approach should work.
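
For reference, a sketch of the same idea run against the sample rows from the question. It assumes exact-match codes and keeps this answer's choice of ranking 'AB' at 1.5, which the question itself does not specify. The named vector `code_rank` is a hypothetical lookup standing in for the chain of ifelse() calls. Codes missing from the lookup get NA, so `min()` is called with `na.rm = TRUE` to stop a single unknown code from emptying a whole group.

```r
library(dplyr)

large_df_have <- data.frame(
  ID       = c(418, 418, 418),
  Date     = c("1/01/2020", "1/01/2020", "1/01/2020"),
  Priority = c(1, 1, 1),
  Revenue  = c(-866, -866, -866),
  Code     = c("A", "AB", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: lower number = higher rank
code_rank <- c(B = 1, AB = 1.5, A = 2, C = 3)

result <- large_df_have %>%
  # look each code up by name; unknown codes become NA
  mutate(Code2 = unname(code_rank[as.character(Code)])) %>%
  group_by(Date, Priority, ID, Revenue) %>%
  # rows with NA in Code2 are dropped by filter()
  filter(Code2 == min(Code2, na.rm = TRUE)) %>%
  ungroup()

result
```

If every code in a group is unknown, `min(..., na.rm = TRUE)` returns `Inf` with a warning and the group is dropped, so it may be worth checking for unknown codes first.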

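    【Comments】:

      【Solution 3】:

      Without more context it is hard to provide code that does exactly what you want.

      Based on your question, two options come to mind.

      Option 1. You want AB to be ranked the same as B (for example).

      Option 2. You want AB to be ranked differently from B (for example).

      Option 1 is obviously problematic, because when a group contains tied codes the row you end up with depends on the order in which the rows appear in the original dataset. Option 2 might be better if the Code column represents errors. For example, a system with ID 418 that has error codes A and B is worse off than one with error code B alone.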

      library(dplyr)
      
      df_have <- data.frame(ID   = c(418, 418, 418),
                            Date = c("1/01/2020", "1/01/2020","1/01/2020"), 
                            Priority = c(1, 1, 1), 
                            Revenue = c(-866, -866, -866), 
                            Code = c("A", "AB", "A"),
                            V1 = c("XX3", "XX2", "XX3"), 
                            V2 = c("XX1", "XX2", "XX1"),
                            V3 = c("XX3", "XX3", "XX3"))
      
      
      # Option 1. Rank AB the same as B (for example)
      df_want.1 <- df_have %>% 
        # add a numeric score based on the B > A > C ordering
        mutate(score = case_when(
          
          grepl("B", Code) ~ 3, 
          grepl("A", Code) ~ 2, 
          grepl("C", Code) ~ 1, 
          
        )) %>% 
        # group by Date, Priority, ID, Revenue (since you want the row with the highest code)
        group_by(Date, Priority, ID, Revenue) %>% 
        # only keep the row for the group which has the highest score (or highest code)
        filter(score == max(score)) %>% 
        # AB and B will both produce a score of 3, so we only keep one of the rows in the group
        distinct(Date, Priority, ID, Revenue, .keep_all = TRUE) %>% 
        ungroup()
        
      df_want.1
      
      # Option 2. Rank AB above B (for example)
      df_want.2 <- df_have %>% 
        # add a numeric score based on the B > A > C ordering
        mutate(score_b = if_else(grepl("B", Code), 3, 0), 
               score_a = if_else(grepl("A", Code), 2, 0), 
               score_c = if_else(grepl("C", Code), 1, 0)) %>% 
        # group by Date, Priority, ID, Revenue (since you want the row with the highest code)
        group_by(Date, Priority, ID, Revenue) %>% 
        # add each of the scores together 
        mutate(row_score = score_b + score_a + score_c) %>% 
        # only keep the row for the group which has the highest score (or highest code combination)
        filter(row_score == max(row_score)) %>% 
        # assuming it's possible to have the same score across the group, only keep first row in the group
        distinct(Date, Priority, ID, Revenue, .keep_all = TRUE) %>% 
        ungroup()
      
      df_want.2
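
The difference between the two options only shows up when a group contains both "B" and "AB", which the sample data does not. A small sketch on hypothetical rows (`df_tie` is made up for illustration) that computes both scores side by side:

```r
library(dplyr)

# Hypothetical group containing both "B" and "AB"
df_tie <- data.frame(
  ID       = c(418, 418),
  Date     = c("1/01/2020", "1/01/2020"),
  Priority = c(1, 1),
  Revenue  = c(-866, -866),
  Code     = c("B", "AB"),
  stringsAsFactors = FALSE
)

scored <- df_tie %>%
  mutate(
    # Option 1: the first matching letter decides, so B and AB tie
    score_opt1 = case_when(grepl("B", Code) ~ 3,
                           grepl("A", Code) ~ 2,
                           grepl("C", Code) ~ 1),
    # Option 2: the letter scores add up, so AB outranks B
    score_opt2 = 3 * grepl("B", Code) + 2 * grepl("A", Code) + 1 * grepl("C", Code)
  )

scored
```

Under Option 1 both rows score 3, so `distinct()` keeps whichever comes first. Under Option 2 "AB" scores 5 against 3 for "B" and is always the row kept.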
      

      【Comments】:

        【Solution 4】:

        You can create a factor and then use dplyr::distinct():

        library(dplyr)
        
        df_have <- data.frame(ID   = c(418, 418, 418),
                              Date = c("1/01/2020", "1/01/2020","1/01/2020"), 
                              Priority = c(1, 1, 1), 
                              Revenue = c(-866, -866, -866), 
                              Code = c("A", "AB", "A"),
                              V1 = c("XX3", "XX2", "XX3"), 
                              V2 = c("XX1", "XX2", "XX1"),
                              V3 = c("XX3", "XX3", "XX3"))
        
        # Hierarchy of combinations
        hierarchy <- c("B", "BA", "AB", "A", "C")
        
        # Create factor
        df_have %>%
          mutate(Code = factor(Code, levels = hierarchy)) %>%
          arrange(ID, Date, Priority, Revenue, Code) %>%
          distinct(ID, Date, Priority, Revenue, .keep_all = TRUE)
        #>    ID      Date Priority Revenue Code  V1  V2  V3
        #> 1 418 1/01/2020        1    -866   AB XX2 XX2 XX3
        

        Created on 2021-05-17 by the reprex package (v2.0.0)
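
One thing to watch with this approach: any code that is not listed in `hierarchy` becomes NA when converted to a factor, and `arrange()` sorts NA last. A minimal base R sketch of a check that could be run beforehand; the code "BC" is a hypothetical value used only for illustration:

```r
hierarchy <- c("B", "BA", "AB", "A", "C")

# "BC" is a hypothetical code that is not part of the hierarchy
codes <- c("A", "AB", "BC")

# codes that would silently turn into NA
unlisted <- setdiff(unique(codes), hierarchy)
unlisted

# the conversion itself: the unlisted code becomes NA
f <- factor(codes, levels = hierarchy)
is.na(f)
```

If `unlisted` is not empty, the hierarchy vector probably needs extending before the factor is built.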

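        【Comments】:

          【Solution 5】:

          I think there are still a few small points here that need clarifying, but, self-plagiarising from this older question: How to reclassify/replace values based on priority when there are repeats, I think using an ordered factor makes sense: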

          library(data.table)
          
          ## set the order (small to large)
          lev <- c("C","A","B")
          setDT(have)
          
          ## per row, the highest-ranking letter found anywhere in Code
          have[, ord := ordered(sapply(strsplit(Code, ""), 
                                function(x) max(ordered(x,levels=lev))), levels=lev)]
          ## .I turns the within-group position from which.max() into a row number of 'have'
          have[ have[, .I[which.max(ord)], by=.(ID, Date, Priority, Revenue)]$V1, ]
          
          #    ID      Date Priority Revenue Code  V1  V2  V3 ord
          #1: 418 1/01/2020        1    -866   AB XX2 XX2 XX3   B
          #2: 419 1/01/2020        1    -866    A XX3 XX1 XX3   A
          
          

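          This uses the following extended data with two groups (create `have` first, since the code above expects it to exist):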

          have <- read.table(text="
          ID      Date Priority Revenue Code  V1  V2  V3
          418 1/01/2020        1    -866    A XX3 XX1 XX3
          418 1/01/2020        1    -866   AB XX2 XX2 XX3
          418 1/01/2020        1    -866    A XX3 XX1 XX3
          419 1/01/2020        1    -866    A XX3 XX1 XX3
          419 1/01/2020        1    -866    A XX2 XX2 XX3
          419 1/01/2020        1    -866    C XX3 XX1 XX3
          ", header=TRUE, stringsAsFactors=FALSE)
          

          【Comments】:
