Removing rows of data based on multiple conditions: answers

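【Question title】: Removing rows of data based on multiple conditions
【Posted】: 2021-05-17 05:53:46
【Question description】:

Below are my current and desired datasets. When Date, Priority, ID and Revenue are the same but Code differs, I only want to keep the row with the "highest" code.

The hierarchy of the codes is B > A > C. If a code contains a B anywhere in the string, it is assigned to level 1 of the hierarchy.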

large_df_have
   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3


large_df_want
   ID      Date Priority Revenue Code  V1  V2  V3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3

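【Comments on the question】:

I understand B > A > C, but how does AB fit into that hierarchy?
I'll update the question to make it clearer.
What happens if you have both 'AB' and 'B'? Which one takes priority?

Tags: r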

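【Solution 1】:

This will do it.

Create a dummy column that encodes the hierarchy of the codes according to the given conditions.
Then keep only the rows with the highest priority within each group.
Drop the dummy column if you don't need it, using select(-...).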

large_df_have <- read.table(text = '   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3', header = T)

library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
  mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
                                   str_detect(Code, 'A') ~ 2,
                                   str_detect(Code, 'C') ~ 3,
                                   TRUE ~ 4)) %>%
  filter(priority_code == min(priority_code))
#> # A tibble: 1 x 9
#> # Groups:   ID, Date, Priority, Revenue [1]
#>      ID Date      Priority Revenue Code  V1    V2    V3    priority_code
#>   <int> <chr>        <int>   <int> <chr> <chr> <chr> <chr>         <dbl>
#> 1   418 1/01/2020        1    -866 AB    XX2   XX2   XX3               1

Checking a more complex case:

large_df_have <- read.table(text = '   ID      Date Priority Revenue Code  V1  V2  V3
1 418 1/01/2020        1    -866    A XX3 XX1 XX3
2 418 1/01/2020        1    -866   AB XX2 XX2 XX3
3 418 1/01/2020        1    -866    A XX3 XX1 XX3
4 419 1/01/2020        1    -866    C XX3 XX1 XX3
5 420 1/01/2020        1    -866    A XX3 XX1 XX3
6 420 1/01/2020        1    -866    C XX3 XX1 XX3', header = T)

library(tidyverse)
large_df_have %>% group_by(ID, Date, Priority, Revenue) %>%
  mutate(priority_code = case_when(str_detect(Code, 'B') ~ 1,
                                   str_detect(Code, 'A') ~ 2,
                                   str_detect(Code, 'C') ~ 3,
                                   TRUE ~ 4)) %>%
  filter(priority_code == min(priority_code))
#> # A tibble: 3 x 9
#> # Groups:   ID, Date, Priority, Revenue [3]
#>      ID Date      Priority Revenue Code  V1    V2    V3    priority_code
#>   <int> <chr>        <int>   <int> <chr> <chr> <chr> <chr>         <dbl>
#> 1   418 1/01/2020        1    -866 AB    XX2   XX2   XX3               1
#> 2   419 1/01/2020        1    -866 C     XX3   XX1   XX3               3
#> 3   420 1/01/2020        1    -866 A     XX3   XX1   XX3               2

Created on 2021-05-17 by the reprex package (v2.0.0)
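The third step in the list above (dropping the dummy column) isn't shown in the output. Here is a minimal sketch of the full pipeline with that step added, assuming the same six-row example data. The name `large_df_want` is taken from the question, and `ungroup()` is my addition so the result is a plain ungrouped tibble.

```r
library(dplyr)
library(stringr)

# Same example data as above, without the row-number column
large_df_have <- read.table(text = 'ID Date Priority Revenue Code V1 V2 V3
418 1/01/2020 1 -866 A XX3 XX1 XX3
418 1/01/2020 1 -866 AB XX2 XX2 XX3
418 1/01/2020 1 -866 A XX3 XX1 XX3
419 1/01/2020 1 -866 C XX3 XX1 XX3
420 1/01/2020 1 -866 A XX3 XX1 XX3
420 1/01/2020 1 -866 C XX3 XX1 XX3', header = TRUE, stringsAsFactors = FALSE)

large_df_want <- large_df_have %>%
  group_by(ID, Date, Priority, Revenue) %>%
  # 1 = contains B, 2 = contains A, 3 = contains C, 4 = anything else
  mutate(priority_code = case_when(str_detect(Code, "B") ~ 1,
                                   str_detect(Code, "A") ~ 2,
                                   str_detect(Code, "C") ~ 3,
                                   TRUE ~ 4)) %>%
  # keep the best-ranked rows of each group (ties are all kept)
  filter(priority_code == min(priority_code)) %>%
  ungroup() %>%
  # drop the helper column again
  select(-priority_code)

large_df_want
```

Note that if two rows in a group tie for the best rank, both are kept. Add a `distinct()` or `slice(1)` step if you need exactly one row per group.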


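【Solution 2】:

Depending on how many Code values you have, you could create an extra numeric column that reflects the hierarchy (in case the non-alphabetical order gets confusing). If you're not familiar with it, have a look at ?ifelse, but the syntax is ifelse(test, yes, no): where test is TRUE, the returned value is the one given by yes, otherwise it is the one given by no.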

library(dplyr)

large_df_have %>%
   mutate(
      Code2 = ifelse(Code == 'B', 1, NA), # for the first time we make the Code2 column, 'no' values need to be NA 
      Code2 = ifelse(Code == 'A', 2, Code2), # rather than the 'no' result being NA as above, we keep any pre-existing values, eg the ones we just made in the line above
      Code2 = ifelse(Code == 'C', 3, Code2), 

      # and for your AB values (or others)
      Code2 = ifelse(Code == 'AB', 1.5, Code2)
      ) %>%
   
   # we create a group, of which we want the highest value of Code
   # (column names must match the data: Date, Priority and Revenue are capitalised)
   group_by(Date, Priority, ID, Revenue) %>%

   # then we use filter() to keep the highest ranking rows for each group
   filter(
      Code2 == min(Code2)
   )

I can't test this without your data, but the approach should work.
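As a variation on the chained ifelse() calls, the same ranking can be kept in a named lookup vector. This is only a sketch: `code_rank` and `result` are names I made up, the ranks are the ones used above, and the data is the three-row example from the question. Codes missing from the lookup come back as NA, so `min()` is called with `na.rm = TRUE` here (a group made up entirely of unknown codes would still be dropped, with a warning).

```r
library(dplyr)

df <- data.frame(ID = c(418, 418, 418),
                 Date = c("1/01/2020", "1/01/2020", "1/01/2020"),
                 Priority = c(1, 1, 1),
                 Revenue = c(-866, -866, -866),
                 Code = c("A", "AB", "A"),
                 stringsAsFactors = FALSE)

# hypothetical lookup: lower number = higher in the hierarchy
code_rank <- c(B = 1, AB = 1.5, A = 2, C = 3)

result <- df %>%
  # index by name; as.character() guards against Code being a factor
  mutate(Code2 = unname(code_rank[as.character(Code)])) %>%
  group_by(ID, Date, Priority, Revenue) %>%
  filter(Code2 == min(Code2, na.rm = TRUE)) %>%
  ungroup()

result
```

Adding another code then only means adding one entry to `code_rank`.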


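【Solution 3】:

Without more context it's hard to give code that does exactly what you need.

Going by your question, two options come to mind.

Option 1. You want AB to rank the same as B (for example).

Option 2. You want AB to rank differently from B (for example).

Option 1 is obviously problematic, because the row you end up keeping depends on the order in which the rows appear in the original dataset. Option 2 might be better if the Code column represents errors. For example, if the system with ID 418 has error codes A and B, that is worse than having error code B alone.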

library(dplyr)

df_have <- data.frame(ID   = c(418, 418, 418),
                      Date = c("1/01/2020", "1/01/2020","1/01/2020"), 
                      Priority = c(1, 1, 1), 
                      Revenue = c(-866, -866, -866), 
                      Code = c("A", "AB", "A"),
                      V1 = c("XX3", "XX2", "XX3"), 
                      V2 = c("XX1", "XX2", "XX1"),
                      V3 = c("XX3", "XX3", "XX3"))


# Option 1. Rank AB the same as B (for example)
df_want.1 <- df_have %>% 
  # add a numeric score based on the B > A > C ordering
  mutate(score = case_when(
    grepl("B", Code) ~ 3, 
    grepl("A", Code) ~ 2, 
    grepl("C", Code) ~ 1
  )) %>% 
  # group by Date, Priority, ID, Revenue (since you want the row with the highest code)
  group_by(Date, Priority, ID, Revenue) %>% 
  # only keep the row for the group which has the highest score (or highest code)
  filter(score == max(score)) %>% 
  # AB and B will both produce a score of 3, so we only keep one of the rows in the group
  distinct(Date, Priority, ID, Revenue, .keep_all = TRUE) %>% 
  ungroup()
  
df_want.1

# Option 2. Rank AB above B (for example)
df_want.2 <- df_have %>% 
  # add a numeric score based on the B > A > C ordering
  mutate(score_b = if_else(grepl("B", Code), 3, 0), 
         score_a = if_else(grepl("A", Code), 2, 0), 
         score_c = if_else(grepl("C", Code), 1, 0)) %>% 
  # group by Date, Priority, ID, Revenue (since you want the row with the highest code)
  group_by(Date, Priority, ID, Revenue) %>% 
  # add each of the scores together 
  mutate(row_score = score_b + score_a + score_c) %>% 
  # only keep the row for the group which has the highest score (or highest code combination)
  filter(row_score == max(row_score)) %>% 
  # assuming it's possible to have the same score across the group, only keep first row in the group
  distinct(Date, Priority, ID, Revenue, .keep_all = TRUE) %>% 
  ungroup()

df_want.2
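To see how the two options differ, here is a small base R sketch that scores a handful of codes both ways. The code strings "ABC" and "AC" are made up for illustration and don't come from the question's data.

```r
codes <- c("C", "A", "B", "AB", "ABC", "AC")

# Option 1: the first matching letter in B, A, C order decides the score
score_option1 <- ifelse(grepl("B", codes), 3,
                 ifelse(grepl("A", codes), 2,
                 ifelse(grepl("C", codes), 1, NA)))

# Option 2: every letter present adds its own weight (B = 3, A = 2, C = 1)
score_option2 <- 3 * grepl("B", codes) + 2 * grepl("A", codes) + 1 * grepl("C", codes)

comparison <- data.frame(Code = codes,
                         option1 = score_option1,
                         option2 = score_option2)
comparison
#>   Code option1 option2
#> 1    C       1       1
#> 2    A       2       2
#> 3    B       3       3
#> 4   AB       3       5
#> 5  ABC       3       6
#> 6   AC       2       3
```

One thing to watch with the summed score: "AC" scores 3, the same as a plain "B". If that tie isn't what you want, the weights need to be spaced further apart (for example 4, 2, 1).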


【Solution 4】:

You could create a factor and then use dplyr::distinct():

library(dplyr)

df_have <- data.frame(ID   = c(418, 418, 418),
                      Date = c("1/01/2020", "1/01/2020","1/01/2020"), 
                      Priority = c(1, 1, 1), 
                      Revenue = c(-866, -866, -866), 
                      Code = c("A", "AB", "A"),
                      V1 = c("XX3", "XX2", "XX3"), 
                      V2 = c("XX1", "XX2", "XX1"),
                      V3 = c("XX3", "XX3", "XX3"))

# Hierarchy of combinations
hierarchy <- c("B", "BA", "AB", "A", "C")

# Create factor
df_have %>%
  mutate(Code = factor(Code, levels = hierarchy)) %>%
  arrange(ID, Date, Priority, Revenue, Code) %>%
  distinct(ID, Date, Priority, Revenue, .keep_all = TRUE)
#>    ID      Date Priority Revenue Code  V1  V2  V3
#> 1 418 1/01/2020        1    -866   AB XX2 XX2 XX3

Created on 2021-05-17 by the reprex package (v2.0.0)
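One caveat with the factor approach: any code that isn't listed in `hierarchy` silently becomes NA and sorts last. A quick base R check before building the factor might look like this (the code "AC" is a made-up example of an unlisted combination, and `unlisted` / `code_factor` are my own names):

```r
hierarchy <- c("B", "BA", "AB", "A", "C")
codes <- c("A", "AB", "AC", "C")

# codes present in the data but missing from the hierarchy
unlisted <- setdiff(unique(codes), hierarchy)
unlisted
#> [1] "AC"

# values outside `levels` turn into NA ...
code_factor <- factor(codes, levels = hierarchy)
is.na(code_factor)
#> [1] FALSE FALSE  TRUE FALSE

# ... and NA is placed last when ordering
order(code_factor)
#> [1] 2 1 4 3
```

If `unlisted` isn't empty, extend `hierarchy` before running the pipeline.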


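【Solution 5】:

I think a few small points still need clarifying here, but, self-plagiarising from this old question: How to reclassify/replace values based on priority when there are repeats, I think using an ordered factor makes sense: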

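library(data.table)

## set the order (small to large)
lev <- c("C","A","B")
setDT(have)

## ord = the highest-ranked letter found in each Code string
have[, ord := ordered(sapply(strsplit(Code, ""), 
                      function(x) max(ordered(x,levels=lev))), levels=lev)]
## .I gives row numbers in the whole table; a bare which.max(ord) would only
## give the position within each group and so pick the wrong rows
have[ have[, .I[which.max(ord)], by=.(ID, Date, Priority, Revenue)]$V1, ]

#    ID      Date Priority Revenue Code  V1  V2  V3 ord
#1: 418 1/01/2020        1    -866   AB XX2 XX2 XX3   B
#2: 419 1/01/2020        1    -866    A XX3 XX1 XX3   A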

Using this extended data with two groups:

have <- read.table(text="
ID      Date Priority Revenue Code  V1  V2  V3
418 1/01/2020        1    -866    A XX3 XX1 XX3
418 1/01/2020        1    -866   AB XX2 XX2 XX3
418 1/01/2020        1    -866    A XX3 XX1 XX3
419 1/01/2020        1    -866    A XX3 XX1 XX3
419 1/01/2020        1    -866    A XX2 XX2 XX3
419 1/01/2020        1    -866    C XX3 XX1 XX3
", header=TRUE, stringsAsFactors=FALSE)
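To make the strsplit() / ordered() step easier to follow, here is a base R sketch of what it does to individual code strings. The code "CA" is a made-up example, and `highest_letter` and `best` are my own names.

```r
lev <- c("C", "A", "B")            # smallest to largest
codes <- c("A", "AB", "C", "CA")

# split each code into letters and keep the highest-ranked letter
highest_letter <- sapply(strsplit(codes, ""),
                         function(x) as.character(max(ordered(x, levels = lev))))
highest_letter
#> [1] "A" "B" "C" "A"

# as an ordered factor, the codes can be compared directly
ord <- ordered(highest_letter, levels = lev)
best <- which.max(as.integer(ord))  # position of the first best code
codes[best]
#> [1] "AB"
```

So "AB" is ranked by its B and "CA" by its A, which matches the rule that any code containing a B goes to the top of the hierarchy.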
