How to match negative and positive numbers within a group? Answers

【Question title】: How to match negative and positive numbers within a group?
【Posted】: 2020-12-22 07:56:28
【Question description】:

I have a data frame like the one below:

ItemNo    OrderAmount     Date
3845         320       2020-01-21
3245        -100       2020-01-20
4045        -200       2020-01-20
3845         300       2020-01-19
3845         320       2020-01-18
3245         100       2020-01-18
3645         230       2020-01-18
3645        -230       2020-01-18
3245        -100       2020-01-18
3845         320       2020-01-17
4045           0       2020-01-17
3845         320       2020-01-17
3845        -300       2020-01-17
3245         200       2020-01-17
3645         230       2020-01-16
4045         200       2020-01-15
3845        -300       2020-01-15
3245         100       2020-01-15
3245         100       2020-01-15
3845         320       2020-01-15
4045         240       2020-01-15
4045           0       2020-01-15

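I want to match negative and positive numbers within a group (ItemNo) and then remove the matched rows from the data frame. Rows where OrderAmount is 0 should be kept in the data frame. The output I want is: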

ItemNo    OrderAmount     Date
3845          320       2020-01-21
3845          320       2020-01-18
3245          100       2020-01-18
3645          230       2020-01-18
3845          320       2020-01-17
4045            0       2020-01-17
3845          320       2020-01-17
3845         -300       2020-01-17
3245          200       2020-01-17
3845          320       2020-01-15
4045          240       2020-01-15
4045            0       2020-01-15

I have tried using:

library(dplyr)  # provides %>%, group_by(), filter(), select(), desc()

# Data frame
DF <- data.frame(ItemNo=c(3845,3245,4045,3845,3845,3245,3645,3645,3245,3845,4045,3845,3845,3245,3645,4045,3845,3245,3245,3845,4045,4045),
                 OrderAmount = c(320,-100,-200,300,320,100,230,-230,-100,320,0,320,-300,200,230,200,-300,100,100,320,240,0),
                 Date = c("2020-01-21","2020-01-20","2020-01-20","2020-01-19","2020-01-18","2020-01-18","2020-01-18","2020-01-18","2020-01-18","2020-01-17","2020-01-17","2020-01-17","2020-01-17","2020-01-17","2020-01-16","2020-01-15","2020-01-15","2020-01-15","2020-01-15","2020-01-15","2020-01-15","2020-01-15"))
DF$Date <- as.Date(DF$Date)
# Order by date -> newest first
DF <- DF[order(desc(DF$Date)),]

DF %>%
  group_by(ItemNo, nvalue = abs(OrderAmount)) %>% 
  filter(!duplicated(OrderAmount)) %>%
  filter(sum(OrderAmount) >= 0) %>% 
  ungroup() %>%
  select(-nvalue)

For example, for ItemNo 3245 I want the output to be:

ItemNo    OrderAmount     Date
3245        -100       2020-01-20 # DELETE: matched with row 5
3245         100       2020-01-18
3245        -100       2020-01-18 # DELETE: matched with row 6
3245         200       2020-01-17
3245         100       2020-01-15 # DELETE: matched with row 1
3245         100       2020-01-15 # DELETE: matched with row 3

【Question discussion】:

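Please describe exactly what the matching pattern should be. For example, it is not clear to me why row 4 of the input data frame is removed. Also, please provide: stackoverflow.com/help/minimal-reproducible-example
Row 4 is matched with row 17 because it is in the same group and has the same OrderAmount with the opposite sign. OrderAmount values need to be matched one-to-one within the same group, with opposite signs.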
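
The one-to-one rule described in the comment above can be sketched for a single group in base R. This is only an illustration, not code from the thread: `keep_unmatched` is a hypothetical helper name, and the sketch assumes the amounts are passed oldest first, so that the oldest values are matched first.

```r
# Hypothetical helper: TRUE for the values that survive one-to-one cancellation
keep_unmatched <- function(x) {
  # k-th occurrence of each exact value within the group (1, 2, 3, ...)
  k <- ave(seq_along(x), x, FUN = seq_along)
  # number of times the opposite value occurs in the group
  n_opposite <- vapply(x, function(v) sum(x == -v), integer(1))
  # the k-th +a is matched only if a k-th -a exists; zeros are always kept
  x == 0 | k > n_opposite
}

# OrderAmount values of ItemNo 3245, oldest first
amounts <- c(100, 100, 200, 100, -100, -100)
keep_unmatched(amounts)
#> [1] FALSE FALSE  TRUE  TRUE FALSE FALSE
```

Only the 200 and the third 100 (the row dated 2020-01-18) survive, which agrees with the example for ItemNo 3245 in the question.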

Tags: r dataframe dplyr

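【Solution 1】:

This is a tricky algorithm to solve without iteratively building up or trimming down a data frame, which would be slow. Here is one possible solution, though it is fairly complex: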

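balanced_table <- function(d) {
  # Sort oldest first, so the rows kept at the end are the oldest ones
  d <- d[order(d$Date),]
  x <- d$OrderAmount
  # Count each amount; levels are +a..., 0, -a... in matching order
  x <- factor(x, c(unique(abs(x[x != 0])), 0, -unique(abs(x[x != 0]))))
  x <- table(x)
  neg <- -x[as.numeric(names(x)) < 0]
  pos <- x[as.numeric(names(x)) > 0]
  names(neg) <- -as.numeric(names(neg))
  # Net count per absolute amount: > 0 is a surplus of +a, < 0 a surplus of -a
  totals <- pos + neg
  final <- c(totals, x[as.numeric(names(x)) == 0])
  # Where negatives are in surplus, flip the sign of both the name and the count
  names(final)[final < 0] <- -as.numeric(names(final)[final < 0])
  final[final < 0] <- -final[final < 0]
  # Expand the surviving counts into one value per row to keep
  res <- tidyr::uncount(as.data.frame(as.table(final[final != 0])), Freq)
  vals <- as.numeric(as.character(res$Var1))
  # For each surviving amount, keep the first (oldest) rows with that amount
  do.call(rbind, lapply(split(vals, vals), function(v) {
    d[which(d$OrderAmount == v[1])[seq_along(v)],]
  }))
}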

`rownames<-`(do.call(rbind, lapply(split(df, df$ItemNo), balanced_table)), NULL)
#>    ItemNo OrderAmount       Date
#> 1    3245         100 2020-01-15
#> 2    3245         200 2020-01-17
#> 3    3645         230 2020-01-16
#> 4    3845        -300 2020-01-15
#> 5    3845         320 2020-01-15
#> 6    3845         320 2020-01-17
#> 7    3845         320 2020-01-17
#> 8    3845         320 2020-01-18
#> 9    3845         320 2020-01-21
#> 10   4045           0 2020-01-15
#> 11   4045           0 2020-01-17
#> 12   4045         240 2020-01-15
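
The counting idea behind `balanced_table()` can be seen on a single item with a much shorter base R sketch. This is an illustration rather than part of the answer: it nets the signs per absolute amount with `tapply()`, and the vector below is simply the OrderAmount column for ItemNo 3845 copied from the data.

```r
# OrderAmount values of ItemNo 3845
x <- c(320, 300, 320, 320, 320, -300, -300, 320)

# Net count per absolute amount: +1 for each positive, -1 for each negative
net <- tapply(sign(x), abs(x), sum)
net
#> 300 320 
#>  -1   5 
```

A net count of -1 for 300 means one -300 row is left unmatched, and 5 for 320 means five 320 rows remain, which agrees with the rows shown for ItemNo 3845 in the output above.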

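Or, keeping the original row order so that the result lines up with the expected output in your question (the same ItemNo and OrderAmount values survive, though the function keeps the oldest rows, so some dates differ):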

df$row <- seq(nrow(df))
df2 <- do.call(rbind, lapply(split(df, df$ItemNo), balanced_table))
`rownames<-`(df2[order(df2$row), names(df2) != "row"], NULL)
#>    ItemNo OrderAmount       Date
#> 1    3845         320 2020-01-21
#> 2    3845         320 2020-01-18
#> 3    3845         320 2020-01-17
#> 4    4045           0 2020-01-17
#> 5    3845         320 2020-01-17
#> 6    3245         200 2020-01-17
#> 7    3645         230 2020-01-16
#> 8    3845        -300 2020-01-15
#> 9    3245         100 2020-01-15
#> 10   3845         320 2020-01-15
#> 11   4045         240 2020-01-15
#> 12   4045           0 2020-01-15

Data

df <- structure(list(ItemNo = c(3845L, 3245L, 4045L, 3845L, 3845L, 
3245L, 3645L, 3645L, 3245L, 3845L, 4045L, 3845L, 3845L, 3245L, 
3645L, 4045L, 3845L, 3245L, 3245L, 3845L, 4045L, 4045L), OrderAmount = c(320L, 
-100L, -200L, 300L, 320L, 100L, 230L, -230L, -100L, 320L, 0L, 
320L, -300L, 200L, 230L, 200L, -300L, 100L, 100L, 320L, 240L, 
0L), Date = structure(c(18282, 18281, 18281, 18280, 18279, 18279, 
18279, 18279, 18279, 18278, 18278, 18278, 18278, 18278, 18277, 
18276, 18276, 18276, 18276, 18276, 18276, 18276), class = "Date")), 
row.names = c(NA, -22L), class = "data.frame")

df
#>    ItemNo OrderAmount       Date
#> 1    3845         320 2020-01-21
#> 2    3245        -100 2020-01-20
#> 3    4045        -200 2020-01-20
#> 4    3845         300 2020-01-19
#> 5    3845         320 2020-01-18
#> 6    3245         100 2020-01-18
#> 7    3645         230 2020-01-18
#> 8    3645        -230 2020-01-18
#> 9    3245        -100 2020-01-18
#> 10   3845         320 2020-01-17
#> 11   4045           0 2020-01-17
#> 12   3845         320 2020-01-17
#> 13   3845        -300 2020-01-17
#> 14   3245         200 2020-01-17
#> 15   3645         230 2020-01-16
#> 16   4045         200 2020-01-15
#> 17   3845        -300 2020-01-15
#> 18   3245         100 2020-01-15
#> 19   3245         100 2020-01-15
#> 20   3845         320 2020-01-15
#> 21   4045         240 2020-01-15
#> 22   4045           0 2020-01-15

【Discussion】:

Thanks, this seems to work too. I found a less complicated approach and posted it as an answer.

【Solution 2】:

I tried this, and it seems to work:

DF <- data.frame(ItemNo=c(3845,3245,4045,3845,3845,3245,3645,3645,3245,3845,4045,3845,3845,3245,3645,4045,3845,3245,3245,3845,4045,4045),
                 OrderAmount = c(320,-100,-200,300,320,100,230,-230,-100,320,0,320,-300,200,230,200,-300,100,100,320,240,0),
                 Date = c("2020-01-21","2020-01-20","2020-01-20","2020-01-19","2020-01-18","2020-01-18","2020-01-18","2020-01-18","2020-01-18","2020-01-17","2020-01-17","2020-01-17","2020-01-17","2020-01-17","2020-01-16","2020-01-15","2020-01-15","2020-01-15","2020-01-15","2020-01-15","2020-01-15","2020-01-15"))

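library(dplyr)  # provides %>%, group_by(), mutate(), filter(), select()

# ISO-formatted date strings sort chronologically, so oldest rows come first
DF$Date <- as.character(DF$Date)
DF <- DF[order(DF$Date),]
DF$rowNumber <- 1:nrow(DF)
# Number the k-th occurrence of each exact amount per item, then pair the
# k-th +a with the k-th -a: groups with more than one row are matched pairs
DF_TEMP <- DF %>%
    group_by(ItemNo, OrderAmount) %>% 
    mutate(id_row = row_number()) %>%
    group_by(ItemNo, id_row, ab = abs(OrderAmount)) %>%
    filter(n() > 1) %>% ungroup() %>%
    select(-id_row, -ab)
# Drop the matched rows; zeros never form a pair, so they are kept
DF <- DF[!(DF$rowNumber %in% DF_TEMP$rowNumber),]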
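
One way to sanity-check a result like this is to confirm that no ItemNo still contains both an amount and its negation, since any such pair could have been cancelled. The helper below is a hypothetical sketch in base R, not part of the answer, and the tiny `before`/`after` data frames are made up for illustration. It only checks that matching is complete, not that too many rows were removed.

```r
# Hypothetical helper: TRUE if any ItemNo still holds a value and its negation
has_opposite_pair <- function(d) {
  per_item <- split(d$OrderAmount, d$ItemNo)
  any(vapply(per_item, function(x) {
    nz <- x[x != 0]      # zeros are exempt from matching
    any(-nz %in% nz)
  }, logical(1)))
}

# Made-up illustration data
before <- data.frame(ItemNo = c(1, 1, 1, 2), OrderAmount = c(100, -100, 0, 50))
after  <- before[c(3, 4), ]   # the +100 / -100 pair removed

has_opposite_pair(before)  # TRUE
has_opposite_pair(after)   # FALSE
```

Applied to the filtered `DF` above, `has_opposite_pair(DF)` would be expected to return FALSE if every possible pair has been removed.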

【Discussion】: