【Question title】: How to get last N rows of each group in dplyr with missing rows as 0?
【Posted】: 2020-05-07 05:33:37
【Question description】:

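I have a data frame with columns such as ID, category, timestamp, Quantity, and price. I want to group the data by ID and category, take the last 3 values of Quantity and price for each group, and then pivot the table to wide format.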

library(dplyr)
dummy <- data.frame("ID" = c(1,1,2,2,3),
                    "category"=c("A","A", "B", "A", "C"),
                    "timestamp"=as.Date(c("2020-04-05", "2020-04-10", "2020-03-01", "2020-01-01", "2020-01-10")),
                    "Quantity"=c(1,5,6,7,4),
                    "price"=c(10.2, 45.6, 70.3, 23.4, 10))
> dummy
  ID category  timestamp Quantity price
1  1        A 2020-04-05        1  10.2
2  1        A 2020-04-10        5  45.6
3  2        B 2020-03-01        6  70.3
4  2        A 2020-01-01        7  23.4
5  3        C 2020-01-10        4  10.0

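I want to select the last 3 rows for each customer-category combination. If only one or two rows are present, the missing rows should be filled with 0.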

dummy2 <- data.frame("ID" = c(1,2,2,3),"category" = c("A","B", "A", "C"),
                     "Quantity1" = c(0,0,0,0),"Quantity2" = c(1,0,0,0),"Quantity3" = c(5,6,7,4),
                     "price1" = c(0,0,0,0),"price2" = c(10.2,0,0,0),"price3" = c(45.6, 70.3, 23.4, 10.0))

> dummy2
  ID category Quantity1 Quantity2 Quantity3 price1 price2 price3
1  1        A         0         1         5      0   10.2   45.6
2  2        B         0         0         6      0    0.0   70.3
3  2        A         0         0         7      0    0.0   23.4
4  3        C         0         0         4      0    0.0   10.0

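Here Quantity1, Quantity2, Quantity3 hold the (last-2, last-1, last) row values for each ID x category combination. I tried dummy %>% group_by(ID, category) %>% dplyr::top_n(-3, wt = timestamp) %>% select(Quantity, price), but I don't know what to do after that. Please suggest a solution.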

【Comments】:

    Tags: r dataframe dplyr


    【Solution 1】:

    Here is one way:

    library(dplyr)
    library(tidyr)
    
    dummy %>%
      group_by(ID, category) %>%
      #Get top 3 timestamp values
      top_n(3, timestamp) %>%
      select(-timestamp) %>%
      mutate(row = rev(3 - row_number() + 1)) %>%
      complete(row = 1:3, fill = list(Quantity = 0, price = 0)) %>%
      pivot_wider(names_from = row, values_from = c(Quantity, price))
    
    
    #     ID category Quantity_1 Quantity_2 Quantity_3 price_1 price_2 price_3
    #  <dbl> <chr>         <dbl>      <dbl>      <dbl>   <dbl>   <dbl>   <dbl>
    #1     1 A                 0          1          5       0    10.2    45.6
    #2     2 A                 0          0          7       0     0      23.4
    #3     2 B                 0          0          6       0     0      70.3
    #4     3 C                 0          0          4       0     0      10  
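
The pipeline above numbers the slots correctly only if the rows of each group are already in ascending timestamp order, and top_n() keeps ties, so a group can return more than 3 rows. A minimal sketch of a variant that sorts explicitly and caps each group at N rows, assuming dplyr >= 1.0.0 (for slice_max()); the name n_last is hypothetical and not part of the original answer:

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question
dummy <- data.frame(
  ID = c(1, 1, 2, 2, 3),
  category = c("A", "A", "B", "A", "C"),
  timestamp = as.Date(c("2020-04-05", "2020-04-10", "2020-03-01",
                        "2020-01-01", "2020-01-10")),
  Quantity = c(1, 5, 6, 7, 4),
  price = c(10.2, 45.6, 70.3, 23.4, 10)
)

n_last <- 3  # hypothetical name for N, the number of rows kept per group

wide <- dummy %>%
  group_by(ID, category) %>%
  # Keep at most n_last rows per group, even when timestamps are tied
  slice_max(timestamp, n = n_last, with_ties = FALSE) %>%
  # Sort explicitly so slot numbering does not depend on input order
  arrange(timestamp, .by_group = TRUE) %>%
  # Newest row gets slot n_last, the one before it n_last - 1, and so on
  mutate(row = n_last - n() + row_number()) %>%
  select(-timestamp) %>%
  # Add the missing slots and fill them with 0
  complete(row = 1:n_last, fill = list(Quantity = 0, price = 0)) %>%
  ungroup() %>%
  pivot_wider(names_from = row, values_from = c(Quantity, price))

wide
```

Changing n_last should be enough to get a different N, since the slot numbering and the complete() range both derive from it.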
    

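    【Comments】:

    • Thanks. This works for smaller data; is there an efficient solution for 1m rows? Can I use sparkR to do the same thing?
    • You are selecting only 3 rows per ID-category combination. How many rows do you have after dummy %>% group_by(ID, category) %>% top_n(3, timestamp)?
    • Less than or equal to 3 * length(unique(customer_id * category))
    • What is the output of dummy %>% group_by(ID, category) %>% top_n(3, timestamp) %>% nrow?
    • I am testing on 100K rows of data, so if I run the above command I get 110782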
    【Solution 2】:

    It would be better to have a group with more than 3 rows in the example.

    I created a dummy group with 5 rows.

    library(data.table)
    dummy <- data.frame("ID" = c(1,1,2,2,3),
                        "category"=c("A","A", "B", "A", "C"),
                        "timestamp"=as.Date(c("2020-04-05", "2020-04-10", "2020-03-01", "2020-01-01", "2020-01-10")),
                        "Quantity"=c(1,5,6,7,4),
                        "price"=c(10.2, 45.6, 70.3, 23.4, 10))
    
    
    dummy2 <- data.frame("ID" = c(4,4,4,4,4),
                        "category"=c("A","A", "A", "A", "A"),
                        "timestamp"=as.Date(c("2020-04-05", "2020-04-10", "2020-03-01", "2020-01-01", "2020-01-10")),
                        "Quantity"=c(1,5,6,7,4),
                        "price"=c(10.2, 45.6, 70.3, 23.4, 10))
    
    
    dt <- rbindlist(list(dummy,dummy2))
    setorder(dt,ID,category,-timestamp)[,grp:=paste0(ID,category)]
    
    result <- dcast(dt[dt[,head(.I,3),by=.(grp)]$V1],ID+category~4-rowid(grp),value.var = c("Quantity","price"))
    
    #replace NA as 0
    # looks like you really care about performance so i am going to use set
    for (j in seq_len(ncol(result))){
      set(result,which(is.na(result[[j]])),j,0)
    }
    
    result
    #>    ID category Quantity_1 Quantity_2 Quantity_3 price_1 price_2 price_3
    #> 1:  1        A          0          1          5     0.0    10.2    45.6
    #> 2:  2        A          0          0          7     0.0     0.0    23.4
    #> 3:  2        B          0          0          6     0.0     0.0    70.3
    #> 4:  3        C          0          0          4     0.0     0.0    10.0
    #> 5:  4        A          6          1          5    70.3    10.2    45.6
    

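    Created on 2020-05-07 by the reprex package (v0.3.0)

    【Comments】:

    • I have to consider all groups, even if they have only 1 row. Also, this solution looks more computationally expensive.
    • As long as you have one group with >= 3 rows, this solution will work.
    • You can add a dummy 3-row group to your dataset. Although all of your groups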
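
The comments above point out that the dcast() approach only produces all three columns when at least one group has 3 or more rows. A minimal sketch of one way around that, assuming it is acceptable to pad each group with zeros in long form before casting; the names n_last, long, and slot are hypothetical and not part of the original answer:

```r
library(data.table)

# Same sample data as in the question; note that no group has 3 rows
dt <- data.table(
  ID = c(1, 1, 2, 2, 3),
  category = c("A", "A", "B", "A", "C"),
  timestamp = as.Date(c("2020-04-05", "2020-04-10", "2020-03-01",
                        "2020-01-01", "2020-01-10")),
  Quantity = c(1, 5, 6, 7, 4),
  price = c(10.2, 45.6, 70.3, 23.4, 10)
)

n_last <- 3  # hypothetical name for N, the number of rows kept per group

# Within each group (rows in ascending timestamp order), left-pad the values
# with n_last zeros and keep the last n_last entries, so every group ends up
# with exactly n_last slots regardless of how many rows it had.
long <- dt[order(timestamp),
           .(slot = seq_len(n_last),
             Quantity = tail(c(rep(0, n_last), Quantity), n_last),
             price = tail(c(rep(0, n_last), price), n_last)),
           by = .(ID, category)]

# Every slot exists in every group, so no dummy group or NA replacement is needed
result <- dcast(long, ID + category ~ slot, value.var = c("Quantity", "price"))

result
```

Because the padding happens per group, single-row groups are kept and no artificial group has to be added to the data.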