【Question title】: Data frame in R: converting a table into a predefined structure [duplicate]
【Posted】: 2023-03-03 18:28:01
【Question description】:

I have a data-wrangling problem in R. I have a data frame like this:

        CardID       Date Amount ItemNumber    ItemCode
1  C0100000111 2001-07-19 449.00          1 I0000000808
2  C0100000111 2001-02-20   9.99          1 I0000000622
3  C0100000111 2001-04-27  49.99          1 I0000000284
4  C0100000111 2001-02-20  69.00          1 I0000000488
5  C0100000111 2001-05-17 299.00          1 I0000000595
6  C0100000111 2001-05-19   5.99          1 I0000000078
7  C0100000199 2001-08-20 229.00          1 I0000000783
8  C0100000199 2001-12-29 229.00          1 I0000000783
9  C0100000199 2001-06-28 139.00          1 I0000000537
10 C0100000343 2001-09-07  99.00          1 I0000000532

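I want to convert it into a structure with these columns:

CardID, FirstPurchaseDate, LastPurchaseDate, NumberOrders, NumberSKUs, TotalAmounts

Each CardID should appear in exactly one row of the new table. How can I do this?

Based on the table above, I expect output like this: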

> Ex
       CardID FirstPurchaseDate LastPurchaseDate NumberOrders NumberSKUs TotalAmounts
1 C0100000111        2001-02-20       2001-07-19            6          6       882.97
2 C0100000199        2001-06-28       2001-12-29            3          2       597.00
3 C0100000343        2001-09-07       2001-09-07            1          1        99.00

【Question discussion】:

Tags: r dataframe


【Solution 1】:

We can group by 'CardID' and then use summarise from dplyr:

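library(dplyr)

# Assumes Date is a Date column (or ISO "YYYY-MM-DD" strings, which sort the same way)
df1 %>%
    group_by(CardID) %>%
    summarise(FirstPurchaseDate = min(Date),      # earliest date; first() would depend on row order
              LastPurchaseDate = max(Date),       # latest date; last() would depend on row order
              NumberOrders = n(),                 # number of rows per card
              NumberSKUs = n_distinct(ItemCode),  # distinct item codes per card
              TotalAmounts = sum(Amount))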
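
One detail worth checking here: dplyr's `first()` and `last()` pick values by row position, while `min()` and `max()` pick by value, so the two agree only when the rows of each card are already sorted by date. A minimal sketch on a three-row subset of the question's data (the names `demo`, `ByPosition` and `ByValue` are only for illustration):

```r
library(dplyr)

# Three rows for one card, in the same (unsorted) order as in the question
demo <- data.frame(
  CardID = "C0100000111",
  Date   = as.Date(c("2001-07-19", "2001-02-20", "2001-04-27"))
)

res <- demo %>%
  group_by(CardID) %>%
  summarise(
    ByPosition = first(Date),  # value in the first stored row: 2001-07-19
    ByValue    = min(Date)     # earliest date: 2001-02-20
  )
res
```

Sorting with `arrange(CardID, Date)` before grouping would make the position-based version give the same answer, at the cost of an extra pass over the data.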

【Discussion】:

    【Solution 2】:

    Here is a data.table version:

    library(data.table)
    
    dt <- data.frame(
      CardID = c("C0100000111", "C0100000111", "C0100000111", "C0100000111", "C0100000111", "C0100000111", "C0100000199", "C0100000199", "C0100000199", "C0100000343"),
      Date = as.Date(c("2001-07-19", "2001-02-20", "2001-04-27", "2001-02-20", "2001-05-17", "2001-05-19", "2001-08-20", "2001-12-29", "2001-06-28", "2001-09-07")),
      Amount = c(449, 9.99, 49.99, 69, 299, 5.99, 229, 229, 139, 99),
      ItemNumber = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
      ItemCode = c("I0000000808", "I0000000622", "I0000000284", "I0000000488", "I0000000595", "I0000000078", "I0000000783", "I0000000783", "I0000000537", "I0000000532")
    )
    
    # Convert to data.table
    setDT(dt)
    
    dt[, .(
      FirstPurchaseDate = min(Date),
      LastPurchaseDate = max(Date),
      NumberOrders = .N,
      NumberSKUs = length(unique(ItemCode)),
      TotalAmounts = sum(Amount)
    ), by = CardID]
    

    Result:

            CardID FirstPurchaseDate LastPurchaseDate NumberOrders NumberSKUs TotalAmounts
    1: C0100000111        2001-02-20       2001-07-19            6          6       882.97
    2: C0100000199        2001-06-28       2001-12-29            3          2       597.00
    3: C0100000343        2001-09-07       2001-09-07            1          1        99.00
    

    Edit: Akrun was first, so go with his answer! I am leaving this here only as a data.table reference. I should start using dplyr more...
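
For comparison, the same per-card summary can be sketched in base R with no extra packages. This is only an illustration, not part of either answer: the helper name `summarise_card` is hypothetical, and the data is a four-row subset of the question's table.

```r
# Subset of the question's data: two cards, four rows
df1 <- data.frame(
  CardID   = c("C0100000199", "C0100000199", "C0100000199", "C0100000343"),
  Date     = as.Date(c("2001-08-20", "2001-12-29", "2001-06-28", "2001-09-07")),
  Amount   = c(229, 229, 139, 99),
  ItemCode = c("I0000000783", "I0000000783", "I0000000537", "I0000000532")
)

# Summarise the rows of one card into a single-row data frame
summarise_card <- function(d) {
  data.frame(
    CardID            = d$CardID[1],
    FirstPurchaseDate = min(d$Date),
    LastPurchaseDate  = max(d$Date),
    NumberOrders      = nrow(d),
    NumberSKUs        = length(unique(d$ItemCode)),
    TotalAmounts      = sum(d$Amount)
  )
}

# Split by card, summarise each piece, then stack the one-row results
out <- do.call(rbind, lapply(split(df1, df1$CardID), summarise_card))
rownames(out) <- NULL
out
```

On larger data the dplyr and data.table versions are likely to be faster, since this approach builds one small data frame per card before binding them together.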

    【Discussion】: