【Question Title】: Data manipulation and pivoting
【Posted】: 2021-04-08 10:08:47
【Question】:

I have a transaction dataset in which every transaction for a customer appears on its own row, like this:

Customer_ID     Transaction_Date    Amount
Cust_1           20-Dec-2020          100
Cust_1           28-Dec-2020          800
Cust_1           05-Jan-2021          300
Cust_2           10-Jan-2021          200
Cust_2           08-Feb-2021          300
Cust_3           15-Feb-2021          500

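I tried to spread the dates into separate columns named "1st_Trans_Date", "2nd_Trans_Date", and so on, but R gave me a sparse matrix with one column per unique date, which produced more than 1000 columns.

I would like to restructure the data into separate columns, with a few calculated fields, like this: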

Customer_ID    1st_Trans_Date    2nd_Trans_Date    3rd_Trans_Date    Total_Trans    Total_Amt    Avg_Amt
Cust_1         20-Dec-2020       28-Dec-2020       05-Jan-2021         3             1200         400
Cust_2         10-Jan-2021       08-Feb-2021                           2              500         250
Cust_3         15-Feb-2021                                             1              500         500 
 

【Comments】:

Tags: r dataframe pivot pivot-table data-manipulation


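【Answer 1】:

You can number and count the transactions for each customer, then reshape the data to wide format. Note that this drops Amount, so it does not produce the Total_Amt and Avg_Amt columns.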

library(dplyr)
library(tidyr)

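df %>%
  group_by(Customer_ID) %>%
  # Number each customer's transactions in row order and count them
  mutate(row = row_number(), 
         Total_Trans = n()) %>%
  ungroup() %>%
  # Amount varies within a customer, so drop it before widening
  select(-Amount) %>%
  # One column per transaction position: Trans_Date1, Trans_Date2, ...
  pivot_wider(names_from = row, values_from = Transaction_Date, 
              names_prefix = 'Trans_Date')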

# Customer_ID Total_Trans Trans_Date1 Trans_Date2 Trans_Date3
#  <chr>             <int> <chr>       <chr>       <chr>      
#1 Cust_1                3 20-Dec-2020 28-Dec-2020 05-Jan-2021
#2 Cust_2                2 10-Jan-2021 08-Feb-2021 NA         
#3 Cust_3                1 15-Feb-2021 NA          NA         

Data

df <- structure(list(Customer_ID = c("Cust_1", "Cust_1", "Cust_1", 
"Cust_2", "Cust_2", "Cust_3"), Transaction_Date = c("20-Dec-2020", 
"28-Dec-2020", "05-Jan-2021", "10-Jan-2021", "08-Feb-2021", "15-Feb-2021"
), Amount = c(100L, 800L, 300L, 200L, 300L, 500L)), 
class = "data.frame", row.names = c(NA, -6L))
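The numbering above follows the row order of df, so it gives the first, second and third transaction only when the rows are already in chronological order. The following is a minimal sketch of one way to sort first and to add the totals at the same time. It assumes the dates always follow the dd-Mon-yyyy pattern with English month abbreviations; the helper `parse_dmy` and the shuffled data frame `tx` are hypothetical names introduced here for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: parse "20-Dec-2020" without relying on the session locale
parse_dmy <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)
  day   <- as.integer(vapply(parts, `[`, character(1), 1))
  month <- match(vapply(parts, `[`, character(1), 2), month.abb)
  year  <- as.integer(vapply(parts, `[`, character(1), 3))
  as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

# Deliberately shuffled copy of the sample data (assumption, for illustration)
tx <- data.frame(
  Customer_ID = c("Cust_1", "Cust_2", "Cust_1", "Cust_1", "Cust_3", "Cust_2"),
  Transaction_Date = c("05-Jan-2021", "08-Feb-2021", "20-Dec-2020",
                       "28-Dec-2020", "15-Feb-2021", "10-Jan-2021"),
  Amount = c(300L, 300L, 100L, 800L, 500L, 200L)
)

wide <- tx %>%
  # Sort chronologically within each customer before numbering
  arrange(Customer_ID, parse_dmy(Transaction_Date)) %>%
  group_by(Customer_ID) %>%
  mutate(row = row_number(),
         Total_Trans = n(),
         Total_Amt = sum(Amount),
         Avg_Amt = mean(Amount)) %>%
  ungroup() %>%
  select(-Amount) %>%
  pivot_wider(names_from = row, values_from = Transaction_Date,
              names_prefix = "Trans_Date")
```

Because Total_Trans, Total_Amt and Avg_Amt are constant within a customer, pivot_wider treats them as identifier columns and keeps one row per customer.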

【Comments】:

  • Thanks for your help, Ronak. I'll try this and check the results.
【Answer 2】:

Here is a data.table approach.

library(data.table)
# Sample data -----
DT <- fread("Customer_ID     Transaction_Date    Amount
Cust_1           20-Dec-2020          100
Cust_1           28-Dec-2020          800
Cust_1           05-Jan-2021          300
Cust_2           10-Jan-2021          200
Cust_2           08-Feb-2021          300
Cust_3           15-Feb-2021          500")
# Summarise by customer: collapse the dates into one string (unnamed, so it becomes V1)
ans <- DT[, .(paste0(Transaction_Date, collapse = ";"),
              Total_Trans = .N), 
          by = .(Customer_ID)]
# Split the transaction dates into one column per position, then drop V1
ans[, paste0("Trans_Date", seq_along(tstrsplit(ans$V1, ";"))) := tstrsplit(V1, ";")][, V1 := NULL]
#    Customer_ID Total_Trans Trans_Date1 Trans_Date2 Trans_Date3
# 1:      Cust_1           3 20-Dec-2020 28-Dec-2020 05-Jan-2021
# 2:      Cust_2           2 10-Jan-2021 08-Feb-2021        <NA>
# 3:      Cust_3           1 15-Feb-2021        <NA>        <NA>
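As an alternative to pasting and re-splitting the dates, the same wide layout can probably be reached with `rowid()` and `dcast()`, with the totals the question asks for merged in afterwards. This is a sketch under the assumption that the rows are already in chronological order within each customer; the column name `pos` and the objects `dates_wide`, `totals` and `res` are illustrative names, not part of the original answer.

```r
library(data.table)

DT <- data.table(
  Customer_ID = c("Cust_1", "Cust_1", "Cust_1", "Cust_2", "Cust_2", "Cust_3"),
  Transaction_Date = c("20-Dec-2020", "28-Dec-2020", "05-Jan-2021",
                       "10-Jan-2021", "08-Feb-2021", "15-Feb-2021"),
  Amount = c(100L, 800L, 300L, 200L, 300L, 500L)
)

# Label each customer's transactions in row order: Trans_Date1, Trans_Date2, ...
DT[, pos := rowid(Customer_ID, prefix = "Trans_Date")]

# One row per customer, one column per transaction position (missing cells become NA)
dates_wide <- dcast(DT, Customer_ID ~ pos, value.var = "Transaction_Date")

# Per-customer summaries
totals <- DT[, .(Total_Trans = .N,
                 Total_Amt   = sum(Amount),
                 Avg_Amt     = mean(Amount)),
             by = Customer_ID]

res <- merge(dates_wide, totals, by = "Customer_ID")
```

With ten or more transactions per customer the generated column names may sort as text (Trans_Date10 before Trans_Date2), so the columns might need reordering afterwards.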

【Comments】:

    【Answer 3】:

    For completeness, here is a solution that actually produces the columns the OP asked for:

    
    
    data <- read.table( text=
    "Customer_ID     Transaction_Date    Amount
    Cust_1           20-Dec-2020          100
    Cust_1           28-Dec-2020          800
    Cust_1           05-Jan-2021          300
    Cust_2           10-Jan-2021          200
    Cust_2           08-Feb-2021          300
    Cust_3           15-Feb-2021          500
    ", header=TRUE)
    
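    library(dplyr)
    library(tidyr)
    library(magrittr)   # provides the %<>% assignment pipe
    library(toOrdinal)
    
    # Per customer: label each transaction (1st, 2nd, ...) and compute the summaries
    data %<>% group_by(Customer_ID) %>%
        mutate(
            TxNo = paste0( sapply(seq(n()),toOrdinal),"_Trans_Date" ),
            Total_Trans = n(),
            Total_Amt = sum(Amount),
            Avg_Amt = mean(Amount)
        ) %>%
        ungroup()
    
    # Widen: one date column per ordinal label, one row per customer
    data %>% pivot_wider(
                 id_cols=c("Customer_ID","Total_Trans","Total_Amt","Avg_Amt"),
                 names_from=TxNo,
                 values_from=Transaction_Date ) %>% print.data.frame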
    
    

    Output:

      Customer_ID Total_Trans Total_Amt Avg_Amt 1st_Trans_Date 2nd_Trans_Date 3rd_Trans_Date
    1      Cust_1           3      1200     400    20-Dec-2020    28-Dec-2020    05-Jan-2021
    2      Cust_2           2       500     250    10-Jan-2021    08-Feb-2021           <NA>
    3      Cust_3           1       500     500    15-Feb-2021           <NA>           <NA>
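If installing toOrdinal is not an option, the ordinal labels can be built with a small base R helper. The function below, `ordinal_label`, is a hypothetical stand-in written for this illustration and is not part of the toOrdinal package; it covers English suffixes for positive whole numbers only.

```r
# Hypothetical helper: English ordinal labels for positive whole numbers
ordinal_label <- function(n) {
  suffix <- ifelse(n %% 100 %in% 11:13, "th",      # 11th, 12th, 13th, 111th, ...
            ifelse(n %% 10 == 1, "st",
            ifelse(n %% 10 == 2, "nd",
            ifelse(n %% 10 == 3, "rd", "th"))))
  paste0(n, suffix)
}

# Column names in the style the question asks for
col_names <- paste0(ordinal_label(1:3), "_Trans_Date")
```

Because the helper is vectorised, `paste0(ordinal_label(seq(n())), "_Trans_Date")` could take the place of the `sapply(seq(n()), toOrdinal)` call inside `mutate()`.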
    

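    【Comments】:

    • Thank you very much for the guidance. It really helped me understand how data manipulation works in R and helped me solve the problem I was facing.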