How to structure data in summary format using R

【Question title】: How to structure data in summary format using R
【Posted】: 2021-03-17 13:06:10
【Question】:

I have created the data frame shown below in R.

My_DF

ID        Date                  Type       Remark      Price
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_O      1000
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_O      1200
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_O      1600
PRT-11    2020-12-01 10:12:14   SS_RG      AT_1_A      1600
PRT-11    2020-12-01 10:12:14   SS_RG      AT_1_B      1600
PRT-11    2020-12-01 10:12:14   SS_RG      AT_1_C      1000
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_Y      1200
PRT-11    2020-12-07 10:12:14   SS_RT      AT_1_U      1600
PRT-11    2020-12-07 10:12:14   SS_RI      AT_1_M      1600
PRT-11    2020-12-07 10:12:14   SS_RO      AT_1_P      1600

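I want to convert the DF above into a summary data frame with the structure described below, and render it as HTML so that it can be sent by email using the mailR library.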

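I apply the following conditions.

If Type equals SS_RT, the row is Type - A
If Type is not SS_RT, the row is Type - B
If Type equals SS_RT and Remark equals AT_1_O, the row is Type - A1
If Type equals SS_RT and Remark is not AT_1_O, the row is Type - A2
The formula for Type - A1 (Excl) is Type - A1 divided by the sum of Type - A1 and Type - A2
The formula for Type - A2 (Excl) is Type - A2 divided by the sum of Type - A1 and Type - A2

All the other %age formulas are straightforward, with Total as the denominator.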
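The rules above can be sketched in base R as follows. This is only an illustration: `classify_type` is a hypothetical helper name, and the three sample rows are made up to exercise each branch.

```r
# Hypothetical helper (not from the original post): map Type/Remark to a group label.
classify_type <- function(type, remark) {
  ifelse(type != "SS_RT", "B",
         ifelse(remark == "AT_1_O", "A1", "A2"))
}

# Made-up sample covering the three branches
type   <- c("SS_RT", "SS_RT", "SS_RG")
remark <- c("AT_1_O", "AT_1_Y", "AT_1_A")
grp <- classify_type(type, remark)

# "Excl" shares: A1 and A2 relative to Type - A only (A1 + A2), not to Total
n_a1 <- sum(grp == "A1")
n_a2 <- sum(grp == "A2")
a1_excl <- n_a1 / (n_a1 + n_a2)
a2_excl <- n_a2 / (n_a1 + n_a2)
```

Type - A is then just A1 plus A2, so only three labels are needed.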

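The data frame may have no entries for some dates. To handle this, we need to take the minimum and maximum of all available dates and make sure that, for every date in that range with no data, both the count and the sum columns show 0.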
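A minimal base R sketch of that requirement, assuming a summary with one count per date (the counts 7 and 3 are the row counts of the sample data for 2020-12-01 and 2020-12-07; the names `summary_df` and `full` are illustrative):

```r
# Per-date summary for the two dates that exist in the sample data
summary_df <- data.frame(
  Date  = as.Date(c("2020-12-01", "2020-12-07")),
  count = c(7L, 3L)
)

# Every calendar day between the minimum and maximum date
all_dates <- seq(min(summary_df$Date), max(summary_df$Date), by = "day")

# Left join onto the full calendar, then turn the resulting NAs into 0
full <- merge(data.frame(Date = all_dates), summary_df, by = "Date", all.x = TRUE)
full$count[is.na(full$count)] <- 0L
```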

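Each date spans two rows: the first row holds the counts and the second row holds the sums, grouped according to the logic defined above.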

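【Comments】:

Hi. Can you confirm that the first total price should be 4000? My calculation says it should be 5000 (i.e. 1000 + 1200 + 1600 + 1200). Also, your expected output includes every date of the month, while the provided dataset only covers 6 days. I think there are already answers on completing a dataset, but you should provide details: what should the order of the dates be?
@Cole: Your calculation is correct, it should be 5000. Also, taking the minimum and maximum dates into account, the dates should be in descending order. If the minimum date in the dataset is December 1 and the maximum is December 10, and there are no entries for the 2nd, 4th and 6th, then we need to create a blank entry for each of them.
Thanks for clarifying. I don't follow the blank entries. Why not create one for December 8?
@Cole: It lets the user know that there were no entries on a particular date, so that they can check accordingly.

Tags: r dataframe dplyr html-table tidyverse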

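【Solution 1】:

This is a lot of busy work. It is basically everything you described, written in base R and grouped by date with data.table. Note that I am not sure whether the prices in the OP are accurate, or whether I have got something else wrong.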

# `dt` is the data.table built in the "Data" block further down.
dt[, Date := as.POSIXct(Date, tz = "UTC")]
dt[,
   {
     # counts: Type A (SS_RT) vs Type B (everything else)
     t_ss_rt = Type == 'SS_RT'
     Type_A = sum(t_ss_rt)
     Type_B = .N - Type_A
     
     # price totals for A and B
     tot_Price = sum(Price)
     Type_A_price = sum(Price[t_ss_rt])
     Type_B_price = tot_Price - Type_A_price
     
     # split Type A into A1 (Remark AT_1_O) and A2 (any other Remark)
     rm_ss_rt = t_ss_rt & Remark == 'AT_1_O'
     Type_A1 = sum(rm_ss_rt)
     Type_A2 = Type_A - Type_A1
     
     tot_An_Price = sum(Price[t_ss_rt])
     Type_A1_Price = sum(Price[rm_ss_rt])
     Type_A2_Price = tot_An_Price - Type_A1_Price
     
     # "Excl" shares: relative to Type A only
     Type_A1_Excl = Type_A1 / (Type_A1 + Type_A2)
     Type_A2_Excl = Type_A2 / (Type_A1 + Type_A2)
     
     # each column is c(count, sum), which yields two rows per date;
     # the price share of Type B is left as NA here
     .(c(Type_A, Type_A_price), c(Type_A / .N, Type_A_price / tot_Price),
       c(Type_A1, Type_A1_Price), c(Type_A1 / .N, Type_A1_Price / tot_Price),
       c(Type_A2, Type_A2_Price), c(Type_A2 / .N, Type_A2_Price / tot_Price),
       c(Type_B, Type_B_price), c(Type_B / .N, NA_real_), c(.N, tot_Price),
       c(Type_A1_Excl, Type_A1_Price / (tot_An_Price)), c(Type_A2_Excl, Type_A2_Price / tot_An_Price))
   },
   by = .(Date)]

This gives the following result:

                  Date   V1        V2   V3        V4   V5        V6   V7        V8   V9  V10  V11
1: 2020-12-01 10:12:14    4 0.5714286    3 0.4285714    1 0.1428571    3 0.4285714    7 0.75 0.25
2: 2020-12-01 10:12:14 5000 0.5434783 3800 0.4130435 1200 0.1304348 4200        NA 9200 0.76 0.24
3: 2020-12-07 10:12:14    1 0.3333333    0 0.0000000    1 0.3333333    2 0.6666667    3 0.00 1.00
4: 2020-12-07 10:12:14 1600 0.3333333    0 0.0000000 1600 0.3333333 3200        NA 4800 0.00 1.00
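The columns come back unnamed (V1 to V11) because the list in `j` has no names. One way to label them afterwards is `setnames`; in the sketch below, `res` is a hand-built stand-in for the first result row, and the new column names are my own reading of the order used in the `.()` call, not names from the original answer.

```r
library(data.table)

# Stand-in for the first row of the result above (values copied from the printed output)
res <- data.table(
  Date = as.POSIXct("2020-12-01 10:12:14", tz = "UTC"),
  V1 = 4, V2 = 4 / 7, V3 = 3, V4 = 3 / 7, V5 = 1, V6 = 1 / 7,
  V7 = 3, V8 = 3 / 7, V9 = 7, V10 = 0.75, V11 = 0.25
)

# Assumed labels, following the order of the columns in the .() call
new_names <- c("A", "A_pct", "A1", "A1_pct", "A2", "A2_pct",
               "B", "B_pct", "Total", "A1_excl", "A2_excl")

# setnames renames by reference, so no reassignment is needed
setnames(res, paste0("V", 1:11), new_names)
```

Alternatively, the names could be given directly inside the list, for example `.(A = c(Type_A, Type_A_price), ...)`.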

Data:

library(data.table)

dt = data.table::fread(
"ID   ,     Date      ,            Type ,      Remark,      Price
PRT-11,    2020-12-01 10:12:14,   SS_RT,      AT_1_O   ,   1000
PRT-11,    2020-12-01 10:12:14,   SS_RT ,     AT_1_O  ,    1200
PRT-11,    2020-12-01 10:12:14,   SS_RT  ,    AT_1_O ,     1600
PRT-11,    2020-12-01 10:12:14,   SS_RG   ,   AT_1_A,      1600
PRT-11,    2020-12-01 10:12:14,   SS_RG    ,  AT_1_B     , 1600
PRT-11,    2020-12-01 10:12:14,   SS_RG     , AT_1_C    ,  1000
PRT-11,    2020-12-01 10:12:14,   SS_RT,      AT_1_Y   ,   1200
PRT-11,    2020-12-07 10:12:14,   SS_RT ,     AT_1_U  ,    1600
PRT-11,    2020-12-07 10:12:14,   SS_RI  ,    AT_1_M ,     1600
PRT-11,    2020-12-07 10:12:14,   SS_RO   ,   AT_1_P,      1600")

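【Comments】:

【Solution 2】:

Here is a data.table solution. I tried to avoid manual calculations and went for an approach based on a long-to-wide transformation. The full solution comes first, followed by a detailed explanation: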

library(lubridate)
library(data.table)

dt <- setDT(dt)
dt[, Date := date(Date)]
dt[, type := fifelse(Type == "SS_RT", fifelse(Remark == "AT_1_O", "A1", "A2"), "B")]
## transform to wide: the first dcast has no fun.aggregate, so it falls back to length (counts)
df2 <- rbind(dcast(data = dt, Date ~ type, value.var = "Price", fill = 0)[, linetype := "count"],
             dcast(data = dt, Date ~ type, value.var = "Price", fill = 0, fun.aggregate = sum)[, linetype := "value"])
## A and tot
df2[, tot := rowSums(.SD), .SDcols = c("A1", "A2", "B")]
df2[, A := A1 + A2]
## create pc
cols <- c("A", "A1", "A2", "B")
df2[, paste0(cols, "_pc") := lapply(.SD, function(x) round(x / tot * 100)), .SDcols = cols]
cols <- c("A1", "A2")
df2[, paste0(cols, "_exc") := lapply(.SD, function(x) round(x / (A1 + A2) * 100)), .SDcols = cols]
## add missing dates
df2 <- merge(CJ(Date = seq(min(dt$Date), max(dt$Date), 1), linetype = c("count", "value")),
             df2, all = TRUE, by = c("Date", "linetype"))

## missing dates (and any NaN from 0/0) become 0
df2[is.na(df2)] <- 0
df2[, linetype := NULL]
df2

          Date   A1   A2    B  tot    A A_pc A1_pc A2_pc B_pc A1_exc A2_exc
 1: 2020-12-01    3    1    3    7    4   57    43    14   43     75     25
 2: 2020-12-01 3800 1200 4200 9200 5000   54    41    13   46     76     24
 3: 2020-12-02    0    0    0    0    0    0     0     0    0      0      0
 4: 2020-12-02    0    0    0    0    0    0     0     0    0      0      0
 5: 2020-12-03    0    0    0    0    0    0     0     0    0      0      0
 6: 2020-12-03    0    0    0    0    0    0     0     0    0      0      0
 7: 2020-12-04    0    0    0    0    0    0     0     0    0      0      0
 8: 2020-12-04    0    0    0    0    0    0     0     0    0      0      0
 9: 2020-12-05    0    0    0    0    0    0     0     0    0      0      0
10: 2020-12-05    0    0    0    0    0    0     0     0    0      0      0
11: 2020-12-06    0    0    0    0    0    0     0     0    0      0      0
12: 2020-12-06    0    0    0    0    0    0     0     0    0      0      0
13: 2020-12-07    0    1    2    3    1   33     0    33   67      0    100
14: 2020-12-07    0 1600 3200 4800 1600   33     0    33   67      0    100

So the first step is to create the type variable according to your rules:

dt[,Date := date(Date)]
dt[,type := fifelse(Type == "SS_RT",fifelse(Remark == "AT_1_O","A1","A2"),"B")]

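We know that A is simply A1 + A2, which lets me cast the table to wide format on type alone. I do this twice: once to count rows, and once to sum Price for each type: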

dcast(data = dt,Date ~ type ,value.var = "Price",fill = 0)

         Date A1 A2 B 
1: 2020-12-01  3  1 3    
2: 2020-12-07  0  1 2

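Here I count the occurrences of each type, because dcast falls back to its default aggregation, length, when none is given. If I use sum as the aggregation function instead: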

dcast(data = dt,Date~type ,value.var = "Price",fill = 0,fun.aggregate = sum)

         Date   A1   A2    B
1: 2020-12-01 3800 1200 4200
2: 2020-12-07    0 1600 3200
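To make the difference between the two casts explicit, here is a small sketch on made-up data that passes `fun.aggregate` in both calls. It leaves `fill` unset on purpose: per the `dcast` documentation, missing cells are then filled by applying the aggregation function to a zero-length vector, which gives 0 for both `length` and `sum`. The table `toy` and its values are illustrative only.

```r
library(data.table)

# Made-up long data: two A1 rows on one date, one B row on another
toy <- data.table(
  Date  = as.Date(c("2020-12-01", "2020-12-01", "2020-12-07")),
  type  = c("A1", "A1", "B"),
  Price = c(1000, 1200, 1600)
)

# Number of rows per Date/type
counts <- dcast(toy, Date ~ type, value.var = "Price", fun.aggregate = length)

# Total Price per Date/type
sums <- dcast(toy, Date ~ type, value.var = "Price", fun.aggregate = sum)
```

Naming the aggregation explicitly also avoids the message data.table prints when it has to default to `length`.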

I add a linetype variable, which helps later when adding the missing dates (I use it to keep two rows per date).

I bind the two tables together and get:

         Date   A1   A2    B linetype
1: 2020-12-01    3    1    3    count
2: 2020-12-07    0    1    2    count
3: 2020-12-01 3800 1200 4200    value
4: 2020-12-07    0 1600 3200    value

Then I compute A and the total:

df2[,tot := rowSums(.SD),.SDcols = c("A1","A2","B")]
df2[,A := A1+A2]

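Then I use lapply with a vector of the columns to transform to compute the percentage (_pc) and Excl variables (named _exc for simplicity). Note that the code as shown does not guard against division by zero explicitly: a 0/0 result becomes NaN, which the later df2[is.na(df2)] <- 0 step replaces with 0: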

cols <- c("A","A1","A2","B")
df2[,paste0(cols,"_pc") := lapply(.SD,function(x) round(x/tot*100) ),.SDcols = cols]
cols <- c("A1","A2")
df2[,paste0(cols,"_exc") := lapply(.SD,function(x) round(x/(A1+A2)*100) ),.SDcols = cols]


         Date   A1   A2    B linetype  tot    A A_pc A1_pc A2_pc B_pc A1_exc A2_exc
1: 2020-12-01    3    1    3    count    7    4   57    43    14   43     75     25
2: 2020-12-01 3800 1200 4200    value 9200 5000   54    41    13   46     76     24
3: 2020-12-07    0    1    2    count    3    1   33     0    33   67      0    100
4: 2020-12-07    0 1600 3200    value 4800 1600   33     0    33   67      0    100
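If an explicit guard against a zero denominator is preferred, something along these lines could be used. This is a sketch, not part of the original answer: `safe_pct` is a hypothetical helper, and `toy` is made-up data where the second row has A1 + A2 equal to 0.

```r
library(data.table)

# Hypothetical helper: rounded percentage, returning 0 when the denominator is 0
safe_pct <- function(num, den) {
  fifelse(den == 0, 0, round(num / den * 100))
}

# Made-up rows: the second one would otherwise produce NaN (0 / 0)
toy <- data.table(A1 = c(3, 0), A2 = c(1, 0))
toy[, A1_exc := safe_pct(A1, A1 + A2)]
toy[, A2_exc := safe_pct(A2, A1 + A2)]
```

`fifelse` evaluates both branches for every row, so the division still happens; the guard only decides which value is kept.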

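Then I add the missing dates by merging with every combination of linetype and Date and keeping all rows. I use the CJ function to create a data.table containing all combinations of the two variables: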

CJ(Date = seq(min(dt$Date),max(dt$Date),1),linetype = c("count","value"))
          Date linetype
 1: 2020-12-01    count
 2: 2020-12-01    value
 3: 2020-12-02    count
 4: 2020-12-02    value
 5: 2020-12-03    count
 6: 2020-12-03    value
 7: 2020-12-04    count
 8: 2020-12-04    value
 9: 2020-12-05    count
10: 2020-12-05    value
11: 2020-12-06    count
12: 2020-12-06    value
13: 2020-12-07    count
14: 2020-12-07    value

Then I replace the missing values with 0 and drop the linetype variable.

You can then reorder the columns with setcolorder and generate your HTML output with kableExtra.
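As a rough sketch of the HTML and email step, assuming the knitr package for the table and mailR for sending: `summary_df` below is a two-row stand-in for the final table, and `send_summary`, the subject line and the `smtp` settings are hypothetical placeholders. The function is defined but not called, since it needs a real SMTP server.

```r
library(knitr)

# Stand-in for the final summary table (values from the 2020-12-01 rows)
summary_df <- data.frame(
  Date = c("2020-12-01", "2020-12-01"),
  A1 = c(3, 3800), A2 = c(1, 1200), B = c(3, 4200)
)

# Render as an HTML <table> and flatten to a single string for the mail body
html_table <- kable(summary_df, format = "html")
html_body  <- paste(as.character(html_table), collapse = "\n")

# Hypothetical wrapper around mailR::send.mail; `smtp` is a list such as
# list(host.name = ..., port = ..., user.name = ..., passwd = ..., ssl = TRUE)
send_summary <- function(body, from, to, smtp) {
  mailR::send.mail(from = from, to = to, subject = "Daily summary",
                   body = body, html = TRUE, smtp = smtp,
                   authenticate = TRUE, send = TRUE)
}
```

For merging repeated dates into one cell, kableExtra's `collapse_rows()` applied to the kable object (for example on column 1) is probably the place to start.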

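You can do the same with dplyr: use pivot_wider (from tidyr) to cast to wide, mutate_all in place of lapply(.SD, ...) for the calculations, and expand.grid in place of CJ to generate the table of missing dates.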
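A possible tidyverse version is sketched below on made-up data. It departs from the recipe above in two labeled ways: it uses `tidyr::complete()` rather than `expand.grid` to add the missing dates, and plain `mutate()` rather than `mutate_all`. Column names such as `linetype`, `tot` and `A_pc` simply mirror the data.table answer.

```r
library(dplyr)
library(tidyr)

# Made-up input with a gap on 2020-12-02
df <- data.frame(
  Date   = as.Date(c("2020-12-01", "2020-12-01", "2020-12-01", "2020-12-03")),
  Type   = c("SS_RT", "SS_RT", "SS_RG", "SS_RT"),
  Remark = c("AT_1_O", "AT_1_Y", "AT_1_A", "AT_1_U"),
  Price  = c(1000, 1200, 1600, 1600)
)

wide <- df %>%
  # label each row A1 / A2 / B
  mutate(type = if_else(Type == "SS_RT",
                        if_else(Remark == "AT_1_O", "A1", "A2"), "B")) %>%
  # count and sum per Date/type
  group_by(Date, type) %>%
  summarise(count = n(), value = sum(Price), .groups = "drop") %>%
  # one row per Date/linetype, one column per type
  pivot_longer(c(count, value), names_to = "linetype", values_to = "x") %>%
  pivot_wider(names_from = type, values_from = x, values_fill = 0) %>%
  # add the missing dates, filled with 0
  complete(Date = seq(min(Date), max(Date), by = "day"), linetype,
           fill = list(A1 = 0, A2 = 0, B = 0)) %>%
  mutate(A = A1 + A2,
         tot = A + B,
         A_pc = if_else(tot == 0, 0, round(A / tot * 100)))
```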

【Comments】:

Thanks. How can I merge identical dates in the HTML output?
You can use kableExtra to convert your table to HTML: cran.r-project.org/web/packages/kableExtra/vignettes/…. It has many options for coloring, styling and so on.
I get the error Error in `:=`(Date, date(Date)) : Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
Yes, dt has to be a data.table. dt <- setDT(dt)
I am getting an error at df2[,paste0(cols,"_pc") := lapply(.SD,function(x) round(x/tot*100) ),.SDcols = cols]. The error is Error in FUN(X[[i]], ...) : object 'tot' not found