Breaking down percentile data by date using read.table in R (answers)

【Question title】: Breaking down percentile data by date using read.table in R
【Posted】: 2020-02-05 18:17:43
【Question】:

I have the following toy dataset:

library(data.table)

# Each Date value contains spaces, so it is quoted to keep it in a single field
dt <- read.table(text = "
Date                      Model      Color    Value   Samples
'1/29/2020 6:51:19 AM'    Gold       Blue     0.5     500
'1/29/2020 7:57:47 AM'    Gold       Red      0.0     449
'1/29/2020 3:39:04 PM'    Silver     Blue     0.75    1320
'1/29/2020 5:04:32 PM'    Silver     Blue     1.5     103
'1/29/2020 10:32:39 AM'   Gold       Red      0.7     891
'1/30/2020 1:02:12 AM'    Gold       Blue     0.41    18103
'1/30/2020 4:30:00 AM'    Copper     Blue     0.83    564
'1/30/2020 9:09:45 AM'    Silver     Pink     1.17    173
'1/30/2020 2:19:30 PM'    Platinum   Brown    0.43    793
'1/30/2020 4:43:32 PM'    Platinum   Red      0.71    1763
'1/30/2020 7:19:00 PM'    Gold       Orange   1.92    503",
                 header = TRUE, stringsAsFactors = FALSE)
setDT(dt)  # convert to a data.table so the [i, j, by] syntax below works

I then take this data.table and generate some percentile data, as follows:

qs = dt[Value > 0, .(Samples = sum(Samples),
                     '50th'    = quantile(Value, probs = c(0.50)),
                     '75th'    = quantile(Value, probs = c(0.75)),
                     '90th'    = quantile(Value, probs = c(0.90)), 
                     '99th'    = quantile(Value, probs = c(0.99))),
        by = .(Model, Color)]
setkey(qs, 'Model')

Finally, I write the results to a .csv file:

#outputs to csv file

write.csv(qs, file = "outfile.csv")

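Question: how would I write the results so that:

a) the results are broken down by date (i.e., only the date is used, for example January 30, 2020 and January 31, 2020, without the time)
b) the dates are written as rows

For example (note: the values below are toy data, not real calculations; I only want to show how the "Date" column should appear):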

#       Model  Color Samples  50th   99th  99.9th  99.99th Date
# 1:   Copper   Blue     564 0.830 0.8300 0.83000 0.830000 01/29/2020
# 2:     Gold   Blue   18603 0.455 0.4991 0.49991 0.499991 01/29/2020
# 3:     Gold    Red     891 0.700 0.7000 0.70000 0.700000 01/29/2020
# 4:     Gold Orange     503 1.920 1.9200 1.92000 1.920000 01/29/2020
# 5: Platinum  Brown     793 0.430 0.4300 0.43000 0.430000 01/29/2020
# 6: Platinum    Red    1763 0.710 0.7100 0.71000 0.710000 01/29/2020
# 7:   Silver   Blue    1423 1.125 1.4925 1.49925 1.499925 01/29/2020
# 8:   Silver   Pink     173 1.170 1.1700 1.17000 1.170000 01/29/2020
# 9:   Copper   Blue     564 0.830 0.8300 0.83000 0.830000 01/30/2020
#10:     Gold   Blue   18603 0.455 0.4991 0.49991 0.499991 01/30/2020
#11:     Gold    Red     891 0.700 0.7000 0.70000 0.700000 01/30/2020
#12:     Gold Orange     503 1.920 1.9200 1.92000 1.920000 01/30/2020
#13: Platinum  Brown     793 0.430 0.4300 0.43000 0.430000 01/30/2020
#14: Platinum    Red    1763 0.710 0.7100 0.71000 0.710000 01/30/2020
#15:   Silver   Blue    1423 1.125 1.4925 1.49925 1.499925 01/30/2020
#16:   Silver   Pink     173 1.170 1.1700 1.17000 1.170000 01/30/2020

Thanks!

【Question comments】:

Tags: r dataframe data.table

【Solution 1】:

If we need to create the columns in the original dataset, use :=

library(data.table)  # provides setDT() and :=
library(lubridate)
setDT(dt)[Value > 0,  c("Samples", '50th', '75th', '90th', '99th') := 
    c(list(sum(Samples)), as.list(quantile(Value,
      probs = c(0.50, 0.75, 0.90, 0.99)))),
        .(Model, Color, DateNoTime = as.Date(mdy_hms(Date)) )]
dt
#                     Date    Model  Color Value Samples  50th   75th  90th   99th
# 1:  1/29/2020 6:51:19 AM     Gold   Blue  0.50     500 0.500 0.5000 0.500 0.5000
# 2:  1/29/2020 7:57:47 AM     Gold    Red  0.00     449    NA     NA    NA     NA
# 3:  1/29/2020 3:39:04 PM   Silver   Blue  0.75    1423 1.125 1.3125 1.425 1.4925
# 4:  1/29/2020 5:04:32 PM   Silver   Blue  1.50    1423 1.125 1.3125 1.425 1.4925
# 5: 1/29/2020 10:32:39 AM     Gold    Red  0.70     891 0.700 0.7000 0.700 0.7000
# 6:  1/30/2020 1:02:12 AM     Gold   Blue  0.41   18103 0.410 0.4100 0.410 0.4100
# 7:  1/30/2020 4:30:00 AM   Copper   Blue  0.83     564 0.830 0.8300 0.830 0.8300
# 8:  1/30/2020 9:09:45 AM   Silver   Pink  1.17     173 1.170 1.1700 1.170 1.1700
# 9:  1/30/2020 2:19:30 PM Platinum  Brown  0.43     793 0.430 0.4300 0.430 0.4300
#10:  1/30/2020 4:43:32 PM Platinum    Red  0.71    1763 0.710 0.7100 0.710 0.7100
#11:  1/30/2020 7:19:00 PM     Gold Orange  1.92     503 1.920 1.9200 1.920 1.9200

For rows with Value <= 0, the new percentile columns are filled with NA, and Samples keeps its original value.
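The grouping above depends on reducing each timestamp to its calendar date. A minimal sketch of that step in isolation, using base R instead of the `mdy_hms()` call from the answer (the substitution is an assumption made only to keep the example free of extra packages): when the format string describes just the date part, `as.Date()` ignores the trailing time.

```r
# Sample timestamps copied from the toy dataset
stamps <- c("1/29/2020 6:51:19 AM",
            "1/29/2020 10:32:39 AM",
            "1/30/2020 7:19:00 PM")

# Base-R alternative to as.Date(mdy_hms(...)):
# the format covers only month/day/year, so the time of day is discarded
day_only <- as.Date(stamps, format = "%m/%d/%Y")

# Format back to the MM/DD/YYYY style used in the question's desired output
day_label <- format(day_only, "%m/%d/%Y")
print(day_label)
```

Rows that share the same `day_label` fall into the same group regardless of their time of day, which is what the question asks for.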

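However, if the intent is to fill every row with the summarised values, create "qs" with the date part included in by, and then join it back to the data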

qs <- setDT(dt)[Value > 0, .(Samples = sum(Samples),
                     '50th'    = quantile(Value, probs = c(0.50)),
                     '75th'    = quantile(Value, probs = c(0.75)),
                     '90th'    = quantile(Value, probs = c(0.90)), 
                     '99th'    = quantile(Value, probs = c(0.99))),
        by = .(Model, Color,
          DateNoTime = format(as.Date(mdy_hms(Date)), "%m/%d/%Y") )]



qs[dt, on = .(Model, Color)]
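To connect this back to the original question (one row per date, Model and Color, written to a .csv file), the summary step can be sketched end to end as below. This is only an illustration under a few assumptions: the date is parsed with base `as.Date()` rather than `mdy_hms()`, the grouping column is called `DateNoTime` as in the answer, and the output path `outfile.csv` is taken from the question.

```r
library(data.table)

# Toy data from the question, built directly as a data.table
dt <- data.table(
  Date = c("1/29/2020 6:51:19 AM", "1/29/2020 7:57:47 AM", "1/29/2020 3:39:04 PM",
           "1/29/2020 5:04:32 PM", "1/29/2020 10:32:39 AM", "1/30/2020 1:02:12 AM",
           "1/30/2020 4:30:00 AM", "1/30/2020 9:09:45 AM", "1/30/2020 2:19:30 PM",
           "1/30/2020 4:43:32 PM", "1/30/2020 7:19:00 PM"),
  Model = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
            "Copper", "Silver", "Platinum", "Platinum", "Gold"),
  Color = c("Blue", "Red", "Blue", "Blue", "Red", "Blue",
            "Blue", "Pink", "Brown", "Red", "Orange"),
  Value = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92),
  Samples = c(500L, 449L, 1320L, 103L, 891L, 18103L, 564L, 173L, 793L, 1763L, 503L)
)

# One summary row per (date, Model, Color); rows with Value <= 0 are excluded
qs <- dt[Value > 0,
         .(Samples = sum(Samples),
           `50th` = quantile(Value, probs = 0.50),
           `75th` = quantile(Value, probs = 0.75),
           `90th` = quantile(Value, probs = 0.90),
           `99th` = quantile(Value, probs = 0.99)),
         by = .(DateNoTime = as.Date(Date, format = "%m/%d/%Y"), Model, Color)]

# Sort while DateNoTime is still a Date, so the order is chronological
setorder(qs, DateNoTime, Model, Color)

# Then switch to the MM/DD/YYYY text shown in the question's desired output
qs[, DateNoTime := format(DateNoTime, "%m/%d/%Y")]

out_file <- "outfile.csv"  # file name taken from the question
write.csv(qs, file = out_file, row.names = FALSE)
```

Sorting before formatting matters because MM/DD/YYYY strings do not sort chronologically across years. `row.names = FALSE` keeps the row index out of the file.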

If we do not want to include the date in by and only need it in the output

setDT(dt)[, DateNoTime := as.Date(mdy_hms(Date))
     ][Value > 0,  c("Samples", '50th', '75th', '90th', '99th') := 
    c(list(sum(Samples)), as.list(quantile(Value,
      probs = c(0.50, 0.75, 0.90, 0.99)))),
        .(Model, Color)]
dt
#                     Date    Model  Color Value Samples DateNoTime  50th   75th  90th   99th
# 1:  1/29/2020 6:51:19 AM     Gold   Blue  0.50   18603 2020-01-29 0.455 0.4775 0.491 0.4991
# 2:  1/29/2020 7:57:47 AM     Gold    Red  0.00     449 2020-01-29    NA     NA    NA     NA
# 3:  1/29/2020 3:39:04 PM   Silver   Blue  0.75    1423 2020-01-29 1.125 1.3125 1.425 1.4925
# 4:  1/29/2020 5:04:32 PM   Silver   Blue  1.50    1423 2020-01-29 1.125 1.3125 1.425 1.4925
# 5: 1/29/2020 10:32:39 AM     Gold    Red  0.70     891 2020-01-29 0.700 0.7000 0.700 0.7000
# 6:  1/30/2020 1:02:12 AM     Gold   Blue  0.41   18603 2020-01-30 0.455 0.4775 0.491 0.4991
# 7:  1/30/2020 4:30:00 AM   Copper   Blue  0.83     564 2020-01-30 0.830 0.8300 0.830 0.8300
# 8:  1/30/2020 9:09:45 AM   Silver   Pink  1.17     173 2020-01-30 1.170 1.1700 1.170 1.1700
# 9:  1/30/2020 2:19:30 PM Platinum  Brown  0.43     793 2020-01-30 0.430 0.4300 0.430 0.4300
#10:  1/30/2020 4:43:32 PM Platinum    Red  0.71    1763 2020-01-30 0.710 0.7100 0.710 0.7100
#11:  1/30/2020 7:19:00 PM     Gold Orange  1.92     503 2020-01-30 1.920 1.9200 1.920 1.9200

Data

dt <- structure(list(Date = c("1/29/2020 6:51:19 AM", "1/29/2020 7:57:47 AM", 
"1/29/2020 3:39:04 PM", "1/29/2020 5:04:32 PM", "1/29/2020 10:32:39 AM", 
"1/30/2020 1:02:12 AM", "1/30/2020 4:30:00 AM", "1/30/2020 9:09:45 AM", 
"1/30/2020 2:19:30 PM", "1/30/2020 4:43:32 PM", "1/30/2020 7:19:00 PM"
), Model = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold", 
"Copper", "Silver", "Platinum", "Platinum", "Gold"), Color = c("Blue", 
"Red", "Blue", "Blue", "Red", "Blue", "Blue", "Pink", "Brown", 
"Red", "Orange"), Value = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 
1.17, 0.43, 0.71, 1.92), Samples = c(500L, 449L, 1320L, 103L, 
891L, 18103L, 564L, 173L, 793L, 1763L, 503L)), 
class = "data.frame", row.names = c(NA, 
-11L))

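【Comments】:

Referring to the toy dataset, the aim is to compute the percentile data for just the two dates (ignoring the many different times within those dates).
@equanimity I updated the post. Please check whether that solves the problem
When I create the "qs" data table and use your code, I see the following error: "Error in [.data.table(qs, dt, on = .(Model, Color)) : column(s) [Model,Color] not found in x"
@equanimity I assumed that qs is a data.table
That was a mistake on my part. I meant to mark your answer as accepted (which I have now done). I also upvoted it :).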