[Question title]: Breaking down percentile data by date using read.table in R
[Posted]: 2020-02-05 18:17:43
[Question description]:

I have the following toy dataset:

library(data.table)

# The timestamps are single-quoted so that read.table keeps
# "date time AM/PM" as one field instead of splitting it on spaces
dt <- read.table(text = "
Date                      Model      Color    Value   Samples
'1/29/2020 6:51:19 AM'    Gold       Blue     0.5     500
'1/29/2020 7:57:47 AM'    Gold       Red      0.0     449
'1/29/2020 3:39:04 PM'    Silver     Blue     0.75    1320
'1/29/2020 5:04:32 PM'    Silver     Blue     1.5     103
'1/29/2020 10:32:39 AM'   Gold       Red      0.7     891
'1/30/2020 1:02:12 AM'    Gold       Blue     0.41    18103
'1/30/2020 4:30:00 AM'    Copper     Blue     0.83    564
'1/30/2020 9:09:45 AM'    Silver     Pink     1.17    173
'1/30/2020 2:19:30 PM'    Platinum   Brown    0.43    793
'1/30/2020 4:43:32 PM'    Platinum   Red      0.71    1763
'1/30/2020 7:19:00 PM'    Gold       Orange   1.92    503",
                 header = TRUE, stringsAsFactors = FALSE)
setDT(dt)  # convert to a data.table so the dt[i, j, by] syntax below works

I then take this data.table and generate some percentile data, as follows:

qs = dt[Value > 0, .(Samples = sum(Samples),
                     '50th'    = quantile(Value, probs = c(0.50)),
                     '75th'    = quantile(Value, probs = c(0.75)),
                     '90th'    = quantile(Value, probs = c(0.90)), 
                     '99th'    = quantile(Value, probs = c(0.99))),
        by = .(Model, Color)]
setkey(qs, 'Model')

Finally, I write the results to a .csv file:

#outputs to csv file

write.csv(qs, file = "outfile.csv")

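Question: how would I write out the results so that:

a) the results are broken down by date (i.e. taking only the date, e.g. 1/29/2020 and 1/30/2020 in the toy data, without the time)
b) the dates are written as rows

For example (note: the values below are toy data, not real calculations; I only want to show how the "Date" column should be represented):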

#       Model  Color Samples  50th   99th  99.9th  99.99th Date
# 1:   Copper   Blue     564 0.830 0.8300 0.83000 0.830000 01/29/2020
# 2:     Gold   Blue   18603 0.455 0.4991 0.49991 0.499991 01/29/2020
# 3:     Gold    Red     891 0.700 0.7000 0.70000 0.700000 01/29/2020
# 4:     Gold Orange     503 1.920 1.9200 1.92000 1.920000 01/29/2020
# 5: Platinum  Brown     793 0.430 0.4300 0.43000 0.430000 01/29/2020
# 6: Platinum    Red    1763 0.710 0.7100 0.71000 0.710000 01/29/2020
# 7:   Silver   Blue    1423 1.125 1.4925 1.49925 1.499925 01/29/2020
# 8:   Silver   Pink     173 1.170 1.1700 1.17000 1.170000 01/29/2020
# 9:   Copper   Blue     564 0.830 0.8300 0.83000 0.830000 01/30/2020
#10:     Gold   Blue   18603 0.455 0.4991 0.49991 0.499991 01/30/2020
#11:     Gold    Red     891 0.700 0.7000 0.70000 0.700000 01/30/2020
#12:     Gold Orange     503 1.920 1.9200 1.92000 1.920000 01/30/2020
#13: Platinum  Brown     793 0.430 0.4300 0.43000 0.430000 01/30/2020
#14: Platinum    Red    1763 0.710 0.7100 0.71000 0.710000 01/30/2020
#15:   Silver   Blue    1423 1.125 1.4925 1.49925 1.499925 01/30/2020
#16:   Silver   Pink     173 1.170 1.1700 1.17000 1.170000 01/30/2020

Thanks!

[Question comments]:

    Tags: r dataframe data.table


    [Solution 1]:

    If we need to create the columns in the original dataset, use :=

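    library(data.table)  # provides setDT() and :=
    library(lubridate)   # provides mdy_hms()
    # Add the summary columns by reference, grouped by Model, Color
    # and the date part of the Date string
    setDT(dt)[Value > 0,  c("Samples", '50th', '75th', '90th', '99th') := 
        c(list(sum(Samples)), as.list(quantile(Value,
          probs = c(0.50, 0.75, 0.90, 0.99)))),
            .(Model, Color, DateNoTime = as.Date(mdy_hms(Date)) )]
    dt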
    #                     Date    Model  Color Value Samples  50th   75th  90th   99th
    # 1:  1/29/2020 6:51:19 AM     Gold   Blue  0.50     500 0.500 0.5000 0.500 0.5000
    # 2:  1/29/2020 7:57:47 AM     Gold    Red  0.00     449    NA     NA    NA     NA
    # 3:  1/29/2020 3:39:04 PM   Silver   Blue  0.75    1423 1.125 1.3125 1.425 1.4925
    # 4:  1/29/2020 5:04:32 PM   Silver   Blue  1.50    1423 1.125 1.3125 1.425 1.4925
    # 5: 1/29/2020 10:32:39 AM     Gold    Red  0.70     891 0.700 0.7000 0.700 0.7000
    # 6:  1/30/2020 1:02:12 AM     Gold   Blue  0.41   18103 0.410 0.4100 0.410 0.4100
    # 7:  1/30/2020 4:30:00 AM   Copper   Blue  0.83     564 0.830 0.8300 0.830 0.8300
    # 8:  1/30/2020 9:09:45 AM   Silver   Pink  1.17     173 1.170 1.1700 1.170 1.1700
    # 9:  1/30/2020 2:19:30 PM Platinum  Brown  0.43     793 0.430 0.4300 0.430 0.4300
    #10:  1/30/2020 4:43:32 PM Platinum    Red  0.71    1763 0.710 0.7100 0.710 0.7100
    #11:  1/30/2020 7:19:00 PM     Gold Orange  1.92     503 1.920 1.9200 1.920 1.9200
    

    For these new columns, this also leaves NA in the rows that have Value <= 0.
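
The NA behaviour can be seen on a much smaller table. This is only a minimal sketch: `demo` and the `Median` column are hypothetical names, not part of the original answer.

```r
library(data.table)

# Hypothetical three-row table, not the toy data from the question
demo <- data.table(Model = c("Gold", "Gold", "Silver"),
                   Value = c(0.5, 0.0, 1.5))

# Assign a per-group median only on the rows that pass the filter in i
demo[Value > 0, Median := quantile(Value, probs = 0.5, names = FALSE), by = Model]

# The filtered-out row (Value == 0) is never assigned, so it holds NA
print(demo)
```

Because := only writes to the rows selected in i, any row excluded by the filter keeps NA in a newly created column.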


    However, if the intent is to fill all rows with the summarised values, create 'qs' by including the date part in by, and then do a join:

    qs <- setDT(dt)[Value > 0, .(Samples = sum(Samples),
                         '50th'    = quantile(Value, probs = c(0.50)),
                         '75th'    = quantile(Value, probs = c(0.75)),
                         '90th'    = quantile(Value, probs = c(0.90)), 
                         '99th'    = quantile(Value, probs = c(0.99))),
            by = .(Model, Color,
              DateNoTime = format(as.Date(mdy_hms(Date)), "%m/%d/%Y") )]
    
    
    
    qs[dt, on = .(Model, Color)]
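
The join above matches on Model and Color only, so a Model/Color pair that occurs on both dates can match more than one summary row. A possible variant, sketched here on a small hypothetical table (`dt2`, `qs2`, `joined` and the output file name are made up for illustration), adds the date-only column to the input and joins on it as well. It uses base R's as.Date() instead of lubridate to keep the sketch short; that is a substitution, not what the answer uses.

```r
library(data.table)

# Small hypothetical input in the same shape as the toy data
dt2 <- data.table(
  Date    = c("1/29/2020 3:39:04 PM", "1/29/2020 5:04:32 PM", "1/30/2020 9:09:45 AM"),
  Model   = c("Silver", "Silver", "Silver"),
  Color   = c("Blue", "Blue", "Blue"),
  Value   = c(0.75, 1.5, 1.17),
  Samples = c(1320L, 103L, 173L)
)

# Date-only key; as.Date() stops parsing after the format is consumed,
# so the trailing time text is ignored
dt2[, DateNoTime := as.Date(Date, format = "%m/%d/%Y")]

# One summary row per Model / Color / date
qs2 <- dt2[Value > 0,
           .(Samples = sum(Samples),
             `50th`  = quantile(Value, probs = 0.50, names = FALSE),
             `99th`  = quantile(Value, probs = 0.99, names = FALSE)),
           by = .(Model, Color, DateNoTime)]

# Join on the date too, so each input row matches exactly one summary row
joined <- qs2[dt2, on = .(Model, Color, DateNoTime)]

# Write the per-date summary (dates as rows); the file name is a placeholder
fwrite(qs2, file = file.path(tempdir(), "outfile_by_date.csv"))
```

Here qs2 is already in the shape the question asks for: one row per Model, Color and date, with the date in its own column.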
    

    If we don't want to include the date in by and only need it in the output:

    setDT(dt)[, DateNoTime := as.Date(mdy_hms(Date))
         ][Value > 0,  c("Samples", '50th', '75th', '90th', '99th') := 
        c(list(sum(Samples)), as.list(quantile(Value,
          probs = c(0.50, 0.75, 0.90, 0.99)))),
            .(Model, Color)]
    dt
    #                     Date    Model  Color Value Samples DateNoTime  50th   75th  90th   99th
    # 1:  1/29/2020 6:51:19 AM     Gold   Blue  0.50   18603 2020-01-29 0.455 0.4775 0.491 0.4991
    # 2:  1/29/2020 7:57:47 AM     Gold    Red  0.00     449 2020-01-29    NA     NA    NA     NA
    # 3:  1/29/2020 3:39:04 PM   Silver   Blue  0.75    1423 2020-01-29 1.125 1.3125 1.425 1.4925
    # 4:  1/29/2020 5:04:32 PM   Silver   Blue  1.50    1423 2020-01-29 1.125 1.3125 1.425 1.4925
    # 5: 1/29/2020 10:32:39 AM     Gold    Red  0.70     891 2020-01-29 0.700 0.7000 0.700 0.7000
    # 6:  1/30/2020 1:02:12 AM     Gold   Blue  0.41   18603 2020-01-30 0.455 0.4775 0.491 0.4991
    # 7:  1/30/2020 4:30:00 AM   Copper   Blue  0.83     564 2020-01-30 0.830 0.8300 0.830 0.8300
    # 8:  1/30/2020 9:09:45 AM   Silver   Pink  1.17     173 2020-01-30 1.170 1.1700 1.170 1.1700
    # 9:  1/30/2020 2:19:30 PM Platinum  Brown  0.43     793 2020-01-30 0.430 0.4300 0.430 0.4300
    #10:  1/30/2020 4:43:32 PM Platinum    Red  0.71    1763 2020-01-30 0.710 0.7100 0.710 0.7100
    #11:  1/30/2020 7:19:00 PM     Gold Orange  1.92     503 2020-01-30 1.920 1.9200 1.920 1.9200
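
The DateNoTime column above prints in ISO form (2020-01-29), while the example in the question shows 01/29/2020. A minimal sketch of the conversion in base R follows; `stamps`, `d` and `formatted` are illustrative names, and base as.Date() is used here as an alternative to mdy_hms().

```r
# Two timestamps in the same text format as the Date column
stamps <- c("1/29/2020 6:51:19 AM", "1/30/2020 7:19:00 PM")

# Parse only the date part; the trailing time text is ignored
d <- as.Date(stamps, format = "%m/%d/%Y")

# Turn the Date values back into mm/dd/yyyy character strings
formatted <- format(d, "%m/%d/%Y")

print(d)          # [1] "2020-01-29" "2020-01-30"
print(formatted)  # [1] "01/29/2020" "01/30/2020"
```

Keeping the column as a Date until the final write is usually safer, since mm/dd/yyyy strings do not sort chronologically across years.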
    

    Data

    dt <- structure(list(Date = c("1/29/2020 6:51:19 AM", "1/29/2020 7:57:47 AM", 
    "1/29/2020 3:39:04 PM", "1/29/2020 5:04:32 PM", "1/29/2020 10:32:39 AM", 
    "1/30/2020 1:02:12 AM", "1/30/2020 4:30:00 AM", "1/30/2020 9:09:45 AM", 
    "1/30/2020 2:19:30 PM", "1/30/2020 4:43:32 PM", "1/30/2020 7:19:00 PM"
    ), Model = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold", 
    "Copper", "Silver", "Platinum", "Platinum", "Gold"), Color = c("Blue", 
    "Red", "Blue", "Blue", "Red", "Blue", "Blue", "Pink", "Brown", 
    "Red", "Orange"), Value = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 
    1.17, 0.43, 0.71, 1.92), Samples = c(500L, 449L, 1320L, 103L, 
    891L, 18103L, 564L, 173L, 793L, 1763L, 503L)), 
    class = "data.frame", row.names = c(NA, 
    -11L))
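
The structure() call above builds a plain data.frame, which is why the answer wraps it in setDT() before using data.table syntax. A short sketch of the two usual conversions, on a hypothetical two-row frame (`df` and `copy_dt` are illustrative names):

```r
library(data.table)

# Hypothetical plain data.frame
df <- data.frame(Model = c("Gold", "Silver"), Value = c(0.5, 1.5))
was_dt_before <- is.data.table(df)   # FALSE: still a plain data.frame

# as.data.table() returns a converted copy and leaves df untouched
copy_dt <- as.data.table(df)

# setDT() converts df itself, by reference, without copying
setDT(df)
is_dt_after <- is.data.table(df)     # TRUE
```

Both objects on either side of a join such as qs[dt, on = ...] need to be data.tables for the on = argument to be interpreted by data.table.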
    

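    [Comments]:

    • Referring to the toy dataset, the intent is to compute the percentile data only for the two dates (ignoring the many different times of day within those two dates).
    • @equanimity I updated the post. Please check whether this solves the problem.
    • When I create the 'qs' data table and use your code, I see the following error: "Error in [.data.table(qs, dt, on = .(Model, Color)): column(s) [Model,Color] not found in x"
    • @equanimity I am assuming that qs is a data.table
    • That was my mistake. I meant to mark your answer as accepted (which I have now done). I also upvoted it :).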