【Question title】:For Loop with Subset in R
【Posted】:2017-12-30 10:00:57
【Question】:

I have the following data in a csv file:

Date        Model      Color    Value   Samples
6/19/2017   Gold       Blue     0.5     500
6/19/2017   Gold       Red      0.0     449
6/19/2017   Silver     Blue     0.75    1320
6/19/2017   Silver     Blue     1.5     103
6/19/2017   Gold       Red      0.7     891
6/19/2017   Gold       Blue     0.41    18103
6/19/2017   Copper     Blue     0.83    564
6/19/2017   Silver     Pink     1.17    173
6/19/2017   Platinum   Brown    0.43    793
6/19/2017   Platinum   Red      0.71    1763
6/19/2017   Gold       Orange   1.92    503

I use the fread function to create a data.table:

library(dplyr)
library(data.table)

df <- fread("test_data.csv", 
                 header = TRUE,
                 fill = TRUE,
                 sep = ",")

Then I subset the data by Model, as follows:

df_subset <- subset(df, df$Model=='Gold' & df$Value > 0)

Then I compute some percentiles grouped by the Color variable, as follows:

df_subset[, .(Samples = sum(Samples),
    '50th'    = quantile(AvgValue, probs = c(0.50)),
    '99th'    = quantile(AvgValue, probs = c(0.99)),
    '99.9th'  = quantile(AvgValue, probs = c(0.999)), 
    '99.99th' = quantile(AvgValue, probs = c(0.9999))),
by = Color]

This gives the following output:

    Color Samples  50th   99th  99.9th  99.99th
1:   Blue   18603 0.455 0.4991 0.49991 0.499991
2:    Red    1340 0.975 1.2445 1.24945 1.249945
3: Orange     503 1.920 1.9200 1.92000 1.920000

I am trying to loop over the list of Model values and output the corresponding percentiles for each Model value.

I tried the following (but it failed):

models <- unique(df$Model)

for (model in models){

  df$model[, .(Samples = sum(Samples),
                '50th'    = quantile(Value, probs = c(0.50)),
                '99th'    = quantile(Value, probs = c(0.99)),
                '99.9th'  = quantile(Value, probs = c(0.999)), 
                '99.99th' = quantile(Value, probs = c(0.9999))),
            by = Color]
}

The error message is:

Error in .(Samples = sum(Samples), `50th` = quantile(Value, probs = c(0.5)),  :  could not find function "."

【Comments】:

  • dplyr package: group_by, mutate
  • What is AvgValue?

Tags: r for-loop dataframe data.table dplyr


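【Solution 1】:

fread creates a data.table object rather than a plain data frame, so I suggest sticking to data.table syntax and not mixing it with dplyr. There is also no need for a for loop: passing a list of two variables to the by argument groups by both Model and Color in a single expression: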

qs = df[Value > 0, .(Samples = sum(Samples),
              '50th'    = quantile(Value, probs = c(0.50)),
              '99th'    = quantile(Value, probs = c(0.99)),
              '99.9th'  = quantile(Value, probs = c(0.999)), 
              '99.99th' = quantile(Value, probs = c(0.9999))),
          by = .(Model, Color)]
setkey(qs, 'Model')

#       Model  Color Samples  50th   99th  99.9th  99.99th
# 1:   Copper   Blue     564 0.830 0.8300 0.83000 0.830000
# 2:     Gold   Blue   18603 0.455 0.4991 0.49991 0.499991
# 3:     Gold    Red     891 0.700 0.7000 0.70000 0.700000
# 4:     Gold Orange     503 1.920 1.9200 1.92000 1.920000
# 5: Platinum  Brown     793 0.430 0.4300 0.43000 0.430000
# 6: Platinum    Red    1763 0.710 0.7100 0.71000 0.710000
# 7:   Silver   Blue    1423 1.125 1.4925 1.49925 1.499925
# 8:   Silver   Pink     173 1.170 1.1700 1.17000 1.170000
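
If an explicit loop is still wanted, here is a minimal sketch of how the loop from the question could be repaired. The original most likely fails because df$model looks for a column literally named "model" (the column is "Model", and $ does not evaluate the loop variable), so it returns NULL. Base R's `[` is then applied to NULL, and .() is only defined inside data.table's `[`. The list name results and the loop variable m are illustrative choices, not part of the original code, and only two of the four percentiles are shown for brevity.

```r
library(data.table)

# Sample rows copied from the question (Date column omitted for brevity)
df <- data.table(
  Model   = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
              "Copper", "Silver", "Platinum", "Platinum", "Gold"),
  Color   = c("Blue", "Red", "Blue", "Blue", "Red", "Blue",
              "Blue", "Pink", "Brown", "Red", "Orange"),
  Value   = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92),
  Samples = c(500L, 449L, 1320L, 103L, 891L, 18103L, 564L, 173L,
              793L, 1763L, 503L)
)

# `results` is a hypothetical name: one summary table per model
results <- list()
for (m in unique(df$Model)) {
  # Filter rows inside `[` instead of using df$model, so that the
  # data.table method (where .() is valid) handles the call
  results[[m]] <- df[Model == m & Value > 0,
                     .(Samples = sum(Samples),
                       `50th`  = quantile(Value, probs = 0.50),
                       `99th`  = quantile(Value, probs = 0.99)),
                     by = Color]
}

print(results[["Gold"]])
```

Each element of results can then be printed or written out separately, which is closer to what the question asks for than one combined table.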

【Comments】:

    【Solution 2】:

    Using your definitions, you could try the following:

    library(data.table)
    df<-fread("~/theData.csv")
    df$Value<-as.numeric(df$Value)
    result<-data.frame()
    for (i in seq_along(unique(df$Model))){
      temp <- subset(df, df$Model==unique(df$Model)[i] & df$Value > 0)
      temp<-temp[, .(Samples = sum(Samples),
      '50th'    = quantile(Value, probs = c(0.50)),
      '99th'    = quantile(Value, probs = c(0.99)),
      '99.9th'  = quantile(Value, probs = c(0.999)), 
      '99.99th' = quantile(Value, probs = c(0.9999))),
       by = Color]
      temp$model<-unique(df$Model)[i]
      result<-rbind(result, temp)
    }
    rm(temp)
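
As a possible variation on the loop above, the per-model tables can be collected in a list and stacked once with rbindlist, which avoids growing result with rbind on every iteration. This is only a sketch: the names per_model and models are illustrative, the sample rows are retyped from the question, and only two percentiles are computed.

```r
library(data.table)

# Sample rows copied from the question (Date column omitted for brevity)
df <- data.table(
  Model   = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
              "Copper", "Silver", "Platinum", "Platinum", "Gold"),
  Color   = c("Blue", "Red", "Blue", "Blue", "Red", "Blue",
              "Blue", "Pink", "Brown", "Red", "Orange"),
  Value   = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92),
  Samples = c(500L, 449L, 1320L, 103L, 891L, 18103L, 564L, 173L,
              793L, 1763L, 503L)
)

models <- unique(df$Model)

# Build one summary table per model (hypothetical helper list `per_model`)
per_model <- lapply(models, function(m) {
  df[Model == m & Value > 0,
     .(Samples = sum(Samples),
       `50th`  = quantile(Value, probs = 0.50),
       `99th`  = quantile(Value, probs = 0.99)),
     by = Color]
})
names(per_model) <- models

# Stack the tables; idcol turns the list names into a Model column
result <- rbindlist(per_model, idcol = "Model")
print(result)
```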
    

    【Comments】:

      【Solution 3】:

      This may solve your problem:

      library(dplyr)
      
      df [,-1] %>% filter(Value > 0) %>% group_by(Model, Color) %>% 
              do(data.frame(t(quantile(.$Value, probs = c(0.50, 0.99, 0.999, 0.9999))))) 
      

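      Regarding your question in the comments about how to attach the sample sums: you can use aggregate. The reason I don't use dplyr::summarise is that I would need to restart the pipeline after applying do, which doesn't make sense.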

      data.frame(df %>% filter(Value > 0) %>% select(-Date) %>% group_by(Model, Color) %>% 
                    do(data.frame(t(quantile(.$Value, probs = c(0.50, 0.99, 0.999, 0.9999))))),
                 aggregate(Samples ~ Color+Model, df, sum)["Samples"])
      
      #      Model  Color  X50.   X99.  X99.9.  X99.99. Samples 
      # 1   Copper   Blue 0.830 0.8300 0.83000 0.830000     564 
      # 2     Gold   Blue 0.455 0.4991 0.49991 0.499991   18603 
      # 3     Gold Orange 1.920 1.9200 1.92000 1.920000     503 
      # 4     Gold    Red 0.700 0.7000 0.70000 0.700000    1340 
      # 5 Platinum  Brown 0.430 0.4300 0.43000 0.430000     793 
      # 6 Platinum    Red 0.710 0.7100 0.71000 0.710000    1763 
      # 7   Silver   Blue 1.125 1.4925 1.49925 1.499925    1423 
      # 8   Silver   Pink 1.170 1.1700 1.17000 1.170000     173
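
For comparison, here is a sketch that computes the sample sum and the percentiles in a single summarise call, without do or aggregate. It assumes each quantile is requested separately so that every summary column has length one per group. The column names p50 and p99 are illustrative, and only two percentiles are shown. Note one difference from the aggregate version above: Samples is summed after the Value > 0 filter here, so Gold/Red gives 891 rather than 1340.

```r
library(dplyr)

# Sample rows copied from the question (Date column omitted for brevity)
df <- data.frame(
  Model   = c("Gold", "Gold", "Silver", "Silver", "Gold", "Gold",
              "Copper", "Silver", "Platinum", "Platinum", "Gold"),
  Color   = c("Blue", "Red", "Blue", "Blue", "Red", "Blue",
              "Blue", "Pink", "Brown", "Red", "Orange"),
  Value   = c(0.5, 0, 0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92),
  Samples = c(500L, 449L, 1320L, 103L, 891L, 18103L, 564L, 173L,
              793L, 1763L, 503L),
  stringsAsFactors = FALSE
)

res <- df %>%
  filter(Value > 0) %>%                  # drop zero-valued rows first
  group_by(Model, Color) %>%
  summarise(Samples = sum(Samples),      # sample count per group
            p50     = quantile(Value, probs = 0.50),
            p99     = quantile(Value, probs = 0.99)) %>%
  ungroup()

print(res)
```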
      

      Data:

      df <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
      1L, 1L, 1L, 1L), .Label = "6/19/2017", class = "factor"), Model = structure(c(2L, 
      2L, 4L, 4L, 2L, 2L, 1L, 4L, 3L, 3L, 2L), .Label = c("Copper", 
      "Gold", "Platinum", "Silver"), class = "factor"), Color = structure( 
      c(1L,5L, 1L, 1L, 5L, 1L, 1L, 4L, 2L, 5L, 3L), .Label = c("Blue", "Brown", 
      "Orange", "Pink", "Red"), class = "factor"), Value = c(0.5, 0, 
      0.75, 1.5, 0.7, 0.41, 0.83, 1.17, 0.43, 0.71, 1.92), Samples = c(500L, 
      449L, 1320L, 103L, 891L, 18103L, 564L, 173L, 793L, 1763L, 503L)), 
      .Names = c("Date", "Model", "Color", "Value", "Samples"), 
      class = "data.frame", row.names = c(NA, -11L)) 
      

      【Comments】:

      • How can that code be modified to output Samples as well? Thanks.
      • @equanimity If you are still interested, please see the update.