传播与 dcast答案

【问题标题】：Spread vs dcast传播与 dcast
【发布时间】：2016-05-15 11:36:13
【问题描述】：

我有一张这样的桌子，

> head(dt2)
  Weight Height   Fitted interval limit    value
1   65.6  174.0 71.91200     pred   lwr 53.73165
2   80.7  193.5 91.63237     pred   lwr 73.33198
3   72.6  186.5 84.55326     pred   lwr 66.31751
4   78.8  187.2 85.26117     pred   lwr 67.02004
5   74.8  181.5 79.49675     pred   lwr 61.29244
6   86.4  184.0 82.02501     pred   lwr 63.80652

我希望它有这样的，

> head(reshape2::dcast(dt2, 
         Weight + Height + Fitted + interval ~ limit, 
         fun.aggregate = mean))
  Weight Height   Fitted interval      lwr      upr
1   42.0  153.4 51.07920     conf 49.15463 53.00376
2   42.0  153.4 51.07920     pred 32.82122 69.33717
3   43.2  160.0 57.75378     conf 56.35240 59.15516
4   43.2  160.0 57.75378     pred 39.54352 75.96404
5   44.8  149.5 47.13512     conf 44.87642 49.39382
6   44.8  149.5 47.13512     pred 28.83891 65.43133

但是使用tidyr::spread，我该怎么做呢？

我正在使用，

> tidyr::spread(dt2, limit, value)

但得到错误，

Error: Duplicate identifiers for rows (1052, 1056), (238, 242), (1209, 1218), (395, 404), (839, 1170), (25, 356), (1173, 1203, 1215), (359, 389, 401), (1001, 1200), (187, 386), (906, 907), (92, 93), (930, 1144), (116, 330), (958, 1171), (144, 357), (902, 1018), (88, 204), (960, 1008), (146, 194), (1459, 1463), (645, 649), (1616, 1625), (802, 811), (1246, 1577), (432, 763), (1580, 1610, 1622), (766, 796, 808), (1408, 1607), (594, 793), (1313, 1314), (499, 500), (1337, 1551), (523, 737), (1365, 1578), (551, 764), (1309, 1425), (495, 611), (1367, 1415), (553, 601)

随机 10 行::

> dt[sample(nrow(dt), 10), ]
     Weight Height   Fitted interval limit    value
1253   52.2  162.5 60.28203     conf   upr 61.51087
426    49.1  158.8 56.54022     pred   upr 74.75756
1117   78.4  184.5 82.53066     conf   lwr 80.98778
1171   85.9  166.4 64.22611     conf   lwr 63.21254
948    61.4  177.8 75.75494     conf   lwr 74.66393
384    90.9  172.7 70.59731     pred   lwr 52.41828
289    75.9  172.7 70.59731     pred   lwr 52.41828
3      44.8  149.5 47.13512     pred   lwr 28.83891
774    87.3  182.9 80.91258     pred   upr 99.12445
772    86.4  175.3 73.22669     pred   upr 91.40919

【问题讨论】：

您的示例在limit 中不包含upr，在interval 中也不包含conf，这意味着您的预期结果不可重现
为什么不将其保存为长格式并进行汇总？请参阅here for an example 与基础 R、dplyr 和 data.table。
虽然我已经用 dcast 完成了，但我想用 tidyr 来完成它只是为了学习。 @mtoto 这只是我的数据集的一个头，我会编辑它给你一个随机样本，以便重现性。
这应该可以工作：dt2 %>% group_by(interval, limit) %>% summarise_each(funs(mean)) %>% spread(limit, value, -c(1:3))
按区间和限制汇总，只给了我两行。

标签： r reshape2 tidyr

【解决方案1】：

假设您从如下所示的数据开始：

mydf
#   Weight Height  Fitted interval limit    value
# 1     42  153.4 51.0792     conf   lwr 49.15463
# 2     42  153.4 51.0792     pred   lwr 32.82122
# 3     42  153.4 51.0792     conf   upr 53.00376
# 4     42  153.4 51.0792     pred   upr 69.33717
# 5     42  153.4 51.0792     conf   lwr 60.00000
# 6     42  153.4 51.0792     pred   lwr 90.00000

请注意分组列（1 到 5）的第 5 行和第 6 行中的重复项。这基本上就是“tidyr”告诉你的。第一行和第五行是重复的，第二行和第六行也是如此。

tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)

正如@Jaap 所建议的，解决方案是首先“汇总”数据。由于“tidyr”仅用于整形数据（与聚合和整形的“reshape2”不同），您需要在更改数据形式之前使用“dplyr”执行聚合。在这里，我在“值”列中使用summarise。

如果您在summarise 步骤停止执行，您会发现我们原来的 6 行数据集已“缩小”为 4 行。现在，spread 将按预期工作。

mydf %>% 
  group_by(Weight, Height, Fitted, interval, limit) %>% 
  summarise(value = mean(value)) %>% 
  spread(limit, value)
# Source: local data frame [2 x 6]
# 
#   Weight Height  Fitted interval      lwr      upr
#    (dbl)  (dbl)   (dbl)    (chr)    (dbl)    (dbl)
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

这将dcast 的预期输出与fun.aggregate = mean 匹配。

reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
#   Weight Height  Fitted interval      lwr      upr
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

样本数据：

 mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4, 
     153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,         
     51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",        
     "conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,             
     2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),            
         value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,          
         90)), .Names = c("Weight", "Height", "Fitted", "interval",     
     "limit", "value"), row.names = c(NA, 6L), class = "data.frame")

【讨论】：

谢谢！我在考虑如何处理聚合函数。我认为 Hadely 希望 tidyr 与 dplyr 一起使用。
这是一个很好的答案，让我理解了dcast 和spread 之间的区别。谢谢！

【解决方案2】：

这里是data.table 的替代dplyr。使用阿南达回答中的mydf。

library(data.table)
library(magrittr)
library(tidyr)

DT <- data.table(mydf)

首先，您可以使用by 计算每个限制的平均值。

DT[, .(lwr = mean(value[limit == "lwr"]), 
       upr = mean(value[limit == "upr"])), 
   by = .(Weight, Height, Fitted, interval)]

如果这个limit == ...看起来太硬编码，可以先聚合成长格式，再聚合成spread。这是有效的，因为一旦你聚合，就没有重复了。

DT[, .(value = mean(value)), by = .(Weight, Height, Fitted, interval, limit)] %>%
  spread(key = "limit", value = "value")

两者都能满足你

#   Weight Height  Fitted interval      lwr      upr
#1:     42  153.4 51.0792     conf 54.57731 53.00376
#2:     42  153.4 51.0792     pred 61.41061 69.33717

【讨论】：

谢谢，其实我说的是dplyr 和tidyr。我已经用reshape2 解决了这个问题，但我想知道如何使用这些特定的包来解决这个问题。还是谢谢！