dplyr manipulation rowwise grouping mutate: answers

【Question title】: dplyr manipulation rowwise grouping mutate
【Posted】: 2017-05-14 09:34:54
【Question】:

I have the following dataset:

x <- data.frame(Postcode = c(1, 2, 3, 4, 5, 6), 
                Latitude = c(3.1, 3.2, 3.3, 3.3, 3.4, 3.4),
                Longitude = c(100, 101, 102, 102, 103, 104),
                Exposure = c(1, 2, 3, 4, 5, 6))

I am trying to transform the data in x into:

x <- data.frame(Postcode = c(1, 2, 3, 4, 5, 6), 
                Latitude = c(3.1, 3.2, 3.3, 3.3, 3.4, 3.4),
                Longitude = c(100, 101, 102, 102, 103, 104),
                Exposure = c(1, 2, 3, 4, 5, 6),
                coords = c("3.1, 100", "3.2, 101", "3.3, 102", "3.3, 102",
                           "3.4, 103", "3.4, 104"),
                postcode = c("1", "2", "3,4", "3,4", "5", "6"),
                exposure = c(1, 2, 7, 7, 5, 6))

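The new column postcode pastes together the Postcode values that share the same Latitude and Longitude. coords pastes Latitude and Longitude together, and exposure sums the Exposure values that share the same coords, i.e. the same Latitude and Longitude.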

I can do this with the dplyr package and a for loop:

x <- mutate(x, coords = paste(Latitude, Longitude, sep = ", "))
x <- cbind(x, postcode = rep(0, nrow(x)), exposure = rep(0, nrow(x)))
for(i in unique(x$coords)){
  x$postcode[x$coords == i] <- paste(x$Postcode[x$coords == i], collapse = ", ")
  x$exposure[x$coords == i] <- sum(x$Exposure[x$coords == i])
}

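How can I do this using only dplyr, without the for loop? Any other method that is more efficient than a for loop would also be welcome, because my actual dataset is very large.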

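【Comments】:

The second dataset has columns of unequal length. Please update it.
@akrun I have edited it. Thanks for pointing that out.
This will get closed if it is not fixed: Error in data.frame(Postcode = c(0, 1, 2, 3, 4, 5, 6), Latitude = c(3.1, : arguments imply differing number of rows: 7, 6
@hrbrmstr Sorry, I have fixed it.

Tags: r dplyr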

【Solution 1】:

library(dplyr)
library(tidyr)  # unite() joins Latitude and Longitude into one column

x %>% 
  unite(coords, Latitude, Longitude, sep = ",", remove = FALSE) %>% 
  group_by(coords) %>% 
  # sum Exposure (not Postcode) within each coords group
  mutate(exposure = sum(Exposure), postcode = toString(Postcode))

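【Discussion】:

Hi, thank you very much for the solution. My data is actually stored in a SpatialPointsDataFrame object. I thought I could use this approach on it, but I have just realised that I cannot use group_by on the data inside a SpatialPointsDataFrame. Any suggestions?
I thought I could extract the data from the sp object first and then apply dplyr, but I noticed that this changes the object, so I cannot store it back into the sp object.
I have not actually worked much with sp objects. Could you share more details about the problem you are facing?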

【Solution 2】:

Here is an approach using dplyr:

library(dplyr)
x %>% 
     group_by(coords = paste(Latitude, Longitude, sep = ", ")) %>% 
     mutate(postcode = toString(Postcode), exposure = sum(Exposure))

# Source: local data frame [6 x 7]
# Groups: coords [5]
# 
#   Postcode Latitude Longitude Exposure   coords postcode exposure
#      <dbl>    <dbl>     <dbl>    <dbl>    <chr>    <chr>    <dbl>
# 1        1      3.1       100        1 3.1, 100        1        1
# 2        2      3.2       101        2 3.2, 101        2        2
# 3        3      3.3       102        3 3.3, 102     3, 4        7
# 4        4      3.3       102        4 3.3, 102     3, 4        7
# 5        5      3.4       103        5 3.4, 103        5        5
# 6        6      3.4       104        6 3.4, 104        6        6
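
A variation on the approach above, offered as a sketch rather than as the answerer's code: group on the two numeric columns directly instead of on a pasted string, and call ungroup() at the end so that later operations are not silently applied per group. The name `result` is only illustrative.

library(dplyr)

x <- data.frame(Postcode = c(1, 2, 3, 4, 5, 6),
                Latitude = c(3.1, 3.2, 3.3, 3.3, 3.4, 3.4),
                Longitude = c(100, 101, 102, 102, 103, 104),
                Exposure = c(1, 2, 3, 4, 5, 6))

result <- x %>%
  group_by(Latitude, Longitude) %>%            # one group per distinct location
  mutate(postcode = toString(Postcode),        # e.g. "3, 4" for the shared location
         exposure = sum(Exposure)) %>%         # total Exposure per location
  ungroup() %>%                                # drop the grouping before further work
  mutate(coords = paste(Latitude, Longitude, sep = ", "))

Because mutate() keeps every input row in its original order, `result` has the same six rows as x, with the three new columns appended.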

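【Discussion】:

Hi, thank you very much for the solution. My data is actually stored in a SpatialPointsDataFrame object. I thought I could use this approach on it, but I have just realised that I cannot use group_by on the data inside a SpatialPointsDataFrame. Any suggestions?
I thought I could extract the data from the sp object first and then apply dplyr, but I noticed that this changes the object, so I cannot store it back into the sp object.
Thanks anyway. I have only just learned that group_by can be used this way.
I do not know much about sp data frames either, so I cannot offer more useful information on that topic.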
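
One possible way around the SpatialPointsDataFrame issue raised above, as an untested-on-real-data sketch: work on the attribute table in the @data slot and assign the result back. This relies on the assumption that grouped mutate() keeps the rows in their original order and number, so the table stays aligned with the points. The object name `spdf` is hypothetical, and the sketch assumes the sp package and an object built from the question's x.

library(sp)
library(dplyr)

x <- data.frame(Postcode = c(1, 2, 3, 4, 5, 6),
                Latitude = c(3.1, 3.2, 3.3, 3.3, 3.4, 3.4),
                Longitude = c(100, 101, 102, 102, 103, 104),
                Exposure = c(1, 2, 3, 4, 5, 6))

# Build a hypothetical SpatialPointsDataFrame; this moves the two
# coordinate columns out of the attribute table and into the points.
spdf <- x
coordinates(spdf) <- ~ Longitude + Latitude

xy <- coordinates(spdf)  # matrix with columns "Longitude" and "Latitude"

new_data <- spdf@data %>%
  mutate(coords = paste(xy[, "Latitude"], xy[, "Longitude"], sep = ", ")) %>%
  group_by(coords) %>%
  mutate(postcode = toString(Postcode), exposure = sum(Exposure)) %>%
  ungroup()

# Store a plain data.frame back into the attribute slot
spdf@data <- as.data.frame(new_data)

Do not use summarise() or any step that reorders or drops rows here, since the attribute table would then no longer line up with the point geometry.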

【Solution 3】:

We can do this with data.table:

library(data.table)
setDT(x)[, coords := paste(Latitude, Longitude, sep="," )
  ][, c("exposure", "postcode") := .(sum(Exposure), toString(Postcode)), by = coords]  # sum Exposure, not Postcode
x
#   Postcode Latitude Longitude Exposure  coords exposure postcode
#1:        1      3.1       100        1 3.1,100        1        1
#2:        2      3.2       101        2 3.2,101        2        2
#3:        3      3.3       102        3 3.3,102        7     3, 4
#4:        4      3.3       102        4 3.3,102        7     3, 4
#5:        5      3.4       103        5 3.4,103        5        5
#6:        6      3.4       104        6 3.4,104        6        6

【Discussion】: