Using group_by, nesting(), complete() and computing time intervals over a huge data set in R

【Question title】: Using group_by, nesting(), complete() and compute time interval over a huge data set in R
【Posted】: 2020-10-02 19:49:00
【Question】:

I have been struggling with my data for quite a while and I can't figure out how to solve my problem. I work with nutrition data, which can be mimicked by this data set:

library(tidyverse)
library(lubridate)

# Used for data generation
groupFunction <- function(cat){
  case_when(
  cat == "apple" ~ "food",
  cat == "bread" ~ "food",
  cat == "cheese" ~ "food",
  cat == "chocolate" ~ "candy",
  cat == "water" ~ "drink",
  cat == "tea" ~ "drink"
  )
}

# Generate the data

set.seed(0)

fakeData <- tibble(
  id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
  eaten_at = sample(seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/01/05'), by="15 min"), 40),
  category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
  group = groupFunction(category), 
  amount = sample(10:100, 40)
)

# For every id, for each day, every hour and each category: sum the eaten amount,
# and keep zero amounts so they are accounted for in the mean calculation in step 2!
# PROBLEM: we lose the time intervals in which a given id didn't eat anything, which will
# bias the mean calculation in step 2!
step1 <- fakeData %>%
  mutate(eaten_at_hour = hour(eaten_at)) %>%
  group_by(id, eaten_at, eaten_at_hour, category, group) %>%
  summarise(eaten_amount = sum(amount)) %>%
  ungroup() %>%
  complete(nesting(id, eaten_at, eaten_at_hour), 
           nesting(category, group), 
           fill = list(eaten_amount = 0)) 

# For every id, take the mean over the days of the eaten amount for every hour interval.
# As before, keep zero counts so they are accounted for in the mean calculation in step 3!
step2 <- step1 %>%
  group_by(id, eaten_at_hour, category, group) %>%
  summarise(mean_per_id = mean(eaten_amount)) %>%
  ungroup() %>%
  complete(nesting(id, eaten_at_hour),
           nesting(category, group),
           fill = list(mean_per_id = 0))

# Mean over all id 
step3 <- step2 %>%
  group_by(eaten_at_hour, category, group) %>%
  summarise(mean_for_all = mean(mean_per_id)) %>%
  ungroup() 

# Plot the data
ggplot(step3, aes(x=eaten_at_hour, y=category, color = mean_for_all, shape = group)) +
  geom_point( size = 3) + 
  scale_color_gradient(low="blue", high="red", "Mean eaten\namount [g]")

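What I want to build is a plot with 1-hour time intervals on the x axis and the different food categories on the y axis, showing the mean amount eaten over all ids during each X-minute period across 24 hours (i.e. the time interval must be flexible). I would like a plot that looks like this: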

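My idea is to calculate:

For every id,
for each day on which this id ate something,
for each X-hour time interval (even if the id did not eat anything),
for each food category:

-> sum the eaten amount

Then:

For every id,
for each category,
for each 1-hour interval over the participation period:

-> mean of the eaten amount

Then:

-> mean over all ids, so that we get the mean eaten amount for each category and each 1-hour interval over 24 hours

To do so, I used the group_by(), nesting() and complete() functions. But I have 3 problems:

I would like to be able to set the desired time interval, which could be 15 minutes as well as 2 hours. I haven't found any solution for this.
I need all time intervals for all ids even when they did not eat anything (so amount = 0), because when I take the mean across days or across ids, the mean will be biased if I don't include the zero counts.
My real data set contains about 100k rows, so I think my approach is not the most appropriate in terms of efficiency. Moreover, I would like to design a Shiny app for these data in which the user could, for example, set the time interval manually, which means the plot has to be recomputed again and again (a lot of work for the computer when the code is not efficient...).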
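
To illustrate the second point, here is a minimal sketch of how missing zero rows change a mean. The `toy` data and its column names are hypothetical and only serve the illustration; they are not part of the real data set.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: id "A" ate in hours 8 and 12, id "B" only in hour 8
toy <- tibble(
  id     = c("A", "A", "B"),
  hour   = c(8, 12, 8),
  amount = c(50, 30, 70)
)

# Without zero counts: hour 12 is averaged over id "A" only
mean_without_zeros <- toy %>%
  group_by(hour) %>%
  summarise(mean_amount = mean(amount), .groups = "drop")

# With zero counts: complete() adds the missing (B, 12) row with amount = 0,
# so hour 12 is averaged over both ids
mean_with_zeros <- toy %>%
  complete(id, hour, fill = list(amount = 0)) %>%
  group_by(hour) %>%
  summarise(mean_amount = mean(amount), .groups = "drop")
```

In this toy case the mean for hour 12 drops from 30 to 15 once the implicit zero for id "B" is made explicit, while hour 8 is unchanged.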

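I know my question is very specific to my problem, but as I am really stuck, I would greatly appreciate any help/input/ideas on one or more of my problems. Thanks a lot!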

【Comments】:

"I used the group_by(), nesting() and complete() functions." Could you share this code with us?
Thanks for the edit, Dan. I added my code and tried to clarify my question.

Tags: r dplyr lubridate

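【Answer 1】:

I'm not sure I fully understand your problem, but here is a draft answer.

First, a tricky way to group by time interval is to divide the hour (obtained with lubridate::hour) by the step size, round the result down with floor(), and then multiply it by the step size again. Then, I group by id, hour and group to compute the sum, and after that only by hour and group to compute the mean.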

eaten_n_hours = 2
df = fakeData %>% 
    mutate(hour = floor(hour(eaten_at)/eaten_n_hours)*eaten_n_hours) %>% 
    group_by(id, hour, group) %>% 
    summarise(amount = sum(amount, na.rm=TRUE)) %>% 
    group_by(hour, group) %>%
    summarise(amount_m = mean(amount, na.rm=TRUE),
              amount_sd = sd(amount, na.rm=TRUE)) %>%
    identity()
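
To make the binning arithmetic concrete, here is a small sketch that applies the same floor-and-multiply trick to all 24 hours of a day. The step size of 3 is an arbitrary value chosen for illustration.

```r
# Hypothetical bin width in hours, chosen only for this illustration
step <- 3
hours <- 0:23

# Each hour is mapped to the first hour of the bin it falls into,
# e.g. hours 3, 4 and 5 all map to 3
bin_start <- floor(hours / step) * step

# Number of hours that fall into each bin
hours_per_bin <- table(bin_start)
```

With a step that divides 24 evenly, every bin covers the same number of hours, so the bins are directly comparable.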

Then, you can plot the whole thing like this:

breaks_hour = seq(min(df$hour), max(df$hour)+1, eaten_n_hours)
ggplot(df, aes(x=hour, y=amount_m, group=group, color=group, fill=group))+
    geom_col(position="dodge") +
    # geom_errorbar(aes(ymin=amount_m-amount_sd, ymax=amount_m+amount_sd), position="dodge") +
    scale_x_binned(breaks=breaks_hour)

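This is not the prettiest plot ever, but I'm not sure whether that is due to my misunderstanding of the problem or to the example fakeData.

EDIT

I'm not familiar with tiles, but you could try using geom_tile this way. Also, using scales::breaks_width allows for flexible time intervals.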

ggplot(df, aes(x=hour, y=group, fill=amount_m))+
    geom_tile()+
    scale_x_binned(breaks=scales::breaks_width(3)) # try other values

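【Comments】:

Thanks for the input. It helps, but my problem is still not solved: how to get flexible time intervals, and how to keep the zero counts for the mean calculation.
@ludo I'm not familiar with tiles; I tried something, but I'm afraid it won't be enough. Good luck!
Thanks for your help. My question is clearer now :)
Yes, tiles look promising, I'll have a look!

【Answer 2】:

So I figured out a way to do this (thanks to @Dan Chaltiel). It is certainly not perfect, but I'm posting it here so that it can be useful to others and/or for discussion: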


library(tidyverse)
library(lubridate)

# Used for data generation
groupFunction <- function(cat){
  case_when(
    cat == "apple" ~ "food",
    cat == "bread" ~ "food",
    cat == "cheese" ~ "food",
    cat == "chocolate" ~ "candy",
    cat == "water" ~ "drink",
    cat == "tea" ~ "drink"
  )
}

# Generate the data
set.seed(0)

fakeData <- tibble(
  id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
  eaten_at = sample(seq(as.POSIXct('2020/01/01 22:00:00'), as.POSIXct('2020/01/05'), by="17 min"), 40),
  category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
  group = groupFunction(category), 
  amount = sample(10:100, 40)
)

# Set time interval in minutes here (0-60 min only): 
set_time <- 60

# Generate the time sequence for one day (1440 minutes) with the desired interval. Its labels are used as factor levels below.
timeLevels <- seq(from = as.POSIXct("2020-1-1 0:00"), by = paste(set_time, "min", sep = " "), length.out = 1440/set_time)
timeLevels <- paste(hour(timeLevels), minute(timeLevels), sep = ":")

# Calculate the means, keeping zero counts
toPlot <- fakeData %>%
  mutate(eaten_hour = floor_date(eaten_at, unit = paste(set_time, "min", sep = " ")), 
         eaten_hour = paste(hour(eaten_hour), minute(eaten_hour), sep = ":"),
         eaten_hour = factor(eaten_hour, levels = timeLevels),
         eaten_date = date(eaten_at)) %>%
  group_by(eaten_date, eaten_hour, category, group) %>%
  summarise(sum_amount = sum(amount)) %>%
  ungroup() %>%
  complete(eaten_date, eaten_hour, nesting(category, group), fill = list(sum_amount = 0)) %>%
  group_by(eaten_hour, category, group) %>%
  summarise(mean_amount = mean(sum_amount)) %>% 
  ungroup()

# Plot the data
gg <- ggplot(toPlot, aes(x=eaten_hour, y=category, fill=mean_amount))
gg <- gg + geom_tile(color="white", size=0.1)
gg <- gg + coord_equal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
gg <- gg + labs(x = "Time of the day", y = NULL, title = "Mean eaten quantity over one day", fill = "Mean amount [g]")
gg

The output looks like this:

Still open to any input on how to improve my code!
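
One possible refinement, sketched here under assumptions, is to bin on the minute of the day so that intervals longer than 60 minutes also work, and to keep `id` in the grouping so that the mean is taken first per id and then across ids, as described in the question. The helper `mean_by_interval()` and its argument `interval_min` are hypothetical names; the sketch assumes that `interval_min` divides 1440 evenly and that the data has the same columns as `fakeData`.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical helper, not part of the original answer.
# interval_min: bin width in minutes, assumed to divide 1440 evenly.
mean_by_interval <- function(data, interval_min = 60) {
  # Start of every bin, in minutes since midnight
  bin_starts <- seq(0, 1440 - interval_min, by = interval_min)

  data %>%
    mutate(
      eaten_date    = as_date(eaten_at),
      minute_of_day = hour(eaten_at) * 60 + minute(eaten_at),
      # Factor with all bins as levels, so complete() also creates the empty ones
      bin = factor(floor(minute_of_day / interval_min) * interval_min,
                   levels = bin_starts)
    ) %>%
    # Step 1: total amount per id, day, bin and category
    group_by(id, eaten_date, bin, category, group) %>%
    summarise(sum_amount = sum(amount), .groups = "drop") %>%
    # Only the days on which an id has data, but every bin and every category
    complete(nesting(id, eaten_date), bin, nesting(category, group),
             fill = list(sum_amount = 0)) %>%
    # Step 2: mean over days for each id
    group_by(id, bin, category, group) %>%
    summarise(mean_per_id = mean(sum_amount), .groups = "drop") %>%
    # Step 3: mean over all ids
    group_by(bin, category, group) %>%
    summarise(mean_amount = mean(mean_per_id), .groups = "drop")
}
```

With the `fakeData` from above, `mean_by_interval(fakeData, 15)` or `mean_by_interval(fakeData, 120)` would return one row per bin and category, with `bin` holding the bin start in minutes since midnight. Whether this is fast enough for about 100k rows inside a Shiny app has not been measured here.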
