【Question title】: for loop with dplyr summarise returning different results than group_by
【Posted】: 2018-06-22 00:01:54
【Question】:

When I wrap dplyr's summarise function in a for loop I get strange results, and I don't know why or how to fix it.

library(dplyr)

test <- data.frame(title = c("a", "b", "c","a","b","c", "a", "b", "c","a","b","c"),
                       category = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                       sex = c("m", "m", "m", "f", "f", "f", "m", "m", "m", "f", "f", "f"),
                       salary = c(50,70,90,40,60,85, 220,270,350,180,200,330))

category_list <- unique(test$category)

tmp = list()

for (category in category_list) {
  # Create an average salary line for the category
  tmp[category] <- test %>% 
    filter(category == category) %>%
    summarise(mean(salary))
  print(tmp)
}

I get this as output:

$A
[1] 162.0833

$A
[1] 162.0833

$B
[1] 162.0833

The group_by() function returns the expected result:

    test %>% group_by(category) %>% summarise(mean(salary))
# A tibble: 2 x 2
  category `mean(salary)`
  <fct>             <dbl>
1 A                  65.8
2 B                 258.

Substituting a specific category also returns the expected result:

test %>% 
        filter(category == "A") %>%
        summarise(mean(salary))
      mean(salary)
1     65.83333

So maybe something is wrong with the category_list object? Surprisingly, when I refer to the first element of category_list, I also get the correct answer:

test %>% 
        filter(category == category_list[1]) %>%
        summarise(mean(salary))
  mean(salary)
1     65.83333

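The reason I want to figure this out (rather than just using group_by) is that I'm trying to write a script that creates many ggplot objects and then combines them with the gridExtra library.

Maybe I'm wrong and group_by could be used, but the only approach I can think of is the following pseudocode:

  • 1) Create a list of means by category, to be used as the geom_hline() argument
  • 2) Subset the data frame by category; each subset is plotted with ggplot together with its own geom_hline()
  • 3) Create a list of plot objects, one per category
  • 4) Outside the for loop, use grid.arrange() from the gridExtra library to combine the plots

Here is my code so far (not working):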

library(ggplot2)
library(gridExtra)
p = list()
avg_line = list()
tmp = list()
category_data = data.frame()
for (category in category_list) {
  # Create an average salary line for the category
  tmp[[category]] <- test %>% 
    filter(category == category) %>%
    summarise(mean(salary))
  avg_line[[category]] <- tmp[[2]]

  # Subset data frame on category 
  category_data[[category]] <- test %>% filter(category == category)

  # Make plots for each category
  p[[category]] <-
    ggplot(category_data[[category]], aes(x = title, y = salary)) +
  geom_line(color = "white") +
  geom_point(aes(color =sex)) +
  scale_color_manual(values = c("#F49171", "#81C19C")) +
  geom_hline(yintercept = avg_line[[category]], color = "white", alpha = 0.6, size = 1) +
  theme(legend.position = "none",
      panel.background = element_rect(color = "#242B47", fill = "#242B47"),
      plot.background = element_rect(color = "#242B47", fill = "#242B47"),
      axis.line = element_line(color = "grey48", size = 0.05, linetype = "dotted"),
      axis.text = element_text(family = "Georgia", color = "white"),
      axis.text.x = element_text(angle = 90),
      # Get rid of the y- and x-axis titles
      axis.title.y=element_blank(),
      axis.title.x=element_blank(),
      panel.grid.major.y = element_line(color = "grey48", size = 0.05),
      panel.grid.minor.y = element_blank(),
      panel.grid.major.x = element_blank())
}

grid.arrange(grobs = p, nrow = 1)

My desired output looks like this:

【Comments】:

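  • Couldn't you do something like yintercept = mean(category_data[[category]]$salary) instead of going to the trouble of building a new dataset? Honestly, I find this kind of task easiest if I split the data into a list of data.frames by group with split and then loop with lapply or purrr::map to make the plots.
  • Here is a split-map example, in case you want to take a different strategy: stackoverflow.com/a/46572595/2461552
  • What's going on here? filter(category == category). You are comparing category to itself, so of course the answer will always be the same.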
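
The split-and-lapply strategy suggested in the first comment can be sketched roughly as follows. This is a minimal illustration rather than the commenter's actual code: the names by_category and plots are made up for the example, and the styling is reduced to the points and the mean line.

```r
library(ggplot2)
library(gridExtra)

# Same data as in the question
test <- data.frame(
  title    = rep(c("a", "b", "c"), 4),
  category = rep(c("A", "B"), each = 6),
  sex      = rep(rep(c("m", "f"), each = 3), 2),
  salary   = c(50, 70, 90, 40, 60, 85, 220, 270, 350, 180, 200, 330)
)

# One data frame per category; the list is named after the category values
by_category <- split(test, test$category)

# Build one plot per subset; the mean is computed on the subset itself,
# so no separate lookup table of means is needed
plots <- lapply(by_category, function(d) {
  ggplot(d, aes(x = title, y = salary)) +
    geom_point(aes(color = sex)) +
    geom_hline(yintercept = mean(d$salary), alpha = 0.6) +
    ggtitle(as.character(d$category[1]))
})

# Combine the plots side by side
grid.arrange(grobs = plots, nrow = 1)
```

Because each subset is passed into the function as d, there is no loop variable that could collide with a column name.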

Tags: r for-loop ggplot2 dplyr


【Solution 1】:

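The problem in your for loop is the statement filter(category == category). It is always TRUE, because both sides refer to the category column of your data: dplyr looks names up in the data frame before it looks in the calling environment, so the loop variable is never consulted. If you really want to keep the for loop, just rename the iterator.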
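
A minimal sketch of the renamed-iterator fix, assuming the same test data as in the question. The names cat_name, category_means and avg are chosen for this example only.

```r
library(dplyr)

# Same data as in the question
test <- data.frame(
  title    = rep(c("a", "b", "c"), 4),
  category = rep(c("A", "B"), each = 6),
  sex      = rep(rep(c("m", "f"), each = 3), 2),
  salary   = c(50, 70, 90, 40, 60, 85, 220, 270, 350, 180, 200, 330)
)

# Column compared with itself: every row passes the filter
n_unfiltered <- nrow(filter(test, category == category))  # all 12 rows

category_means <- list()
for (cat_name in as.character(unique(test$category))) {
  # `category` is the column; `cat_name` is not a column name,
  # so it can only resolve to the loop variable
  category_means[[cat_name]] <- test %>%
    filter(category == cat_name) %>%
    summarise(avg = mean(salary)) %>%
    pull(avg)
}

# A is 395/6 (about 65.83) and B is 1550/6 (about 258.33)
print(category_means)
```

Depending on the installed dplyr and rlang versions, writing filter(category == .env$category) is another way to force the right-hand side to come from the environment; renaming the iterator works regardless of version.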

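However, you don't need grid.arrange at all; facet_wrap gives you what you are looking for (you may need to reformat the facet labels a bit; they are controlled by the theme elements whose names start with strip):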

category_means <- test %>% 
  group_by(category) %>%
  summarize_at(vars(salary), mean)

p <- test %>%
  # group_by(category) %>%
  ggplot(aes(x = title, y = salary, color = sex)) + 
  facet_wrap(~ category, nrow = 1, scales = "free_y") +  
  geom_line(color = 'white') + 
  geom_point() + 
  scale_color_manual(values = c("#F49171", "#81C19C")) +
  geom_hline(data = category_means, aes(yintercept = salary), color = 'white', alpha = 0.6, size = 1) + 
  theme(legend.position = "none",
    panel.background = element_rect(color = "#242B47", fill = "#242B47"),
    plot.background = element_rect(color = "#242B47", fill = "#242B47"),
    axis.line = element_line(color = "grey48", size = 0.05, linetype = "dotted"),
    axis.text = element_text(family = "Georgia", color = "white"),
    axis.text.x = element_text(angle = 90),
    # Get rid of the y- and x-axis titles
    axis.title.y=element_blank(),
    axis.title.x=element_blank(),
    panel.grid.major.y = element_line(color = "grey48", size = 0.05),
    panel.grid.minor.y = element_blank(),
    panel.grid.major.x = element_blank())
p
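
The facet labels mentioned above are styled through the strip.* theme elements. The following is a small sketch of how that might look on a reduced version of the plot; the colors are reused from the question's theme, and the object name p_strips is invented for this example.

```r
library(ggplot2)

# Same data as in the question
test <- data.frame(
  title    = rep(c("a", "b", "c"), 4),
  category = rep(c("A", "B"), each = 6),
  sex      = rep(rep(c("m", "f"), each = 3), 2),
  salary   = c(50, 70, 90, 40, 60, 85, 220, 270, 350, 180, 200, 330)
)

p_strips <- ggplot(test, aes(x = title, y = salary, color = sex)) +
  facet_wrap(~ category, nrow = 1, scales = "free_y") +
  geom_point() +
  theme(
    # Box behind each facet label, matched to the dark plot background
    strip.background = element_rect(color = "#242B47", fill = "#242B47"),
    # Facet label text
    strip.text = element_text(color = "white", face = "bold"),
    panel.background = element_rect(color = "#242B47", fill = "#242B47"),
    plot.background = element_rect(color = "#242B47", fill = "#242B47")
  )
p_strips
```

The same two strip.* lines can be added to the theme() call of the full plot above.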
