[Posted]: 2018-06-22 00:01:54
[Question]:
When I wrap a dplyr summarise call in a for loop I get strange results, and I don't know why or how to fix it.
library(dplyr)

test <- data.frame(title = c("a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c"),
                   category = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
                   sex = c("m", "m", "m", "f", "f", "f", "m", "m", "m", "f", "f", "f"),
                   salary = c(50, 70, 90, 40, 60, 85, 220, 270, 350, 180, 200, 330))
category_list <- unique(test$category)
tmp = list()
for (category in category_list) {
  # Create an average salary line for the category
  tmp[category] <- test %>%
    filter(category == category) %>%
    summarise(mean(salary))
  print(tmp)
}
This is the output I get:
$A
[1] 162.0833
$A
[1] 162.0833
$B
[1] 162.0833
Using group_by() returns the correct results:
test %>% group_by(category) %>% summarise(mean(salary))
# A tibble: 2 x 2
  category `mean(salary)`
  <fct>             <dbl>
1 A                  65.8
2 B                 258.
Hard-coding a specific category also returns the correct result:
test %>%
filter(category == "A") %>%
summarise(mean(salary))
mean(salary)
1 65.83333
So maybe something is wrong with the category_list object?
Surprisingly, when I use the first element of category_list directly, I also get the correct answer:
test %>%
  filter(category == category_list[1]) %>%
  summarise(mean(salary))
mean(salary)
1 65.83333
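The reason I want to figure this out (rather than just using group_by) is that I'm trying to write a script that creates a number of ggplot objects and then combines them with the gridExtra library.
Maybe I'm wrong and group_by could be used, but the only approach I can think of is the following pseudocode: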
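- 1) Create a list of means by category, to be used as the geom_hline() argument
- 2) Subset the data frame by category; each subset will be used in ggplot with its own geom_hline()
- 3) Create a list of plot objects, one per category
- 4) Outside the for loop, use grid.arrange() from the gridExtra library to combine the plots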
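The steps above can be sketched roughly as follows. This is only a minimal sketch, not the code from this post: it assumes split() and lapply() in place of the for loop, the names by_category and plots are hypothetical, and the styling is left out to keep it short.

library(ggplot2)
library(gridExtra)

# Same data as in the question, built with rep() for brevity
test <- data.frame(
  title = rep(c("a", "b", "c"), times = 4),
  category = rep(c("A", "B"), each = 6),
  sex = rep(rep(c("m", "f"), each = 3), times = 2),
  salary = c(50, 70, 90, 40, 60, 85, 220, 270, 350, 180, 200, 330),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: one data frame per category, and one mean per category
by_category <- split(test, test$category)
avg_line <- lapply(by_category, function(d) mean(d$salary))

# Step 3: one plot per category, each with its own horizontal mean line
plots <- lapply(names(by_category), function(nm) {
  ggplot(by_category[[nm]], aes(x = title, y = salary)) +
    geom_point(aes(color = sex)) +
    geom_hline(yintercept = avg_line[[nm]]) +
    ggtitle(nm)
})

# Step 4: combine the plots side by side
grid.arrange(grobs = plots, nrow = 1)

Because the subsetting is done by split() rather than by filter(), this sketch never compares a loop variable against a column of the same name.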
Here is my current code (not working):
library(dplyr)
library(ggplot2)
library(gridExtra)

p = list()
avg_line = list()
tmp = list()
category_data = data.frame()
for (category in category_list) {
  # Create an average salary line for the category
  tmp[[category]] <- test %>%
    filter(category == category) %>%
    summarise(mean(salary))
  avg_line[[category]] <- tmp[[2]]
  # Subset data frame on category
  category_data[[category]] <- test %>% filter(category == category)
  # Make plots for each category
  p[[category]] <-
    ggplot(category_data[[category]], aes(x = title, y = salary)) +
    geom_line(color = "white") +
    geom_point(aes(color = sex)) +
    scale_color_manual(values = c("#F49171", "#81C19C")) +
    geom_hline(yintercept = avg_line[[category]], color = "white", alpha = 0.6, size = 1) +
    theme(legend.position = "none",
          panel.background = element_rect(color = "#242B47", fill = "#242B47"),
          plot.background = element_rect(color = "#242B47", fill = "#242B47"),
          axis.line = element_line(color = "grey48", size = 0.05, linetype = "dotted"),
          axis.text = element_text(family = "Georgia", color = "white"),
          axis.text.x = element_text(angle = 90),
          # Get rid of the y- and x-axis titles
          axis.title.y = element_blank(),
          axis.title.x = element_blank(),
          panel.grid.major.y = element_line(color = "grey48", size = 0.05),
          panel.grid.minor.y = element_blank(),
          panel.grid.major.x = element_blank())
}
grid.arrange(grobs = p, nrow = 1)
My desired output looks like this:
[Comments]:
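- Couldn't you do something like yintercept = mean(category_data[[category]]$salary) rather than going to the trouble of making a new dataset? Honestly, I find this sort of task easiest if I split things into a list of data.frames per group via split and then loop with lapply or purrr::map to make the plots.
- Here's a split-map example, in case you want to take a different strategy: stackoverflow.com/a/46572595/2461552
- What's going on here? filter(category == category). You are comparing category to itself, so of course the answer will always be the same.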
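
To spell out the last comment: inside filter(), a bare name is looked up in the data frame's columns first, so in category == category both sides refer to the category column. Every row passes, and each iteration returns the overall mean of 162.0833. A minimal sketch of one possible fix, assuming nothing more than renaming the loop variable so that it no longer clashes with the column (cat_name and avg are hypothetical names, not from the post):

library(dplyr)

test <- data.frame(
  category = rep(c("A", "B"), each = 6),
  salary = c(50, 70, 90, 40, 60, 85, 220, 270, 350, 180, 200, 330),
  stringsAsFactors = FALSE
)

avg <- list()
for (cat_name in unique(test$category)) {
  # cat_name is not a column of test, so it is taken from the loop
  avg[[cat_name]] <- test %>%
    filter(category == cat_name) %>%
    summarise(mean_salary = mean(salary)) %>%
    pull(mean_salary)
}
print(avg)

With this change the two list elements should match the group_by() results (about 65.8 for A and 258.3 for B). If the loop variable has to keep the name category, unquoting it with filter(category == !!category) should also work, since !! forces the value to be taken from the surrounding environment rather than from the column.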