Passing `top_n` and `arrange` to ggplot (dplyr): answers

【Question title】: Passing `top_n` and `arrange` to ggplot (dplyr)
【Posted】: 2018-05-16 16:25:21
【Question】:

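There is a lovely piece of code in Section 3.3 of Tidy Text Mining that I am trying to replicate on my own dataset. With my data, however, I cannot get ggplot to "remember" that I want the data in descending order and that I want a certain top_n.

I can run the code from Tidy Text Mining and get the same chart that is shown in the book. But when I run it on my own dataset, the facets do not show the top_n (they seem to show an arbitrary number of categories), and the data within each facet is not sorted in descending order.

I can reproduce the problem with some random text data and the full code, but I can also reproduce it with mtcars, which really puzzles me.

I expect the chart below to show mpg in descending order within each facet, and to give me only the top 1 category per facet. It does not do that for me.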

require(tidyverse)

mtcars %>%
  arrange (desc(mpg)) %>%
  mutate (gear = factor(gear, levels = rev(unique(gear)))) %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup %>%
  ggplot (aes (gear, mpg, fill = am)) +
  geom_col (show.legend = FALSE) +
  labs (x = NULL, y = "mpg") +
  facet_wrap(~am, ncol = 2, scales = "free") + 
  coord_flip()

But what I really want is a chart that is sorted like the one in the Tidy Text book (example data only).

require(tidyverse)
require(tidytext)

starwars <- tibble (film = c("ANH", "ESB", "ROJ"),
                  text = c("It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire. During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet. Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy.....",
                           "It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy. Evading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth. The evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space....",
                           "Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt. Little does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star. When completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy...")) %>%
  unnest_tokens(word, text) %>%
  mutate(film = as.factor(film)) %>%
  count(film, word, sort = TRUE) %>%
  ungroup()

total_wars <- starwars %>%
  group_by(film) %>%
  summarize(total = sum(n))

starwars <- left_join(starwars, total_wars)

starwars <- starwars %>%
  bind_tf_idf(word, film, n)

starwars %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(film) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = film)) +
  geom_col(show.legend = FALSE) +
  labs (x = NULL, y = "tf-idf") +
  facet_wrap(~film, ncol = 2, scales = "free") +
  coord_flip()

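【Comments on the question】:

What do you expect the first few lines of the mtcars code to do? If you group by am and take the top mpg, you get a 2-row data frame, because am has only 2 values. Is that your intent?
Hi Camillle 14 - yes, that is the intent. The data frame should be (and is) sorted by mpg however many rows you ask for, but that ordering does not seem to carry through to ggplot in any of my datasets (although it works for the larger data example in the Tidy Text book).

Tags: r ggplot2 tidytext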

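【Solution 1】:

I believe what is tripping you up here is that top_n() defaults to the last variable in the table unless you tell it which variable to use for ordering. In the example in our book, the last variable in the data frame is tf_idf, so that is what gets used for ordering. In the mtcars example, top_n() uses the last column of the data frame for ordering, which happens to be carb.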
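As a minimal sketch of that default (not part of the original answer, and assuming a dplyr version in which top_n() is still available; it has since been superseded by slice_max()), the snippet below calls top_n() on mtcars with and without an explicit ordering variable. The names by_default and by_mpg are illustrative only.

```r
library(dplyr)

# No ordering variable given: top_n() falls back to the last column (carb)
# and prints a message naming the column it selected by.
by_default <- mtcars %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup()

# Ordering variable given explicitly: the highest mpg within each am group.
by_mpg <- mtcars %>%
  group_by(am) %>%
  top_n(1, mpg) %>%
  ungroup()

# Compare which rows each call kept.
by_default %>% select(am, mpg, carb)
by_mpg %>% select(am, mpg, carb)
```

In the first result every row is tied at the largest carb value in its group, so more rows can come back than the single row per group that was asked for.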

You can always tell top_n() which variable to use for ordering by passing it as an argument. For example, see this similar workflow using the diamonds dataset.

library(tidyverse)

diamonds %>%
  arrange(desc(price)) %>%
  group_by(clarity) %>%
  top_n(10, price) %>%
  ungroup() %>%
  ggplot(aes(cut, price, fill = clarity)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~clarity, scales = "free") + 
  scale_x_discrete(drop=FALSE) +
  coord_flip()

Created on 2018-05-17 by the reprex package (v0.2.0).

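These example datasets are not exactly parallel, because they do not have one row per combination of features the way a tidy text data frame does. Still, I am fairly sure that top_n() is the issue here.

【Comments】: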

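Hi Julia! I really appreciate your input (you noticed me, which I rather enjoy ;) ). Specifying top_n(n, variable) does give me the number of words I expect in each facet, but my final ggplot chart is still not "arranged" by descending tf_idf. So specifying top_n(10, word) in my random starwars example gives us the top 10 words, but ROJ still comes third in the list even though it has the smallest tf_idf. I hope this is a very obvious fix that I am missing, so I would really appreciate your help! :)
Hi again - I think I am starting to get the hang of this after playing around for a while. I had assumed that top_n was choosing the number of words to show in each facet, but I now realize that is not the case. So, back to the Star Wars example: if I specify top_n(5, tf_idf), it is actually telling ggplot to pick the top 5 tf_idf values ... but the chart I get shows a lot of words. So I think my question really is: how do I choose how many words are shown in each facet? I wonder whether I have misunderstood how your code in the book works?
Final update! I went back and played with the code in your example, and I realized something. You ask for top_n(15), and I assumed I was getting 15 words in each facet. I have now counted carefully, and the "Northanger Abbey" and "P&P" facets actually have 16 words. So the code was not doing what I thought it did. I will mark this as solved, because I now realize I did not understand the original example. Thank you very much!
Ah, I think I may understand what is going on here. When there are ties (that is, exactly the same tf-idf score, which can happen with small datasets such as all 6 of Jane Austen's novels or the Star Wars films), top_n() does not break the ties; it keeps all items at that rank.
Yes, that was my conclusion too. I just had not read the code in your example properly! Thank you very much for your help :)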
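
The tie behaviour described in the comments can be sketched on a tiny made-up table (the data, the film labels "A" and "B", and the names scores, with_ties and exact_n are all hypothetical, chosen only for illustration). The second pipeline is one possible way to cap each group at exactly n rows; it is an assumption about what the asker wanted, not something proposed in the thread.

```r
library(dplyr)

# Hypothetical scores: film "A" has two words tied at 0.20.
scores <- tibble(
  film   = c("A", "A", "A", "A", "B", "B", "B"),
  word   = c("rebel", "empire", "base", "plans", "luke", "hoth", "vader"),
  tf_idf = c(0.30, 0.20, 0.20, 0.10, 0.40, 0.25, 0.10)
)

# top_n() keeps every row whose rank is within the top 2, so the tie
# in film "A" yields three rows for that group instead of two.
with_ties <- scores %>%
  group_by(film) %>%
  top_n(2, tf_idf) %>%
  ungroup()

# Sorting first and then slicing by position keeps exactly two rows per
# group; which of the tied rows survives depends on the sort order.
exact_n <- scores %>%
  arrange(desc(tf_idf)) %>%
  group_by(film) %>%
  slice(1:2) %>%
  ungroup()

count(with_ties, film)
count(exact_n, film)
```

In dplyr 1.0.0 and later, slice_max(tf_idf, n = 2, with_ties = FALSE) expresses the second pipeline more directly.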