Using tidyr top_n to select by a variable and include NAs

【Question title】: Using tidyr top_n to select by a variable and include NAs
【Posted】: 2019-10-25 07:51:59
【Question】:

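I'm trying to use dplyr to group by a variable and identify the closest location for each place in my dataset. I'd also like to include all rows where the distance has not been measured yet (NA).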

library(dplyr)

# Set up df of place, distance, and destination.
df <- data.frame(place = c('A','B','B','C','C','D','D'),
                 dist = c(NA, 4, 1, 6, 3, 1, 1),
                 dest = 1:7)

# For each place, get the nearest destination. 
df %>% 
  group_by(place) %>%
  top_n(1, desc(dist))

# This does not return a row for place A.

Is there a tidyverse solution that uses top_n (a dplyr function) to identify rows by rank and also includes the unranked rows? Thanks in advance.

【Comments】:

Can a place have more than one dist value, with some of them NA and others not? If so, what should be returned in those cases?

Tags: r sorting dplyr

【Answer 1】:

This works, though there may be a more efficient solution.

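The coalesce(dist, max(dist), ...) is there because we prefer non-missing values: coalesce() returns the first non-NA value among its arguments, so a measured dist is used whenever there is one. Next, we want to make sure an arbitrary fill-in value does not end up winning in top_n, so the fallback is based on the group's max(dist). Finally, so that a row is actually returned when that fallback is itself NA (as for place A, whose only dist is NA), I added a constant as the last argument; any number will do.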

If you rank without desc, you would probably use min(dist) instead of max(dist).
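To see what the weight expression evaluates to in each group, it can help to compute it as a column before filtering. This is a minimal sketch on the question's data; the names keyed and key are only illustrative and do not come from the original answer.

```r
library(dplyr)

# Same data as in the question.
df <- data.frame(place = c('A','B','B','C','C','D','D'),
                 dist = c(NA, 4, 1, 6, 3, 1, 1),
                 dest = 1:7)

# 'key' is a hypothetical helper column: the value that top_n would rank on,
# before desc() is applied.
keyed <- df %>%
  group_by(place) %>%
  mutate(key = coalesce(dist, max(dist) + 1, 0)) %>%
  ungroup()

keyed$key
```

Place A gets 0 because max(dist) is NA when every dist in the group is missing, so coalesce() falls through to the constant. Note that max(dist) is called without na.rm = TRUE, so it is also NA in a group that mixes measured and missing distances. In that case the missing rows would get 0 as well and, assuming distances are positive, would rank ahead of every measured row under desc().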

df %>% 
  group_by(place) %>%
  top_n(1, desc(coalesce(dist, max(dist)+1, 0)))

  place  dist  dest
  <fct> <dbl> <int>
1 A        NA     1
2 B         1     3
3 C         3     5
4 D         1     6
5 D         1     7
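An alternative that avoids a placeholder value is to state the two conditions directly in filter(): keep a row if its dist is missing, or if it has the smallest dist in its group. This is a sketch and not the answer's method. It assumes that NA rows should always be kept, including in groups that also have measured distances. The name nearest is only illustrative.

```r
library(dplyr)

# Same data as in the question.
df <- data.frame(place = c('A','B','B','C','C','D','D'),
                 dist = c(NA, 4, 1, 6, 3, 1, 1),
                 dest = 1:7)

nearest <- df %>%
  group_by(place) %>%
  # min_rank() returns NA for a missing dist, and TRUE | NA is TRUE,
  # so rows with a missing dist pass the filter through is.na().
  filter(is.na(dist) | min_rank(dist) == 1) %>%
  ungroup()

nearest
```

On the question's data this keeps the same five rows as the top_n version, including both tied rows for place D. Since dplyr 1.0.0, top_n has been superseded by slice_min() and slice_max(); how those treat missing values depends on the dplyr version, so it is worth checking the documentation for the installed release before relying on them here.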

【Comments】: