Is n_distinct an exact calculation with disk.frame?

【Question title】: Is n_distinct an exact calculation with disk frames?
【Posted】: 2020-09-12 17:13:55
【Question description】:

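I am running n_distinct on a large file (>30GB), but it does not seem to produce an exact result.

I have another reference point for the same data, and the output of the disk.frame aggregation is off compared to it.

The documentation says that n_distinct is an exact calculation, not an estimate.

Is that correct?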

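【Comments】:

The rather terse help page mentions that n_distinct is a faster version of length(unique(x)).
I'm not familiar with disk.frame. Is it possible that n_distinct is computed separately for each chunk, so that a value appearing in several chunks is counted more than once?
My understanding is that it takes the distinct values within each chunk, and then the distinct values of the combined list.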

Tags: r disk.frame

【Solution 1】:

The implementation of n_distinct can be found at https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R

#' @export
#' @rdname one-stage-group-by-verbs
n_distinct_df.chunk_agg.disk.frame <- function(x, na.rm = FALSE, ...) {
  if(na.rm) {
    setdiff(unique(x), NA)
  } else {
    unique(x)
  }
}

#' @export
#' @importFrom dplyr n_distinct
#' @rdname one-stage-group-by-verbs
n_distinct_df.collected_agg.disk.frame <- function(listx, ...) {
  n_distinct(unlist(listx))
}

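As written, it looks like the exact calculation I intended. The logic is simple: it computes unique within each chunk, then computes n_distinct over the results collected from all chunks.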
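
The two-stage logic described above can be sketched in base R. This is only a minimal illustration under stated assumptions, not disk.frame's actual code path: the `chunks` list and the helper names `chunk_unique` and `collected_n_distinct` are hypothetical, and `length(unique(...))` stands in for `dplyr::n_distinct`.

```r
# Hypothetical data split into three "chunks"; values overlap across chunks.
chunks <- list(
  c(1, 2, 2, NA),
  c(2, 3, 3),
  c(3, 4, NA)
)

# Stage 1 (per chunk): keep only the unique values of the chunk.
chunk_unique <- function(x, na.rm = FALSE) {
  if (na.rm) setdiff(unique(x), NA) else unique(x)
}

# Stage 2 (after collecting): count distinct values across all chunk results.
# length(unique(...)) is used here as a stand-in for dplyr::n_distinct.
collected_n_distinct <- function(listx) {
  length(unique(unlist(listx)))
}

# Two-stage result versus a direct count on the full, un-chunked data.
two_stage <- collected_n_distinct(lapply(chunks, chunk_unique))
direct    <- length(unique(unlist(chunks)))

# Naive approach for contrast: summing per-chunk distinct counts overcounts
# values that appear in more than one chunk.
naive_sum <- sum(vapply(chunks, function(x) length(unique(x)), integer(1)))

# Same two-stage computation with NA values dropped in stage 1.
two_stage_no_na <- collected_n_distinct(lapply(chunks, chunk_unique, na.rm = TRUE))
```

In this sketch the two-stage count agrees with the direct count, because duplicates that span chunks are removed in the second stage. Only the naive per-chunk sum would count a shared value more than once.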

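That said, I cannot rule out a bug somewhere else.

Do you have a test case showing that the result is not exact? Perhaps you could contribute a PR with a test?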

【Discussion】: