【Question title】: "Dictionary is full!" error message using dplyr
【Posted】: 2021-02-10 07:02:38
【Question description】:

Hello, I'm struggling with a "Dictionary is full!" error.

Here is the head of the data:

           V1 V2 V3  scaf_name
1: scaffold_0  1  1 scaffold_0
2: scaffold_0  2  1 scaffold_0
3: scaffold_0  3  1 scaffold_0
4: scaffold_0  4  1 scaffold_0
5: scaffold_0  5  1 scaffold_0
6: scaffold_0  6  1 scaffold_0

Here is the code I tried:

library(dplyr)
tab3 <- tab %>%
    group_by(scaf_name) %>%
    summarise(Avg_group = mean(V3), Length = last(V2))

Here is the error message I get:

Error: Internal error: Dictionary is full!

Here are the dimensions of tab:

> dim(tab)
[1] 852355422         4

It seems the data frame is too large for dplyr. Does anyone know how I can solve this?

Thanks a lot.

Here is a small excerpt of the df:

> dput(tab_bis)
structure(list(V1 = c("scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0"), V2 = 1:30, V3 = c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), scaf_name = c("scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0", 
"scaffold_0", "scaffold_0", "scaffold_0", "scaffold_0")), row.names = c(NA, 
-30L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x556f4666b340>)

【Comments】:

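  • Can you show a small reproducible example with dput?
  • @akrun Sure, I added a short excerpt of the df at the end.
  • With this data, I don't get the error message. It may be that the size matters.
  • Yes, of course, given that the real data has 852,355,422 rows. Maybe someone knows a way to do the same thing with data this large? ...
  • Since it is a data.table, have you tried the data.table approach, i.e. tab[, .(avg = mean(V3)), by = scaf_name]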

Tags: r dataframe dplyr


【Solution 1】:

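This is an issue the tidyverse team already knows about: https://github.com/r-lib/vctrs/issues/1133. You are exceeding the limit of an internal value, and they have to fix it. From the issue: "... uint32_t. I thought about just making sure that we store this instead as a uint64_t ...". For another example, see https://github.com/tidyverse/dplyr/issues/5291

My solution is to use data.table.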

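library(data.table)
# Convert tab to a data.table (in the question, tab already is one)
dt = data.table(tab)
# Per scaf_name: mean of V3 and the last value of V2
dt[,.(Avg_group=mean(V3),Length=last(V2)),by = .(scaf_name)]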
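
As a minimal sketch, the same aggregation can be checked on the 30-row excerpt from the question. The excerpt is rebuilt by hand here, because the `.internal.selfref` pointer in the `dput` output cannot be pasted back into R. Using `setDT()` instead of `data.table(tab)` is an assumption on my part about what is preferable at this size: it converts by reference and so should avoid making a second copy of the table. Since the `dput` output shows that `tab` already has class `data.table`, the conversion step may not be needed at all on the real data.

```r
library(data.table)

# Rebuild the 30-row excerpt from the question (values copied from the dput output).
tab_bis <- data.frame(
  V1 = rep("scaffold_0", 30),
  V2 = 1:30,
  V3 = c(rep(1L, 8), rep(2L, 3), 3L, rep(4L, 3), rep(5L, 15)),
  scaf_name = rep("scaffold_0", 30),
  stringsAsFactors = FALSE
)

# setDT() converts to a data.table by reference instead of copying.
setDT(tab_bis)

# One row per scaffold: mean of V3 and the last V2 value within the group.
res <- tab_bis[, .(Avg_group = mean(V3), Length = last(V2)), by = scaf_name]
print(res)
```

On the excerpt this returns a single row for scaffold_0 with Length 30 and Avg_group 104/30 (about 3.47). Note that `last(V2)` only equals the scaffold length if the rows within each scaffold are ordered by position, as they are in the excerpt.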

【Comments】:

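  • I didn't really understand the issue here. Does it mean that I can't use dplyr for this, and that the dplyr team is actually trying to fix it?
  • Yes, they are trying to fix it. Using data.table, where the problem should not exist, is probably the better option.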