【Question title】: Dangers of mixing [tidyverse] and [data.table] syntax in R?
【Posted】: 2021-08-28 13:55:08
【Question description】:

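I'm getting some very strange behavior when mixing tidyverse and data.table syntax. For context, I often find myself using tidyverse syntax and then piping back to data.table when I need speed more than code readability. I know Hadley is working on a new package that combines tidyverse syntax with data.table speed, but from what I can tell it is still in its early stages, so I haven't been using it.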

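Would anyone care to explain what is going on here? This is pretty scary to me, because I have probably done this thousands of times without thinking.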

library(dplyr); library(data.table)
DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")

# now, what happens if I use a tidyverse function (distinct) and then
# convert back to data.table?
DT <- distinct(DT) %>% as.data.table()

# this works like normal
DT[iso3c %in% codes]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC

# now, what happens if I use a different tidyverse function (arrange) 
# and then convert back to data.table?
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack: (!!!!!!!!!!!!)
DT[iso3c %in% codes]
# iso3c country income
# 1:   ALB Albania   UMIC

# but these work:
DT[(iso3c %in% codes), ]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC
DT[DT$iso3c %in% codes, ]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC
DT[DT$iso3c %in% codes]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC

【Comments】:

    Tags: r dplyr data.table tidyverse


    【Solution 1】:

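    I have run into the same problem several times, and it led me to avoid mixing dplyr and data.table syntax, since I never took the time to find out why. Thanks for providing an MRE.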

    It looks like dplyr::arrange is interfering with data.table auto-indexing:

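    • An index is used when subsetting a dataset with == or %in% on a single variable
    • By default, if no index exists for that variable at filter time, one is created automatically and then used
    • If the order of the data changes, the index is lost
    • You can check whether an index is being used with options(datatable.verbose=TRUE)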
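The points above can also be checked directly on the object rather than through the verbose log. The following is a minimal sketch, assuming only that data.table is installed; `indices()` lists the secondary indices attached to a table and `setindex(DT, NULL)` removes them. Whether the filter actually creates an index depends on the `datatable.auto.index` option and the data.table version, so the middle result is described rather than asserted.

```r
library(data.table)

DT <- data.table(
  iso3c   = c("MOZ", "ZMB", "ALB"),
  country = c("Mozambique", "Zambia", "Albania"),
  income  = c("LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

# A freshly built table carries no secondary index.
before <- indices(DT)

# Filtering with %in% on a single column is the pattern that can
# trigger auto-indexing (when datatable.auto.index is TRUE).
res <- DT[iso3c %in% codes]

# If an auto-index was created, it shows up here as "iso3c".
after <- indices(DT)

# Dropping all secondary indices by reference.
setindex(DT, NULL)
cleared <- indices(DT)
```

Comparing `before`, `after`, and `cleared` shows when the index appears and that it can be removed explicitly without touching the data.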

    If we explicitly turn auto-indexing on:

    library(dplyr); 
    library(data.table)
    
    DT <- fread(
    "iso3c  country income
    MOZ Mozambique  LIC
    ZMB Zambia  LMIC
    ALB Albania UMIC
    MOZ Mozambique  LIC
    ZMB Zambia  LMIC
    ALB Albania UMIC")
    codes <- c("ALB", "ZMB")
    
    options(datatable.auto.index = TRUE)
    
    DT <- distinct(DT) %>%   as.data.table()
    
    # Index creation because %in% is used for the first time
    DT[iso3c %in% codes,verbose=T]
    #> Creating new index 'iso3c'
    #> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
    #> forder took 0 sec
    #> 0.060s elapsed (0.060s cpu) 
    #> Optimized subsetting with index 'iso3c'
    #> forder.c received 2 rows and 1 columns
    #> forder took 0 sec
    #> x is already ordered by these columns, no need to call reorder
    #> i.iso3c has same type (character) as x.iso3c. No coercion needed.
    #> on= matches existing index, using index
    #> Starting bmerge ...
    #> bmerge done in 0.000s elapsed (0.000s cpu) 
    #> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
    #> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
    #> 0 secs
    #>    iso3c country income
    #> 1:   ZMB  Zambia   LMIC
    #> 2:   ALB Albania   UMIC
    
    # Index mixed up by arrange
    DT <- DT %>% arrange(iso3c) %>% as.data.table()
    
    # this is wack because data.table possibly still uses the old index whereas row/references were rearranged:
    DT[iso3c %in% codes,verbose=T]
    #> Optimized subsetting with index 'iso3c'
    #> forder.c received 2 rows and 1 columns
    #> forder took 0 sec
    #> x is already ordered by these columns, no need to call reorder
    #> i.iso3c has same type (character) as x.iso3c. No coercion needed.
    #> on= matches existing index, using index
    #> Starting bmerge ...
    #> bmerge done in 0.000s elapsed (0.000s cpu) 
    #> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
    #>    iso3c country income
    #> 1:   ALB Albania   UMIC
    
    # this works because wrapping i in (...) keeps data.table from using the auto-index
    DT[(iso3c %in% codes)]
    #>    iso3c country income
    #> 1:   ALB Albania   UMIC
    #> 2:   ZMB  Zambia   LMIC
    

    To avoid this problem, you can disable auto-indexing:

    library(dplyr); 
    library(data.table)
    
    DT <- fread(
    "iso3c  country income
    MOZ Mozambique  LIC
    ZMB Zambia  LMIC
    ALB Albania UMIC
    MOZ Mozambique  LIC
    ZMB Zambia  LMIC
    ALB Albania UMIC")
    codes <- c("ALB", "ZMB")
    
    options(datatable.auto.index = FALSE) # Disabled
    
    DT <- distinct(DT) %>%   as.data.table()
    
    # No automatic index creation
    DT[iso3c %in% codes,verbose=T]
    #>    iso3c country income
    #> 1:   ZMB  Zambia   LMIC
    #> 2:   ALB Albania   UMIC
    
    DT <- DT %>% arrange(iso3c) %>% as.data.table()
    
    # This now works because auto-indexing is off:
    DT[iso3c %in% codes,verbose=T]
    #>    iso3c country income
    #> 1:   ALB Albania   UMIC
    #> 2:   ZMB  Zambia   LMIC
    
    

    I reported this issue at data.table/issues/5042 and dtplyr/issues/259; it has been added to the 1.4.11 milestone.
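If switching the option off globally feels too broad, two narrower approaches may be worth considering. This is a sketch under assumptions, not a confirmed fix from the issue threads: it replaces `distinct()` and `arrange()` with data.table's own `unique()` and `setorder()`, and it defines a hypothetical helper, `drop_indices()`, that clears any secondary index after a round trip through another package.

```r
library(data.table)

DT <- data.table(
  iso3c   = c("MOZ", "ZMB", "ALB", "MOZ", "ZMB", "ALB"),
  country = c("Mozambique", "Zambia", "Albania",
              "Mozambique", "Zambia", "Albania"),
  income  = c("LIC", "LMIC", "UMIC", "LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

# data.table equivalent of distinct(): drop duplicated rows.
DT <- unique(DT)

# This filter may create an auto-index on iso3c.
invisible(DT[iso3c %in% codes])

# data.table equivalent of arrange(): sorts by reference, so
# data.table itself knows the row order changed.
setorder(DT, iso3c)
sorted <- DT[iso3c %in% codes]

# Hypothetical helper (not part of any package): convert to
# data.table and clear secondary indices. If the input is already
# a data.table, its index may be cleared by reference as well.
drop_indices <- function(df) {
  dt <- as.data.table(df)
  setindex(dt, NULL)
  dt
}

clean <- drop_indices(DT)
```

With `drop_indices()` in place of a bare `as.data.table()` at the end of a dplyr pipe, the next filter has no stale index to reuse and builds a fresh one if auto-indexing is on.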

    【Comments】:

    【Solution 2】:

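    This does not happen with the tidytable package (see below), which is now available on CRAN. tidytable lets you keep tidyverse syntax with minimal changes (distinct., arrange.) while getting data.table speed, which seems to be what the OP wants overall (and who doesn't!).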

    library(data.table)
    library(tidytable)
    
    
    
    DT <-
      fread(
        "iso3c  country income
    MOZ Mozambique  LIC
    ZMB Zambia  LMIC
    ALB Albania UMIC
    MOZ Mozambique  LIC
    ZMB Zambia  LMIC
    ALB Albania UMIC
    "
      )
    
    codes <- c("ALB", "ZMB")
    
    DT <- distinct.(DT) %>% as.data.table()
    
    # this works like normal
    DT[iso3c %in% codes]
    #>    iso3c country income
    #> 1:   ZMB  Zambia   LMIC
    #> 2:   ALB Albania   UMIC
    
    DT <- DT %>% arrange.(iso3c) %>% as.data.table()
    
    # this is no longer wack
    DT[iso3c %in% codes]
    #>    iso3c country income
    #> 1:   ALB Albania   UMIC
    #> 2:   ZMB  Zambia   LMIC
    
    # and these work as normal:
    DT[(iso3c %in% codes), ]
    #>    iso3c country income
    #> 1:   ALB Albania   UMIC
    #> 2:   ZMB  Zambia   LMIC
    
    DT[DT$iso3c %in% codes, ]
    #>    iso3c country income
    #> 1:   ALB Albania   UMIC
    #> 2:   ZMB  Zambia   LMIC
    
    DT[DT$iso3c %in% codes]
    #>    iso3c country income
    #> 1:   ALB Albania   UMIC
    #> 2:   ZMB  Zambia   LMIC
    

    【Comments】:
