R data.table weird value/reference semantics: answers

【Question title】: R data.table weird value/reference semantics
【Posted】: 2020-07-07 12:22:49
【Question】:

(This is a follow-up to this question.)

Consider this toy code:

> library(data.table)
> x <- data.frame(a = 1:2)
> foo <- function(z) { setDT(z) ; z[, b:=3:4] ; z } 
> y <- foo(x)
> 
> class(x)
[1] "data.table" "data.frame"
> x
   a
1: 1
2: 2

It looks like setDT did change the class of x, but the added column was not applied to x.
What is going on here?

【Comments】:

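At least some elements of the same issue are discussed here: github.com/Rdatatable/data.table/issues/4589
z is a reference to x until setDT, so setDT is applied to x. If you modify z first, as in foo <- function(z) {z$b <- 3:4; setDT(z); z }, then z is no longer a reference to x and setDT does not change x. Look at the output of: foo <- function(z) {print(address(z)); z}; address(x); y <- foo(x); address(y)
Or try: x <- data.frame(a = 1:2); y <- x; setDT(y); class(x)
@GKi It would be interesting if you expanded that into an answer that includes the relevant vocabulary and logic (why this happens).
This seems related: stackoverflow.com/questions/26069219/…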

Tags: r data.table

【Solution 1】:

Inside your function, z is a reference to x until setDT is called.

library(data.table)
foo <- function(z) {print(address(z)); setDT(z); print(address(z))} 
x <- data.frame(a = 1:2)
address(x)
#[1] "0x555ec9a471e8"
foo(x)
#[1] "0x555ec9a471e8"
#[1] "0x555ec9ede300"

Inside setDT, z still points to the same address as x when this line runs:

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

setattr does not make a copy. So x and z still point to the same address, and both now have class data.table (in addition to data.frame):

x <- data.frame(a = 1:2)
z <- x
class(x)
#[1] "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

class(x)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec95de600"
address(z)
#[1] "0x555ec95de600"

Then, in this case, setalloccol is called, which does:

assign("z", .Call(data.table:::Calloccolwrapper, z, 1024, FALSE))

This makes x and z point to different addresses.

address(x)
#[1] "0x555ecaa09c00"
address(z)
#[1] "0x555ec95de600"

Both have class data.table:

class(x)
#[1] "data.table" "data.frame"
class(z)
#[1] "data.table" "data.frame"

I think that if they had used

class(z) <- data.table:::.resetclass(z, "data.frame")

instead of

setattr(z, "class", data.table:::.resetclass(z, "data.frame"))

the problem would not occur.

x <- data.frame(a = 1:2)
z <- x
address(x)
#[1] "0x555ec9cd2228"
class(z) <- data.table:::.resetclass(z, "data.frame")
class(x)
#[1] "data.frame"
class(z)
#[1] "data.table" "data.frame"
address(x)
#[1] "0x555ec9cd2228"
address(z)
#[1] "0x555ec9cd65a8"

But after class(z) <- value, z no longer points to the address it pointed to before (the column itself stays where it was):

z <- data.frame(a = 1:2)
address(z)
#[1] "0x5653dbe72b68"
address(z$a)
#[1] "0x5653db82e140"
class(z) <- c("data.table", "data.frame")
address(z)
#[1] "0x5653dbe82d98"
address(z$a)
#[1] "0x5653db82e140"

But after setDT it does not point to the same address as before either:

z <- data.frame(a = 1:2)
address(z)
#[1] "0x55b6f04d0db8"
setDT(z)
address(z)
#[1] "0x55b6efe1e0e0"

As @Matt-dowle pointed out, it is also possible to change the data in x through z:

x <- data.frame(a = c(1,3))
z <- x
setDT(z)
z[, b:=3:4]
z[2, a:=7]
z
#   a b
#1: 1 3
#2: 7 4
x
#   a
#1: 1
#2: 7

R.version.string
#[1] "R version 4.0.2 (2020-06-22)"
packageVersion("data.table")
#[1] ‘1.12.8’

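【Comments】:

Thanks! This seems to be the right answer. The discussion with the data.table authors continues here: github.com/Rdatatable/data.table/issues/4589. Will update.

【Solution 2】:

An addition to GKi's answer: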

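The placement of setalloccol is indeed the direct culprit: it performs a shallow copy (i.e., it creates a new vector of pointers to the existing data columns) and additionally allocates 1024 (by default) extra slots for further columns. If the class were set to data.table after this shallow copy (via either class(z)<- or setattr), the change would apply to this new vector and not to the original argument.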
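
The over-allocation and the shallow copy can be observed from the outside. The following is a minimal sketch, assuming the default datatable.alloccol setting; the variable names are only illustrative and the exact slot count may differ between versions.

```r
library(data.table)

z <- data.frame(a = c(1L, 2L), b = c(3L, 4L))
col_before <- address(z$a)   # where column a lives before setDT

setDT(z)                     # converts z in place and over-allocates column slots

length(z)       # number of columns actually in use
truelength(z)   # number of allocated column slots, larger than length(z)

# With a shallow copy only the vector of column pointers is new,
# so the column itself is expected to stay at its old address:
identical(col_before, address(z$a))
```

Comparing address(z) before and after setDT, as in Solution 1, shows the other half of the picture: the outer list moves while the columns do not.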

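However.

Even with a fixed version of setDT (setattr called after setalloccol), it seems impossible to get consistent behavior. Some operations apply to the caller's copy and some do not.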

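# setDT.fixed is the patched setDT described above (setattr called after setalloccol); its definition is not shown here
df <- data.frame(a=1:2, b=3:4)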

foo1 <- function(z) { 
  setDT.fixed(z)
  z[, b:=5]   # will apply to the caller copy
  data.table::setDF(z)
}

foo1(df)
#    a b
# 1: 1 5
# 2: 2 5
class(df)
# [1] "data.frame"
df
#   a b
# 1 1 5
# 2 2 5

foo2 <- function(z) { 
  setDT.fixed(z)
  z[, c:=5]   # will NOT apply to the caller copy
  data.table::setDF(z)
}
foo2(df)
#    a b c
# 1: 1 3 5
# 2: 2 4 5
# Warning message:
# In `[.data.table`(z, , `:=`(c, 5)) :
#  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
class(df)
# [1] "data.table" "data.frame"
df
#    a b
# 1: 1 3
# 2: 2 4

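(Using the i argument, e.g. z[!is.na(a), b:=6], adds an extra dimension of weirdness that I won't go into here.)

Bottom line: the data.table package takes on the brave task of punching a hole in R's pass-by-value semantics. It was quite successful at that until setDT came along (incidentally, in response to an SO question here). Using setDT on an argument inside a function will probably never have consistent semantics, and is almost certain to give you nasty surprises.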
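
One way to sidestep the problem, offered here as an assumption and not something proposed in the thread, is to convert with as.data.table() inside the function. It returns a new data.table and leaves the caller's object alone, at the cost of a copy. The function name foo_copy is made up for illustration.

```r
library(data.table)

# foo_copy is a hypothetical name; as.data.table() copies instead of converting in place
foo_copy <- function(z) {
  dt <- as.data.table(z)   # new object, the caller's data.frame is not touched
  dt[, b := 3:4]           # add the column by reference, to the local copy only
  dt[]                     # return the result visibly
}

x <- data.frame(a = 1:2)
y <- foo_copy(x)

class(x)   # x keeps its original class
names(x)   # and its original columns
names(y)   # y carries the added column
```

The trade-off is memory: the copy is what buys the predictable semantics, so this only makes sense when the input is not too large to duplicate.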

【Comments】:

【Solution 3】:

library(data.table)

x <- data.frame(a = 1:2)
y <- x                #y is a reference to x
address(x)
#[1] "0x55e07e31a1e8"
address(y)
#[1] "0x55e07e31a1e8"
setDT(y)              #Add data.table to attr of y AND x, create a copy of it and let y point to it and make y a DT
address(x)
#[1] "0x55e07e31a1e8"
address(y)
#[1] "0x55e07e7b1300"
class(x)
#[1] "data.table" "data.frame"

x[, b:=3:4]
#Warning message:
#In `[.data.table`(x, , `:=`(b, 3:4)) :
#  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

z <- data.frame(a = 1:2)
class(z) <- c("data.table", "data.frame")
z[, b:=3:4]
#Warning message:
#In `[.data.table`(z, , `:=`(b, 3:4)) :
#  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

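【Comments】:

Note that this even seems to contradict the documentation: rdocumentation.org/packages/data.table/versions/1.12.8/topics/… says "In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all,". I suspect (from the GH discussion) that only a shallow copy is made, but I cannot verify that. In any case, does this explain the behavior in the question?
@OfekShilon The copy is made by R, not by data.table. But the copy is made after data.table has run setattr, so in our case both x and y get the data.table class.
@OfekShilon Actually I think this is a bug in DT, because x only claims to be a DT, but it is not one!
Oliver discusses this in his answer to the linked question: stackoverflow.com/a/62742393/89706. I don't think this is a bug per se.
@Gki, to shed a little light on what setDT() is doing: I believe setDT() modifies the class by reference, but only over-allocates the columns of the object passed to setDT(). That is why you get the .internal.selfref message even though the class is data.table.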