Add a data.table column that tells whether one of C other columns contains some value

【Question title】: Add a data.table column that tells whether one of C other columns contains some value
【Posted】: 2019-04-22 18:17:59
【Question description】:

Suppose I have a data.table in which C columns hold discrete values drawn from N possible values:

library(data.table)
set.seed(123)
datapoints = data.table(replicate(3, sample(0:5, 4, rep=TRUE)))
print(datapoints)
   V1 V2 V3
1:  1  5  3
2:  4  0  2
3:  2  3  5
4:  5  5  2

(Here C=3 and N=6, since the possible values run from 0 to 5.)

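I want to add N columns, one per possible value. Each should be TRUE in a row if any of the C columns holds that value in that row, and FALSE otherwise: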

   V1 V2 V3  has0  has1  has2  has3  has4  has5
1:  1  5  3 FALSE  TRUE FALSE  TRUE FALSE  TRUE
2:  4  0  2  TRUE FALSE  TRUE FALSE  TRUE FALSE
3:  2  3  5 FALSE FALSE  TRUE  TRUE FALSE  TRUE
4:  5  5  2 FALSE FALSE  TRUE FALSE FALSE  TRUE

I tried this:

for (value in 0:5) {
  datapoints <- datapoints[, (paste("has", value, sep="")) := (value %in% .SD), .SDcols = c("V1", "V2", "V3")]
}

The columns are added, but they are all filled with FALSE:

   V1 V2 V3  has0  has1  has2  has3  has4  has5
1:  1  5  3 FALSE FALSE FALSE FALSE FALSE FALSE
2:  4  0  2 FALSE FALSE FALSE FALSE FALSE FALSE
3:  2  3  5 FALSE FALSE FALSE FALSE FALSE FALSE
4:  5  5  2 FALSE FALSE FALSE FALSE FALSE FALSE

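It seems to me that the code would work if I replaced .SD with a reference to the current row (rather than the whole table), but I don't know how to do that.

What is an efficient way to add these columns?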

【Comments】:

Please post some data that we can use to reproduce the error and help you better.
I've added a reproducible example.
Related: How to one-hot-encode factor variables with data.table?

Tags: r data.table

【Solution 1】:

Here is one way to do it:

library(data.table)

# sample data
set.seed(123)
datapoints = data.table(replicate(3, sample(0:5, 4, rep=TRUE)))

# find if value exists
for(value in 0:5) {
  datapoints[, paste("has", value, sep="") := apply(.SD, 1, function(x) any(x %in% value)), .SDcols = c("V1", "V2", "V3")]
}

datapoints
#>    V1 V2 V3  has0  has1  has2  has3  has4  has5
#> 1:  1  5  3 FALSE  TRUE FALSE  TRUE FALSE  TRUE
#> 2:  4  0  2  TRUE FALSE  TRUE FALSE  TRUE FALSE
#> 3:  2  3  5 FALSE FALSE  TRUE  TRUE FALSE  TRUE
#> 4:  5  5  2 FALSE FALSE  TRUE FALSE FALSE  TRUE
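
A possible alternative that avoids the row-wise apply() call, sketched under the assumption that every checked column is a plain atomic vector: compare each column with the value and OR the resulting logical vectors together with Reduce(). It also hints at why value %in% .SD came out FALSE in the question: .SD is the whole table (a list of columns), so the single value is matched against entire columns rather than against the cells of one row, and the single result is recycled down the column. The table is typed in by hand to match the printed data, because sample() can return different draws in newer R versions; cols is only an illustrative name.

```r
library(data.table)

# Same data as printed in the question, entered explicitly
datapoints <- data.table(
  V1 = c(1L, 4L, 2L, 5L),
  V2 = c(5L, 0L, 3L, 5L),
  V3 = c(3L, 2L, 5L, 2L)
)
cols <- c("V1", "V2", "V3")  # columns to search (illustrative name)

for (value in 0:5) {
  # lapply() yields one logical vector per column (column == value);
  # Reduce(`|`, ...) combines them element-wise, i.e. per row
  datapoints[, paste0("has", value) := Reduce(`|`, lapply(.SD, `==`, value)),
             .SDcols = cols]
}

print(datapoints)
```

Because the comparison runs once per column instead of once per row, this form tends to scale better on tables with many rows, although that has not been benchmarked here.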

For more flexibility, you can also replace any(x %in% value) with sum(x %in% value) to get the number of times the value occurs in each row. Using the same example:

# find how many times a value exists
for(value in 0:5) {
  datapoints[, paste("has", value, sep="") := apply(.SD, 1, function(x) sum(x %in% value)), .SDcols = c("V1", "V2", "V3")]
}

datapoints
#>    V1 V2 V3 has0 has1 has2 has3 has4 has5
#> 1:  1  5  3    0    1    0    1    0    1
#> 2:  4  0  2    1    0    1    0    1    0
#> 3:  2  3  5    0    0    1    1    0    1
#> 4:  5  5  2    0    0    1    0    0    2
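
The counts can also be obtained without apply(). The sketch below assumes the checked columns share a comparable type, so that .SD == value produces a logical matrix that rowSums() can total per row. The data is again entered by hand to match the printed table, and cols is an illustrative name.

```r
library(data.table)

# Same data as printed in the question, entered explicitly
datapoints <- data.table(
  V1 = c(1L, 4L, 2L, 5L),
  V2 = c(5L, 0L, 3L, 5L),
  V3 = c(3L, 2L, 5L, 2L)
)
cols <- c("V1", "V2", "V3")  # columns to search (illustrative name)

for (value in 0:5) {
  # .SD == value is a logical matrix; rowSums() counts the TRUEs in each row.
  # as.integer() drops any names and stores the counts as integers.
  datapoints[, paste0("has", value) := as.integer(rowSums(.SD == value)),
             .SDcols = cols]
}

print(datapoints)
```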

Of course, you can still use .SDcols if you only need to check a subset of the columns.
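
As a small illustration of restricting the search, the sketch below checks only V1 and V2 for the value 5. The names subset_cols and has5_V1V2 are made up for this example, and the data is entered by hand to match the printed table.

```r
library(data.table)

# Same data as printed in the question, entered explicitly
datapoints <- data.table(
  V1 = c(1L, 4L, 2L, 5L),
  V2 = c(5L, 0L, 3L, 5L),
  V3 = c(3L, 2L, 5L, 2L)
)

subset_cols <- c("V1", "V2")  # hypothetical subset; V3 is ignored

# .SD now contains only V1 and V2, so a 5 in V3 does not count
datapoints[, has5_V1V2 := apply(.SD, 1, function(x) any(x %in% 5)),
           .SDcols = subset_cols]

print(datapoints)
```

Row 3 has a 5 only in V3, so it is FALSE here even though has5 was TRUE when all three columns were searched.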

【Discussion】: