Convert columns of arbitrary class to the class of matching columns in another data.table: answers

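【Question title】: Convert columns of arbitrary class to the class of matching columns in another data.table
【Posted】: 2015-12-04 15:32:25
【Question】:

The problem:

I am working in R. I want the shared columns of 2 data.tables (shared meaning same column name) to have matching classes. I am struggling to find a way to generically convert an object of unknown class to the unknown class of another object.

More context:

I know how to set the class of a column in a data.table, and I know about the as function. This question is not entirely data.table-specific, but it comes up often when I use data.tables. Also, assume that the desired coercion is possible.

I have 2 data.tables. They share some column names, and those columns are meant to represent the same information. For the column names shared by table A and table B, I want the classes in A to match the classes in B (or the other way around).

Example data.tables: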

A <- structure(list(year = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), stratum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L)), .Names = c("year", "stratum"), row.names = c(NA, -45L), class = c("data.table", "data.frame"))

B <- structure(list(year = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), stratum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L), bt = c(-9.95187702337873, -9.48946944434626, -9.74178662514147, -5.36167545158338, -4.76405522202426, -5.41964239804882, -0.0807951335119085, 0.520481719699774, 0.0393874225863578, 5.40557402913123, 5.47927931969583, 5.37228402911139, 9.82774396910091, 9.89629694010177, 9.98105260936272, -9.82469892896284, -9.42530210357904, -9.66171049964775, -5.17540952901709, -4.81859082470115, -5.3577146169737, -0.0685310909609001, 0.441383303157166, -0.0105897444321987, 5.24205882775199, 5.65773605162835, 5.40217185632441, 9.90299445851434, 9.78883672575814, 9.98747998379124, -9.69843398105195, -9.31530717395811, -9.77406601252698, -4.83080164375344, -4.89056304189872, -5.3904000267275, -0.121508487954861, 0.493798577602088, -0.118550709142654, 5.23654772583187, 5.87760447006892, 5.22478092346285, 9.90949768116403, 9.85433376398086, 9.91619307289277), yr = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), .Names = c("year", "stratum", "bt", "yr"), row.names = c(NA, -45L), class = c("data.table", "data.frame"), sorted = c("year", "stratum"))

Here is what they look like:

> A  
    year stratum
 1:    1       1
 2:    1       2
 3:    1       3
 4:    1       4

> B
    year stratum          bt yr
 1:    1       1 -9.95187702  1
 2:    1       2 -9.48946944  1
 3:    1       3 -9.74178663  1
 4:    1       4 -5.36167545  1

Here are the classes:

> sapply(A, class)
     year   stratum 
"integer" "integer"

> sapply(B, class)
     year   stratum        bt        yr 
"numeric" "integer" "numeric" "numeric"

Manually, I can accomplish the desired task with:

A[,year:=as.numeric(year)]

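This is easy when only 1 column needs changing, you know the column ahead of time, and you know the desired class ahead of time. Converting arbitrary columns to any given class is also easy.

My failed attempt:

(Edit: this actually does work; see my answer)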

s2c <- function (x, type = "list") 
{
    as.call(lapply(c(type, x), as.symbol))
}

# In this case, I can assume all columns of A can be found in B
# I am also able to assume that the desired conversion is possible
B.class <- sapply(B[,eval(s2c(names(A)))], class) 
for(col in names(A)){
    set(A, j=col, value=as(A[[col]], B.class[col]))
}

But this still returns the year column as "integer" rather than "numeric":

> sapply(A, class)
     year   stratum 
"integer" "integer"

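The problem in the example above is that class(as(1L, "numeric")) still returns "integer". On the other hand, class(as.numeric(1L)) returns "numeric"; however, I don't know in advance that as.numeric is what is needed.

The question, restated:

How do I make column classes match when neither the columns nor the to/from classes are known in advance?

Other thoughts:

In a way, the question is mostly about arbitrary class matching. I run into it often with data.table because it is very vocal about class matching. For example, I hit a similar problem when I needed to insert an NA of the appropriate type (NA_real_ vs NA_character_, etc.), depending on the class of the column (see the related question/issue in This Question).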
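For the typed-NA side of the problem, one possible base R sketch is to index the column itself with NA_integer_, which yields a missing value of that column's own type without naming the type. The helper name typed_na is hypothetical and not part of the original question or of data.table.

```r
# Hypothetical helper (not from the original post): return a single NA
# that has the same type as x, by indexing x with a missing integer.
typed_na <- function(x) x[NA_integer_]

typeof(typed_na(c(1.5, 2)))   # type follows the double input
typeof(typed_na("a"))         # type follows the character input
typeof(typed_na(1L))          # type follows the integer input
```

This assumes plain, unnamed atomic vectors; classed columns such as factors keep their attributes, which may or may not be what is wanted.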

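Likewise, this can be seen as the general problem of converting between arbitrary classes that are not known in advance. In the past I have written functions using switch to do things like switch(class(x), double = as.numeric(...), character = as.character(...), ..., but that seems ugly. The only reason I ask in the context of data.table is that it is where I most often need this kind of functionality.

【Question discussion】:

Maybe do lapply(A, . %>% as.character %>% type.convert) or something like it on each of them. (Without library(magrittr), that is lapply(A, function(x) type.convert(as.character(x))).) It is a pretty brute-force way, though, and it will fail on fancy classes.
@nicola I think the OP is saying that one of them takes precedence over the other, yes. As in the function they attempted, which switches A to B's classes (I think).
What about storage.mode? For example storage.mode(df1$A)<-storage.mode(df2$A) (or similar).
Rbaat, a FR on the GitHub page pointing to this link would be great! Maybe we could export a function to make these operations easy...
If these are actually in files, you could read the first file, A, and then try fread(file, colClasses = sapply(A, class)[match(names(B), names(A))]) on B. That worked when I tried it.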
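The storage.mode suggestion in the comments above can be sketched on two small made-up vectors that stand in for A$year and B$year. This is only an illustration under the assumption that both columns are plain atomic vectors; it does nothing for classed columns such as Date or factor.

```r
# Hypothetical stand-ins for A$year (integer) and B$year (double)
a_year <- c(1L, 2L, 3L)
b_year <- c(1, 2, 3)

# Copy the storage mode of the reference column onto the target column
storage.mode(a_year) <- storage.mode(b_year)

class(a_year)    # now matches class(b_year)
typeof(a_year)   # now matches typeof(b_year)
```

Inside a loop over the shared column names, the converted vector could then be written back with set(A, j=col, value=...), as in the question's own loop.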

Tags: r class data.table

【Solution 1】:

Not very elegant, but you can "build" the as.* call like this:

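# [[ ]] extracts the column by name for both data.frame and data.table;
# class(...)[1L] guards against columns that carry more than one class
for (x in colnames(A)) { A[[x]] <- eval(call(paste0("as.", class(B[[x]])[1L]), A[[x]])) }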
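The same idea can be wrapped in a small function so the as.* lookup happens in one place. The following is a minimal sketch in base R on plain data.frames; the name coerce_like is hypothetical, and it assumes that an as.<class> function exists for the first class of the reference column (as.numeric, as.character, as.Date, and so on).

```r
# Hypothetical helper: coerce x to the first class of `template` by looking
# up the matching as.<class> function. Assumes such a function exists.
coerce_like <- function(x, template) {
  if (identical(class(x), class(template))) return(x)  # nothing to do
  converter <- match.fun(paste0("as.", class(template)[1L]))
  converter(x)
}

# Small made-up tables shaped like the ones in the question
A <- data.frame(year = c(1L, 2L, 3L), stratum = c(1L, 2L, 3L))
B <- data.frame(year = c(1, 2, 3), stratum = c(1L, 2L, 3L),
                bt = c(-9.95, -9.49, -9.74))

# Convert only the columns the two tables share
shared <- intersect(names(A), names(B))
A[shared] <- Map(coerce_like, A[shared], B[shared])

sapply(A, class)
```

Using match.fun instead of eval(call(...)) is a design choice of this sketch, not of the original answer; it fails with an explicit error when no as.<class> function can be found.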

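【Discussion】:

A data.table approach (?): for (col in names(A)) set(A, j=col, value=eval(call(paste0("as.",B.class[col]), A[[col]]))). See my question for the definitions of B.class and the s2c function.
FWIW, I would treat this as preliminary work on the datasets, and I have no idea whether using DT calls would actually be faster (no object grows there, so unless you have a very large number of observations it shouldn't take much time). But I could be completely wrong :)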

【Solution 2】:

Here is a very crude way to ensure common classes:

library(magrittr)

cols = intersect(names(A), names(B))
r    = rbindlist(list(A = A, B = B[, ..cols]), idcol = TRUE)
r[, (cols) := lapply(.SD, . %>% as.character %>% type.convert), .SDcols=cols]
B[, (cols) := r[.id=="B", ..cols]]
A[, (cols) := r[.id=="A", ..cols]]

sapply(A, class); sapply(B, class)
#      year   stratum 
# "integer" "integer" 
#      year   stratum        yr 
# "integer" "integer" "numeric"

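I don't like this solution:

I typically use all-digit codes for IDs (e.g. "00001", "02995"), and this would coerce them to actual integers, which is bad.
Who knows what this will do to fancy classes like Date or factor? I suppose it doesn't matter if you do this col-classes normalization right after reading the data in.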
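Both drawbacks can be seen on tiny made-up inputs. In this sketch the as.is = TRUE argument is an addition that the calls above do not use; it keeps non-numeric strings as character rather than turning them into factors.

```r
# All-digit ID codes lose their leading zeros
ids <- c("00001", "02995")
converted_ids <- type.convert(ids, as.is = TRUE)
class(converted_ids)
converted_ids

# A Date does not survive the round trip through character
d <- as.Date("2015-12-04")
converted_date <- type.convert(as.character(d), as.is = TRUE)
class(converted_date)
```

The IDs come back as integers and the date comes back as a plain string, so neither column would match a reference column that still holds the original class.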

Data:

# slightly tweaked from OP
A <- setDT(structure(list(year = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), stratum = c(1L, 2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 10L, 11L, 12L, 13L, 14L, 15L)), .Names = c("year", "stratum"), row.names = 
c(NA, -45L), class = c("data.frame")))

B <- setDT(structure(list(year = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3), stratum = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
14L, 15L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L), yr = c(1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), .Names = c("year", "stratum", 
"yr"), row.names = c(NA, -45L), class = c("data.frame")))

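Comment. If you object to magrittr, use function(x) type.convert(as.character(x)) in place of the . %>% bit.

【Discussion】:

So, I'm waiting to see a better idea. This just expands on my original comment on the question.
For reference, here is the C source underlying type.convert, in case anyone is looking for inspiration
The "character" intermediate makes me a bit uneasy. I have a particular case where A needs to match B exactly, and I know they should match (A is indirectly derived from B, somehow); so your point about "weird" classes is valid for me. I'll give it a try though, because I like type.convert and didn't know about it before; maybe it will work.
Does this use showMethods(coerce)? Given all the options in there, it seems it should be possible to build a very general conversion method without going through an intermediate character. Still thinking.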

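【Solution 3】:

Based on the discussion in this question and the comments on this answer, I think I may have had it right and just stumbled on an odd exception.

Note that the class does not change, but the technicality is that it doesn't matter (for the particular use case that prompted the question). Below I show my "failed approach" again, but by merging, and by looking at the classes of the columns in the merged data.table, we can see why the approach works: the integers simply get promoted.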

s2c <- function (x, type = "list") 
{
    as.call(lapply(c(type, x), as.symbol))
}

# In this case, I can assume all columns of A can be found in B
# I am also able to assume that the desired conversion is possible
B.class <- sapply(B[,eval(s2c(names(A)))], class)
for(col in names(A)){
    set(A, j=col, value=as(A[[col]], B.class[col]))
}

# Below here is new from what I tried in question
AB <- data.table:::merge.data.table(A, B, all=T, by=c("stratum","year"))

sapply(AB, class)
  stratum      year        bt        yr 
"integer" "numeric" "numeric" "numeric"

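While this answer doesn't solve the problem in the question, I figured I'd post it to point out that in many cases the failure to convert "integer" to "numeric" may not be a problem, so this is a simple, straightforward solution.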
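The promotion mentioned above can be illustrated in base R with two made-up key vectors. This sketch does not use data.table's merge; it only shows the underlying behaviour that integer and double values compare equal and that combining them yields a double vector.

```r
# Hypothetical key columns: one integer, one double
int_year <- c(1L, 2L, 3L)
dbl_year <- c(1, 2, 3)

all(int_year == dbl_year)      # values compare equal across the two types
match(int_year, dbl_year)      # each integer key is found among the doubles
class(c(int_year, dbl_year))   # combining promotes the integers
```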

【Discussion】: