根据类型有效替换大型数据集中的负值答案

【问题标题】：Efficiently replacing negative values in large datasets conditional on type根据类型有效替换大型数据集中的负值
【发布时间】：2018-09-18 08:35:02
【问题描述】：

在我的数据集中：

# A tibble: 240 x 1,415
   matchcode S001  S002  S002EVS S003  S003A S004  S006  S007  S007_01    S008  S009  S009A S010  S010_01 S010_02 S010_03 S010_04 S011  S012  S013  S013B S014  S015  S016  S017      S017A    
   <fct>     <dbl> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl>  <dbl> <fct> <fct> <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
 1 "JPN 198~ 2     1     -4      392   392   -4     324    324 3920120324 -4    JP    JP     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    0.6789805 0.6789805
 2 "MEX 198~ 2     1     -4      484   484   -4     933   2130 4840120926 -4    MX    MX     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.1378840 1.1378840
 3 "HUN 198~ 2     1     -4      348   348   -4    1280   4321 3480121280 -4    HU    HU     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0635516 1.0635516
 4 "AUS 198~ 2     1     -4       36    36   -4     973   5478  360120973 -4    AU    AU     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    0.9616138 0.9616138
 5 "ARG 198~ 2     1     -4       32    32   -4     874   6607  320120874 -4    AR    AR     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    0.9266260 0.9266260
 6 "FIN 198~ 2     1     -4      246   246   -4     385   7123 2460120385 -4    FI    FI     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000
 7 "KOR 198~ 2     1     -4      410   410   -4       3   7744 4100120003 -4    KR    KR     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000
 8 "ZAF 198~ 2     1     -4      710   710   -4    5420  10260 7100121549 -4    ZA    ZA     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000
 9 "ARG 199~ 2     2     -4       32    32   -4     856  11163  320240856 -4    AR    AR    125   -4      -4      -4      -4      1210  -4     1    -4    -4    -4    -4    1.0000000 1.0000000
10 "BLR 199~ 2     2     -4      112   112   -4     106  11415 1120240106 -4    BY    BY     -4   -4      -4      -4      -4        -4  -4    -4    -4    -4    -4    -4    1.0000000 1.0000000

为了用 NA 替换所有负值，我使用了以下代码：

df [ df < 0 ] <- NA

然而，我只想在非字符的列上执行此操作（我想摆脱错误消息，而不抑制它们）。变量charcol 包含应该跳过的列的名称。我试过了：

df [-charcol] df [-charcol] < 0] <- NA

这给了我错误：

Error: cannot allocate vector of size 1.8 Gb

除了仍然给我警告之外：

In addition: Warning messages:
1: In Ops.factor(left, right) : ‘<’ not meaningful for factors

虽然我可能弄错了语法，但我想知道对于大型数据集的此类问题，最有效的解决方案是什么。我一直在看data.table vignette 一段时间，但我无法真正弄清楚如何执行语法。

有什么建议吗？

str(WVSsample)
Classes ‘data.table’ and 'data.frame':  240 obs. of  1415 variables:
 $ matchcode  : Factor w/ 240 levels "ALB 1998 ","ALB 2002 ",..: 108 134 88 12 4 73 117 232 5 25 ...
 $ S001       :Class 'labelled'  atomic [1:240] 2 2 2 2 2 2 2 2 2 2 ...
  .. ..- attr(*, "label")= chr "Study"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
  .. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S002       :Class 'labelled'  atomic [1:240] 1 1 1 1 1 1 1 1 2 2 ...
  .. ..- attr(*, "label")= chr "Wave"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:11] -5 -4 -3 -2 -1 1 2 3 4 5 ...
  .. .. ..- attr(*, "names")= chr [1:11] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S002EVS    :Class 'labelled'  atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
  .. ..- attr(*, "label")= chr "EVS-wave"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:9] -5 -4 -3 -2 -1 1 2 3 4
  .. .. ..- attr(*, "names")= chr [1:9] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S003       :Class 'labelled'  atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
  .. ..- attr(*, "label")= chr "Country/region"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
  .. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S003A      :Class 'labelled'  atomic [1:240] 392 484 348 36 32 246 410 710 32 112 ...
  .. ..- attr(*, "label")= chr "Country/regions [with split ups]"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:199] -5 -4 -3 -2 -1 4 8 12 16 20 ...
  .. .. ..- attr(*, "names")= chr [1:199] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...
 $ S004       :Class 'labelled'  atomic [1:240] -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
  .. ..- attr(*, "label")= chr "Set"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:7] -5 -4 -3 -2 -1 1 2
  .. .. ..- attr(*, "names")= chr [1:7] "Missing; Unknown" "Not asked in survey" "Not applicable" "No answer" ...

编辑：@chinsoon12 提到使用以下代码：

f_dowle3 = function(DT) {
  for (j in seq_len(ncol(DT)))
    set(DT,which(is.na(DT[[j]])),j,0)
}

然而，这段代码并没有做两件事：

它将 NA 替换为零，而我想用 NA 替换负值。我需要将which(is.na(DT[[j]])) 部分更改为DT[[j]]) < 0。
不考虑字符列。

我把代码改成：

f_dowle3 = function(DT) {
  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT,which(DT[[j]]<0),j,NA)
}

但这会使数据集为 NULL。谁能帮我正确调整代码？

【问题讨论】：

能否分享示例数据集？
str(df) 也可能有用。
@SaurabhChauhan 完成
@SalmanLashkarara 完成
见stackoverflow.com/questions/7235657/…，简而言之，使用data.table::set函数

标签： r replace data.table

【解决方案1】：

由于这是一个骗子，将很快删除，因为不适合 cmets。

setDT(df)
cols <- names(df)[sapply(df, is.numeric)]
for (x in cols) {
    set(df, which(df[[x]] < 0), x, NA_real_)
}

【讨论】：

我现在正在尝试。也许我重写你给出答案的问题会更好？
真是个骗子。我似乎无法在 google 中找到相关的
从David arenburg's comment借来的：像df[,names(df) := lapply(.SD,function(x) { if (is.numeric(x) { x[x<0] <- NA_Real_ }; x })]这样的东西应该一次性完成
仅供参考，可能应该养成使用 sapply(df, is.numeric) 的习惯（在您对 cols 的定义中），由 Matt 和 Kevin 的 cmets 判断：stackoverflow.com/a/18835813
@Tom 在 is.numeric(x) 之后，我忘了关闭 if 条件：if (is.numeric(x))