Manipulating character arrays quickly in R data.table: answers

【Question title】: Manipulating character arrays quickly in R data.table [duplicate]
【Posted】: 2018-12-18 07:09:24
【Question】:

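I have a huge dataset of character vectors (14 GB, 200 million rows). I read it in with fread (which took > 30 minutes on a 48-core, 128 GB server). Each string contains concatenated information on several fields. For example, the first row of my table looks like this: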

2014120900000001091500bbbbcompany_name00032401

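where the first 8 characters are the date in YYYYMMDD format, the next 8 characters are the id, the next 6 characters are the time in HHMMSS format, the next 16 characters are the name (left-padded with b's), and the last 8 characters are the price (with two implied decimal places).

I need to turn the 1-column data.table above into 5 columns: date, id, time, name, price.

For the character vector above, the result would be: date = "2014-12-09", id = 1, time = "09:15:00", name = "company_name", price = 324.01

I am looking for a (very) fast and efficient dplyr / data.table solution. Right now I am using substr: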

date = as.Date(substr(d, 1, 8), "%Y%m%d");

It takes forever to execute!

Update: with readr::read_fwf I can read the file in 5-10 minutes. Apparently it reads this file faster than fread does. The code is below:

f = "file_name";
num_cols = 5;
col_widths = c(8,8,6,16,8);
col_classes = "ciccn";
col_names = c("date", "id", "time", "name", "price");

# takes 5-10 mins
data = readr::read_fwf(file = f, col_positions = readr::fwf_widths(col_widths, col_names), col_types = col_classes, progress = T);

setDT(data);
# object.size(data) / 2^30; # 17.5 GB
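
After a read like the one above, the columns still hold raw field values (the date as "20141209", the price as 32401, and so on). The following is a minimal sketch of one way to finish the conversion, not the asker's own code. The one-row table is a hypothetical stand-in for what read_fwf plus setDT would produce with col_types "ciccn", and stripping the padding with "^b+" assumes that real names never begin with the letter b.

```r
library(data.table)

# Hypothetical one-row stand-in for the table produced by read_fwf + setDT
# with col_types "ciccn" (character, integer, character, character, numeric).
data <- data.table(
  date  = "20141209",
  id    = 1L,
  time  = "091500",
  name  = "bbbbcompany_name",
  price = 32401
)

# := replaces the columns by reference, so the whole table is not copied.
data[, `:=`(
  date  = as.Date(date, format = "%Y%m%d"),
  time  = as.ITime(time, format = "%H%M%S"),
  name  = sub("^b+", "", name),  # assumption: names never start with "b"
  price = price / 100            # two implied decimal places
)]

print(data)
```

Doing the conversion with := matters at this scale because a 17.5 GB table leaves little room for full copies in 128 GB of memory.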

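【Comments on the question】:

Have you tried read.fwf to read the fixed-width substrings directly into separate columns?
Related: Faster way to read fixed-width files
@Henrik Thanks for pointing that out. I wasn't aware of fixed-width reading (which is exactly what I want).

Tags: arrays r data.table

【Solution 1】:

A possible solution: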

library(data.table)
library(stringi)

widths <- c(8,8,6,16,8)
sp <- c(1, cumsum(widths[-length(widths)]) + 1)
ep <- cumsum(widths)

DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]

which gives:

         V1       V2     V3               V4       V5
1: 20141209 00000001 091500 bbbbcompany_name 00032401
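
To make the position arithmetic above concrete, here is a small sketch that prints the start and end positions derived from the widths and applies them to the sample row. It uses base R's substring as a stand-in for stri_sub so that it runs without extra packages; the variable names x and fields are only illustrative.

```r
widths <- c(8, 8, 6, 16, 8)

# Start of each field: 1, then one past the end of every field but the last.
sp <- c(1, cumsum(widths[-length(widths)]) + 1)  # 1 9 17 23 39
# End of each field: running total of the widths.
ep <- cumsum(widths)                             # 8 16 22 38 46

x <- "2014120900000001091500bbbbcompany_name00032401"

# substring() is vectorised over first/last, so one call yields all fields.
fields <- substring(x, sp, ep)
print(fields)
```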

Including some extra processing to get the desired result:

DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))
   ][, .(date = as.Date(V1, "%Y%m%d"),
         id = as.integer(V2),
         time = as.ITime(V3, "%H%M%S"),
         name = sub("^(bbbb)","",V4),
         price = as.numeric(V5)/100)]

which gives:

         date id     time         name  price
1: 2014-12-09  1 09:15:00 company_name 324.01

But you are actually reading a fixed-width file, so you could also consider read.fwf from base R, read_fwf from readr, or writing your own fread.fwf function, as I did a while ago:

fread.fwf <- function(file, widths, enc = "UTF-8") {
  sp <- c(1, cumsum(widths[-length(widths)]) + 1)
  ep <- cumsum(widths)
  fread(file = file, header = FALSE, sep = "\n", encoding = enc)[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
}
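
A short usage sketch for the function above, assuming a small temporary file stands in for the real 14 GB input. The function is repeated so the example is self-contained; the temporary file and the name res are illustrative only.

```r
library(data.table)
library(stringi)

# Same definition as above, repeated so this example runs on its own.
fread.fwf <- function(file, widths, enc = "UTF-8") {
  sp <- c(1, cumsum(widths[-length(widths)]) + 1)
  ep <- cumsum(widths)
  fread(file = file, header = FALSE, sep = "\n", encoding = enc)[
    , lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
}

# Write three copies of the sample row to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(rep("2014120900000001091500bbbbcompany_name00032401", 3), tmp)

# Every column comes back as character (V1..V5); type conversion is a
# separate step, as in the processing shown earlier.
res <- fread.fwf(tmp, widths = c(8, 8, 6, 16, 8))
print(res)

unlink(tmp)
```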

Data used:

DT <- data.table(V1 = "2014120900000001091500bbbbcompany_name00032401")

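【Comments】:

Thanks for the solution, @Jaap. I tried readr::read_fwf, which gave me satisfactory performance. I haven't tried your solution yet.

【Solution 2】:

Maybe your solution isn't so bad.

I am using this data: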

df <- data.table(text = rep("2014120900000001091500bbbbcompany_name00032401", 100000))

Your solution:

> system.time(df[, .(date = as.Date(substr(text, 1, 8), "%Y%m%d"),
+                    id = as.integer(substr(text, 9, 16)),
+                    time = substr(text, 17, 22),
+                    name = substr(text, 23, 38),
+                    price = as.numeric(substr(text, 39, 46))/100)])
   user  system elapsed 
   0.17    0.00    0.17

@Jaap's solution:

> library(data.table)
> library(stringi)
> 
> widths <- c(8,8,6,16,8)
> sp <- c(1, cumsum(widths[-length(widths)]) + 1)
> ep <- cumsum(widths)
> 
> system.time(df[, lapply(seq_along(sp), function(i) stri_sub(text, sp[i], ep[i]))
+    ][, .(date = as.Date(V1, "%Y%m%d"),
+          id = as.integer(V2),
+          time = V3,
+          name = sub("^(bbbb)","",V4),
+          price = as.numeric(V5)/100)])
   user  system elapsed 
   0.20    0.00    0.21

An attempt with read.fwf:

> setClass("myDate")
> setAs("character","myDate", function(from) as.Date(from, format = "%Y%m%d"))
> setClass("myNumeric")
> setAs("character","myNumeric", function(from) as.numeric(from)/100)
> 
> ff <- function(x) {
+   file <- textConnection(x)
+   read.fwf(file, c(8, 8, 6, 16, 8),
+            col.names = c("date", "id", "time", "name", "price"),
+            colClasses = c("myDate", "integer", "character", "character", "myNumeric"))
+ }
> 
> system.time(df[, as.list(ff(text))])
   user  system elapsed 
   2.33    6.15    8.49

All outputs are the same.
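
One way to check that claim on a single row is sketched below, using base R only (substr stands in for stri_sub). As the timing snippets are written, the substr version keeps the bbbb padding in name while the stringi version strips it with sub, so that one column differs unless the same sub call is added to both. The variable names are illustrative.

```r
text <- "2014120900000001091500bbbbcompany_name00032401"

# name as extracted in the substr timing: the padding is kept.
name_substr <- substr(text, 23, 38)

# name as extracted in the stringi timing: the "bbbb" prefix is removed.
name_stringi <- sub("^(bbbb)", "", substr(text, 23, 38))

# FALSE: the two versions disagree on this column as written.
print(identical(name_substr, name_stringi))

# The price field is computed the same way in every version.
price <- as.numeric(substr(text, 39, 46)) / 100
print(price)
```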

【Comments】:

【Solution 3】:

Maybe try using a numeric matrix instead of a data.frame. Aggregation should then take less time.

【Comments】: