R: Fast string split on first delimiter occurrence (answers)

【Question title】: R: Fast string split on first delimiter occurrence
【Posted】: 2014-10-08 18:08:51
【Question description】:

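I have a file with about 40 million rows that I need to split on the first comma delimiter.

The following, which uses the stringr function str_split_fixed, works well but is very slow.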

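library(data.table)
library(stringr)

# Sample data: id (1:1000) is recycled to match the 40,000 sampled letters
df1 <- data.frame(id = 1:1000, letter1 = rep(letters[sample(1:25, 1000, replace = TRUE)], 40))
# combCol1 looks like "1,a"; combCol2 looks like "1,a,1,a"
df1$combCol1 <- paste(df1$id, ',', df1$letter1, sep = '')
df1$combCol2 <- paste(df1$combCol1, ',', df1$combCol1, sep = '')

# Split each string into exactly two pieces at the first comma
st1 <- str_split_fixed(df1$combCol2, ',', 2)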

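Any suggestions for a faster approach?

【Comments】:

Try switching from the "stringr" package to the "stringi" package, although base functions might be faster.
regmatches(df1$combCol2, regexpr(",", df1$combCol2), invert = TRUE) gives you a list. You would then need to rbind it.
stringi rocks. Consistent and lightning fast.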

Tags: regex r string split

【Solution 1】:

Update

The stri_split_fixed function in recent versions of "stringi" has a simplify argument that can be set to TRUE to return a matrix. The updated solution is therefore:

stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
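
To make the effect of the third argument (n = 2) concrete, here is a minimal sketch on a small made-up vector. The names x, m and df are hypothetical and not part of the original question; the only "stringi" call used is stri_split_fixed.

```r
library(stringi)

# Hypothetical input: some strings contain more than one comma
x <- c("1,a,1,a", "2,b,2,b", "3,c")

# n = 2 caps the result at two pieces per string, so everything after
# the first comma stays together in the second column
m <- stri_split_fixed(x, ",", n = 2, simplify = TRUE)

# Attach the two pieces back to a data frame as separate columns
df <- data.frame(raw = x, stringsAsFactors = FALSE)
df$first <- m[, 1]
df$rest  <- m[, 2]
df
```

Because simplify = TRUE already returns a character matrix, no rbind step is needed before the columns are assigned.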

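Original answer (with updated benchmarks)

If you are comfortable with the "stringr" syntax and don't want to stray too far from it, but you also want to benefit from a speed boost, try the "stringi" package instead: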

library(stringr)
library(stringi)
system.time(temp1 <- str_split_fixed(df1$combCol2, ',', 2))
#    user  system elapsed 
#    3.25    0.00    3.25 
system.time(temp2a <- do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2)))
#    user  system elapsed 
#    0.04    0.00    0.05 
system.time(temp2b <- stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE))
#    user  system elapsed 
#    0.01    0.00    0.01

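Most "stringr" functions have "stringi" parallels, but as this example shows, without simplify = TRUE the "stringi" output needs one extra step (binding the list elements together) to produce a matrix rather than a list.

Here is how it compares with @RichardScriven's suggestion in the comments: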
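
The extra binding step can be seen on a tiny input. This is only an illustrative sketch; x, pieces and bound are hypothetical names chosen for the example.

```r
library(stringi)

x <- c("1,a,1,a", "2,b,2,b")

# Without simplify, the result is a list with one character vector per input
pieces <- stri_split_fixed(x, ",", n = 2)
is.list(pieces)

# do.call(rbind, ...) stacks those vectors row by row into a character matrix
bound <- do.call(rbind, pieces)
is.matrix(bound)
bound
```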

fun1a <- function() do.call(rbind, stri_split_fixed(df1$combCol2, ",", 2))
fun1b <- function() stri_split_fixed(df1$combCol2, ",", 2, simplify = TRUE)
fun2 <- function() {
  do.call(rbind, regmatches(df1$combCol2, regexpr(",", df1$combCol2), 
                            invert = TRUE))
} 

library(microbenchmark)
microbenchmark(fun1a(), fun1b(), fun2(), times = 10)
# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  fun1a()  42.72647  46.35848  59.56948  51.94796  69.29920  98.46330    10
#  fun1b()  17.55183  18.59337  20.09049  18.84907  22.09419  26.85343    10
#   fun2() 370.82055 404.23115 434.62582 439.54923 476.02889 480.97912    10
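
Before relying on timings like these, it can be worth confirming that the approaches return the same pieces. The sketch below does that on a small hypothetical vector x rather than on the original df1; the names a, b and cc are made up for the example.

```r
library(stringi)

# Hypothetical input; every element contains at least one comma
x <- c("1,a,1,a", "2,b,2,b", "10,z,10,z")

a  <- do.call(rbind, stri_split_fixed(x, ",", 2))                   # stringi + rbind
b  <- stri_split_fixed(x, ",", 2, simplify = TRUE)                  # stringi, simplify
cc <- do.call(rbind, regmatches(x, regexpr(",", x), invert = TRUE)) # base R

# Compare values element by element (attributes are deliberately ignored)
all(a == b)
all(a == cc)
```

Note that the base R variant assumes every string contains the delimiter; a string without a comma would yield a single piece and rbind would recycle it.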

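【Comments】:

Could this new function help? It is 10x faster than simplify2array and can build a matrix from a list of vectors of unequal length. Maybe we should add a simplify argument to stri_split and stri_extract to perform this conversion to a matrix (defaulting to FALSE for backward compatibility)? With the new stri_list2matrix function I get a 4x speedup w.r.t. do.call.
I would say yes, that is helpful. It could be the new do.call(rbind, ...)
@RichardScriven: OK, work in progress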
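
As a rough illustration of the stri_list2matrix function mentioned in the comments, the sketch below converts a list of unequal-length vectors into a matrix. The input vector and the names pieces and mat are hypothetical; byrow = TRUE is used so that each list element becomes a row, and shorter elements are padded with NA by default.

```r
library(stringi)

# Splitting without a limit gives vectors of different lengths
pieces <- stri_split_fixed(c("1,a", "2", "3,c,d"), ",")
lengths(pieces)

# One row per list element; missing cells are filled with NA
mat <- stri_list2matrix(pieces, byrow = TRUE)
mat
```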