Example:
my20chars = c(LETTERS[1:10], 0:9)
set.seed(1)
x = replicate(1e4, paste0(sample(c(my20chars,"-"),10, replace=TRUE), collapse=""))
One approach:
library(data.table)
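# One row per character: `values` is the character, `ind` is the source string
d = setDT(stack(strsplit(setNames(x,x),"")))
# Keep only the characters of interest, then count each one per string
dcast(d[ values %in% my20chars ], ind ~ values, fun.aggregate = length, value.var = "ind")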
Result:
              ind 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J
    1: ---8EEAD8I 0 0 0 0 0 0 0 0 2 0 1 0 0 1 2 0 0 0 1 0
    2: --33B6E-32 0 0 1 3 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
    3: --3IFBG8GI 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 2 0 2 0
    4: --4210I8H5 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
    5: --5H4DE9F- 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0
   ---
 9996: JJFJBJ24AJ 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 5
 9997: JJI-J-0FGB 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 3
 9998: JJJ1B54H63 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 3
 9999: JJJED7A3FI 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 0 0 1 3
10000: JJJIF6GI13 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 2 3
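To make the intermediate steps easier to follow, here is a minimal sketch of the same pipeline on two toy strings. The names `alphabet`, `toy`, `long`, and `wide` are made up for this illustration and do not appear in the original code.

```r
library(data.table)

# Toy input (hypothetical, chosen only for illustration)
alphabet <- c(LETTERS[1:10], 0:9)
toy <- c("AA-1", "B-B9")

# strsplit() returns one character vector per string; setNames() labels each
# list element with its source string so that stack() records it in `ind`
long <- setDT(stack(strsplit(setNames(toy, toy), "")))
# `long` now has one row per character: `values` (the character), `ind` (the string)

# Drop characters outside the alphabet, then count per string and character
wide <- dcast(long[values %in% alphabet], ind ~ values,
              fun.aggregate = length, value.var = "ind")
print(wide)
```

Because the strings themselves serve as the grouping key, identical strings would fall into a single group and have their counts added together. With random 10-character strings that is unlikely, but real data may need a separate ID column.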
Benchmark:
library(microbenchmark)
nstrs = 1e5
nchars = 10
x = replicate(nstrs, paste0(sample(c(my20chars,"-"), nchars, replace=TRUE), collapse=""))
microbenchmark(
  dcast = {
    d = setDT(stack(strsplit(setNames(x,x),"")))
    dcast(d[ values %in% my20chars ], ind ~ values, fun.aggregate = length, value.var="ind")
  },
  times = 10)
# Unit: seconds
#   expr      min       lq     mean   median       uq      max neval
#  dcast 3.112633 3.423935 3.480692 3.494176 3.573967 3.741931    10
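So this is not fast enough for the OP's 75 million strings: at roughly 3.5 seconds per 100,000 strings, a linear extrapolation puts the full set at around 45 minutes. It may still be a good starting point.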
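One possible way to build on this for larger inputs is to run the same steps on one chunk of strings at a time, so that the long one-row-per-character table never has to hold every string at once. This is only a sketch: `count_chars_chunked` and its `chunk_size` argument are hypothetical names, it has not been benchmarked, and it bounds memory use rather than making the per-string work faster.

```r
library(data.table)

# Hypothetical helper (not from the original answer): apply the same
# strsplit/stack/dcast steps to one chunk of strings at a time.
count_chars_chunked <- function(x, alphabet, chunk_size = 1e5) {
  # Assign each string to a chunk of at most `chunk_size` elements
  chunk_id <- ceiling(seq_along(x) / chunk_size)
  pieces <- lapply(split(x, chunk_id), function(chunk) {
    long <- setDT(stack(strsplit(setNames(chunk, chunk), "")))
    long <- long[values %in% alphabet]
    # A chunk with no characters of interest contributes nothing
    if (nrow(long) == 0L) return(NULL)
    dcast(long, ind ~ values, fun.aggregate = length, value.var = "ind")
  })
  # A chunk may lack some characters entirely, so missing columns are NA-filled
  res <- rbindlist(pieces, use.names = TRUE, fill = TRUE)
  # Turn those NAs into zero counts
  for (col in setdiff(names(res), "ind")) {
    set(res, which(is.na(res[[col]])), col, 0L)
  }
  res[]
}

# Usage on a tiny made-up input, two strings per chunk
alphabet <- c(LETTERS[1:10], 0:9)
res <- count_chars_chunked(c("AA-1", "B-B9", "J0J0"), alphabet, chunk_size = 2)
```

A string that occurs in more than one chunk would appear as more than one row here, so duplicates would still need separate handling.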