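Programmatically split non-delimited strings and generate new columns: answers

【Question title】: Programmatically split non-delimited strings and generate new columns
【Posted】: 2018-01-31 16:43:36
【Question description】:

I have a one-column data.table containing non-delimited strings, like this: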

library(data.table)
d1 = data.table(x = c("2728661941-1945", "2657461921-1925", "2786161921-1925"))
d1
#>                  x
#> 1: 2728661941-1945
#> 2: 2657461921-1925
#> 3: 2786161921-1925

I have another data.table that describes the field layout:

dic = data.table(field = c("ID","group","year"),start=c(1,6,7), length=c(5,1,9))
dic
#>    field start length
#> 1:    ID     1      5
#> 2: group     6      1
#> 3:  year     7      9

I want to use the information in dic to split the strings in d1 and get a new data.table of this form:

d2 = data.table(ID = c("27286", "26574", "27861"), 
                group = c(6, 6, 6),
                year = c("1941-1945", "1921-1925", "1921-1925"))
d2
#>       ID group      year
#> 1: 27286     6 1941-1945
#> 2: 26574     6 1921-1925
#> 3: 27861     6 1921-1925

I tried:

d2 = copy(d1)[,(dic$field) := transpose(
            lapply(x, stri_sub, from = dic$start, length = dic$length))]

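However, the resulting data ends up in list form rather than as a proper table. I would like to be able to refer to the created fields as columns.

I have to admit I'm not entirely sure what I'm doing, and I don't really have to use data.table for this, but I can't think of another way to do it. The simplest dataset I have contains 79-character strings and will produce 25 fields, so I don't want to extract each field individually.

I hope this makes sense. Any suggestions are appreciated.

【Question discussion】:

Tags: r split data.table

【Solution 1】:

1) read.fwf  Try read.fwf. No packages are used.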

read.fwf(textConnection(d1$x), dic$length, col.names = dic$field)

giving:

     ID group      year
1 27286     6 1941-1945
2 26574     6 1921-1925
3 27861     6 1921-1925
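
One thing to watch with read.fwf is that it converts column types by default, so an ID such as "07286" would come back as the integer 7286. The sketch below assumes you want every field kept as text and passes colClasses = "character" through to read.table. The sample data (including the ID with a leading zero) is made up for illustration, and plain data frames stand in for the data.tables of the question.

```r
# Hypothetical sample data: the first ID starts with a zero
d1 <- data.frame(x = c("0728661941-1945", "2657461921-1925"),
                 stringsAsFactors = FALSE)
dic <- data.frame(field = c("ID", "group", "year"),
                  start = c(1, 6, 7),
                  length = c(5, 1, 9),
                  stringsAsFactors = FALSE)

# Read the strings as fixed-width records, keeping every column as character
con <- textConnection(d1$x)
res <- read.fwf(con, widths = dic$length, col.names = dic$field,
                colClasses = "character")
close(con)

res$ID  # the leading zero is preserved
```

If some columns should be numeric, they can be converted afterwards one by one, which keeps the decision explicit.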

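2) separate  This also works and gives the same answer:

library(tidyr)

# A numeric sep gives the cut positions between fields,
# so it needs one fewer element than into
d1 %>% 
   separate(x, sep = dic$start[-1] - 1, into = dic$field, remove = TRUE)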

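【Discussion】:

【Solution 2】:

Regular expressions are useful here, especially since you can programmatically define both the pattern to search for and the pattern to output.

library(dplyr)
library(tidyr)

d1 %>% 
    mutate(x=gsub(paste0("(.{", dic$length, "})", collapse=""), paste0("\\", seq_along(dic$length), collapse=" "), x)) %>% 
    separate(x, into=dic$field, sep=" ")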

#      ID group      year
# 1 27286     6 1941-1945
# 2 26574     6 1921-1925
# 3 27861     6 1921-1925

Explanation

# Pattern to search for
paste0("(.{", dic$length, "})", collapse="")
# "(.{5})(.{1})(.{9})"
      # (.{5}) - group that contains any 5 characters - will be group 1
      # (.{1}) - group that contains any 1 character - will be group 2
      # (.{9}) - group that contains any 9 characters - will be group 3

# Pattern to output
paste0("\\", seq_along(dic$length), collapse=" ")
# "\\1 \\2 \\3"
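      # \\1 - output group 1
      # \\2 - output group 2
      # \\3 - output group 3
      # each group is separated by a space

tidyr::separate then splits the resulting space-delimited string into separate fields. This relies on the fields themselves containing no spaces.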

【Discussion】:

【Solution 3】:

This doesn't use the dic table, but it can be done easily with extract from tidyr:

library(tidyr)
extract(d1, x, c("ID", "group", "year"), "^(.{5})(.{1})(.{9})$")

Result:

      ID group      year
1: 27286     6 1941-1945
2: 26574     6 1921-1925
3: 27861     6 1921-1925

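【Discussion】:

Thanks @user, but this seems to require specifying the field names and character lengths manually. It works well in this example, but I have many more fields and longer strings.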
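
One way to address that concern is to build the regex from dic instead of typing it by hand. The sketch below is only an illustration: it assumes the fields are contiguous, start at position 1, and cover the whole string (which holds for the example dic), and it uses plain data frames rather than data.tables.

```r
d1 <- data.frame(x = c("2728661941-1945", "2657461921-1925", "2786161921-1925"),
                 stringsAsFactors = FALSE)
dic <- data.frame(field = c("ID", "group", "year"),
                  start = c(1, 6, 7),
                  length = c(5, 1, 9),
                  stringsAsFactors = FALSE)

# Guard the assumption: each field starts right after the previous one ends
stopifnot(all(dic$start == cumsum(c(1, head(dic$length, -1)))))

# One capture group per field, e.g. "^(.{5})(.{1})(.{9})$"
pattern <- paste0("^", paste0("(.{", dic$length, "})", collapse = ""), "$")

# tidyr::extract is called with its namespace to avoid clashing with magrittr::extract
res <- tidyr::extract(d1, x, into = dic$field, regex = pattern)
res
```

With 25 fields the same two lines apply unchanged, since both the pattern and the column names come from dic.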

【Solution 4】:

Using the dic table as a reference:

library(dplyr)

breaks <- setNames(as.list(paste0("substr(x", ", ", dic$start, ", ", dic$start+dic$length-1, ")")), dic$field)

d1 %>% 
  mutate_(.dots = breaks)
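
Since mutate_() is deprecated in current dplyr releases, here is a base R sketch of the same idea, offered as one possible alternative rather than the answer's own method. It cuts each field out with substr(), using start and start + length - 1 as the inclusive end points, and does not rely on the fields being contiguous. Plain data frames are assumed in place of the question's data.tables.

```r
d1 <- data.frame(x = c("2728661941-1945", "2657461921-1925", "2786161921-1925"),
                 stringsAsFactors = FALSE)
dic <- data.frame(field = c("ID", "group", "year"),
                  start = c(1, 6, 7),
                  length = c(5, 1, 9),
                  stringsAsFactors = FALSE)

# One character vector per field; substr() is vectorised over d1$x
pieces <- Map(function(s, l) substr(d1$x, s, s + l - 1),
              dic$start, dic$length)

# Name the pieces after the fields and bind them into a data frame
res <- as.data.frame(setNames(pieces, dic$field), stringsAsFactors = FALSE)
res
```

All resulting columns are character; convert individual columns afterwards if numeric types are needed.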

【Discussion】:

【Solution 5】:

# substring() takes an inclusive end position, so the last character of a
# field is at start + length - 1
setNames(data.frame(do.call(rbind, lapply(d1$x, function(X) sapply(1:NROW(dic),
   function(i) c(substring(X, dic$start[i], dic$start[i] + dic$length[i] - 1)))))), dic$field)
#     ID group      year
#1 27286     6 1941-1945
#2 26574     6 1921-1925
#3 27861     6 1921-1925

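【Discussion】:

【Solution 6】:

We can use the strcapture function from base R to capture the parts of each string. The captured groups are filled into a predefined data frame prototype, which sets the column names and types.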

 strcapture("(\\d{5})(\\d)(.*)",d1$x,data.frame(Id=numeric(),group=numeric(),year=character()))
     Id group      year
1 27286     6 1941-1945
2 26574     6 1921-1925
3 27861     6 1921-1925

Explanation: (\\d{5}) captures the first 5 digits, then (\\d) captures the next digit, and (.*) captures everything after that.
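
The pattern and the prototype can also be generated from dic, so nothing has to be typed per field. This is a sketch under two assumptions that are not part of the answer above: the fields are contiguous from position 1, and every column is kept as character. Plain data frames are used in place of the question's data.tables.

```r
d1 <- data.frame(x = c("2728661941-1945", "2657461921-1925", "2786161921-1925"),
                 stringsAsFactors = FALSE)
dic <- data.frame(field = c("ID", "group", "year"),
                  start = c(1, 6, 7),
                  length = c(5, 1, 9),
                  stringsAsFactors = FALSE)

# One capture group per field: "(.{5})(.{1})(.{9})"
pattern <- paste0("(.{", dic$length, "})", collapse = "")

# Zero-row prototype with one character column per field
proto <- as.data.frame(setNames(rep(list(character()), nrow(dic)), dic$field),
                       stringsAsFactors = FALSE)

res <- strcapture(pattern, d1$x, proto)
res
```

To get typed columns, the prototype could instead mix character() and numeric() entries, as the hand-written prototype above does.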

【Discussion】:

Please add more detail and explanation to the answer so that it is easier to understand.