R: How to split a column of strings into multiple columns using a format code/string? (Answers)

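【Question title】: R How to split a column of strings into multiple columns using a format code/string?
【Posted】: 2019-09-10 00:52:25
【Question description】:

I'm working with census (CTPP) data, and the GEOID field is a long string that packs in a lot of geographic information. The format of this string varies between census tables, but a lookup of format codes is provided. Here is an example GEOID and its format "code". (The portion I can already parse has been removed; this is the part of the GEOID I can't parse.)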

geoid <- "0202000000126"
format <- "ssccczzzzzzzz"

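This means the first two characters ("02") are the state (Alaska), the next three ("020") are the county, and the remaining characters are the zone.

I have a table of these geoid/format pairs, and the format can differ from row to row.

s: state
c: county
p: place
z: zone
(others not used in this simple example)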

df <- data.frame(
  geoid = c(
    "0224230",
    "0202000000126"
  ),
  format = c(
    "ssppppp",
    "ssccczzzzzzzz"
  ),
  stringsAsFactors = FALSE  # keep strings as character (the default only since R 4.0)
)

# A tibble: 2 x 2
  geoid         format       
  <chr>         <chr>        
1 0224230       ssppppp      
2 0202000000126 ssccczzzzzzzz

What I want to do is break the geoid column out into one column per geography, like this:

# A tibble: 2 x 6
  geoid         format        s     p     c     z       
  <chr>         <chr>         <chr> <chr> <chr> <chr>   
1 0224230       ssppppp       02    24230 NA    NA      
2 0202000000126 ssccczzzzzzzz 02    NA    020   00000126

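I've looked at a few approaches. extract() from tidyr looks promising. I'm also fairly sure I'll need a custom function to map(?)/apply over my data frame.

【Question discussion】:

Hi, you could change the data frame declaration from data_frame to data.frame.

Tags: r string dplyr tidyr stringr

【Solution 1】:

A base R alternative: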

geo_codes <- c("s", "c", "p", "z")

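# get starting position and lengths of consecutive characters in 'format'
# note: "+" only matches runs of two or more identical characters,
# so a code that occupies a single position would be skipped
g <- gregexpr("(.)\\1+", df$format)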

# use the result above to extract corresponding substrings from 'geoid' 
geo <- regmatches(df$geoid, g)

# select first element in each run of 'format' and split
# used to name substrings from above
fmt <- strsplit(gsub("(.)\\1+", "\\1", df$format), "")

# for each element in 'geo' and 'fmt',
# 1. create a named vector
# 2. index the vector with 'geo_codes' 
# 3. set names of the full length vector
t(mapply(function(geo, fmt){
  setNames(setNames(geo, fmt)[geo_codes], geo_codes)},
  geo, fmt))
#      s    c     p       z         
# [1,] "02" NA    "24230" NA        
# [2,] "02" "020" NA      "00000126"
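To get these values back as columns of the original data frame, the matrix can be bound to df. The following is only a minimal sketch, not the answerer's code: it swaps the regular expression for rle() (run-length encoding), which also copes with a code that occupies a single position, and the helper name split_by_format is made up for illustration.

```r
df <- data.frame(
  geoid  = c("0224230", "0202000000126"),
  format = c("ssppppp", "ssccczzzzzzzz"),
  stringsAsFactors = FALSE
)
geo_codes <- c("s", "c", "p", "z")

# split_by_format is a hypothetical helper (not part of the answer above)
split_by_format <- function(geoid, format) {
  # run-length encode the format: one entry per run of identical codes
  runs   <- rle(strsplit(format, "")[[1]])
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  # cut the geoid at the same positions and name each piece by its code
  parts  <- setNames(substring(geoid, starts, ends), runs$values)
  # reorder to the full set of codes; codes absent from this format become NA
  setNames(parts[geo_codes], geo_codes)
}

res <- t(mapply(split_by_format, df$geoid, df$format, USE.NAMES = FALSE))

# attach the new columns to the original data frame
out <- cbind(df, as.data.frame(res, stringsAsFactors = FALSE))
print(out)
```

This assumes each code appears in at most one run per format string, as in the examples in the question.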

Another alternative:

geo <- strsplit(df$geoid, "")
fmt <- strsplit(df$format, "")

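# group each geoid's characters by format code; codes absent from a row become NA
t(mapply(function(geo, fmt) {
  groups <- split(geo, factor(fmt, levels = geo_codes))
  unlist(lapply(groups, function(x) if (length(x)) paste(x, collapse = "") else NA))
}, geo, fmt))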

My first alternative is about 2x faster than the second, benchmarked on 2e5 rows.
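A rough way to reproduce such a comparison is sketched below. The test data (the two example rows repeated), the smaller row count n, and the wrapper names by_runs and by_chars are assumptions for illustration. The timings, and therefore the ratio, will depend on the machine and the R version.

```r
geo_codes <- c("s", "c", "p", "z")

# Hypothetical test data: the two example rows repeated. n is an assumption,
# kept below 2e5 so the sketch runs quickly.
n <- 2e4
big <- data.frame(
  geoid  = rep(c("0224230", "0202000000126"), length.out = n),
  format = rep(c("ssppppp", "ssccczzzzzzzz"), length.out = n),
  stringsAsFactors = FALSE
)

# First alternative: work on whole runs of the format string.
by_runs <- function(d) {
  g   <- gregexpr("(.)\\1+", d$format)
  geo <- regmatches(d$geoid, g)
  fmt <- strsplit(gsub("(.)\\1+", "\\1", d$format), "")
  t(mapply(function(geo, fmt) {
    setNames(setNames(geo, fmt)[geo_codes], geo_codes)
  }, geo, fmt))
}

# Second alternative: work character by character.
by_chars <- function(d) {
  geo <- strsplit(d$geoid, "")
  fmt <- strsplit(d$format, "")
  t(mapply(function(geo, fmt) {
    groups <- split(geo, factor(fmt, levels = geo_codes))
    unlist(lapply(groups, function(x) if (length(x)) paste(x, collapse = "") else NA))
  }, geo, fmt))
}

t_runs  <- system.time(res_runs  <- by_runs(big))
t_chars <- system.time(res_chars <- by_chars(big))

# Elapsed seconds for each approach
print(c(runs = t_runs[["elapsed"]], chars = t_chars[["elapsed"]]))
```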

【Discussion】:

Thanks! This is the type of solution I was hoping for. It's probably a lot faster than my solution, too.

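【Solution 2】:

As is often the case, writing up the question and a minimal example helped me simplify the problem and identify a solution. I'm sure there's a better solution out there, but this is what I came up with, and it's easy(ish) to follow.

Although the formats vary, the number of unique characters is limited. In the toy example in this question there are only s, c, p, z. So here's what I did:

First, I created a function that takes a single format string, a single geoid string, and a single subgeo character/code. The function determines which character positions in format match subgeo, then returns the characters at those positions from geoid.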

extract_sub_geo <- function(format, geoid, subgeo) {
  geoid_v <- unlist(strsplit(geoid, ""))
  format_v <- unlist(strsplit(format, ""))
  positions <- which(format_v == subgeo)
  result <- paste(geoid_v[positions], collapse = "")
  return(result)
}

extract_sub_geo("ssccczzzzzzzz", "0202000000126", "s")

[1] "02"

Then I loop over each unique code and use pmap() to apply the function to my whole data frame.

library(dplyr)
library(purrr)

geo_codes <- c("s", "c", "p", "z")

for (code in geo_codes) {
  df <- df %>%
    mutate(
      # geoid is the column holding the string to split
      !!code := pmap_chr(list(format, geoid, !!(code)), extract_sub_geo)
    )
}

# A tibble: 2 x 6
  geoid         format        s     c     p     z       
  <chr>         <chr>         <chr> <chr> <chr> <chr>   
1 0224230       ssppppp       02    ""    24230 ""      
2 0202000000126 ssccczzzzzzzz 02    020   ""    00000126

It would probably be cleaner to do the loop in base R rather than in dplyr.
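A base R version of that loop might look like the sketch below. This is one possible reading of the remark, not code from the answer. It reuses extract_sub_geo as defined earlier and assumes the column to split is called geoid. The conversion of empty strings to NA is also an assumption, added so the result matches the output requested in the question.

```r
extract_sub_geo <- function(format, geoid, subgeo) {
  geoid_v   <- unlist(strsplit(geoid, ""))
  format_v  <- unlist(strsplit(format, ""))
  positions <- which(format_v == subgeo)
  paste(geoid_v[positions], collapse = "")
}

df <- data.frame(
  geoid  = c("0224230", "0202000000126"),
  format = c("ssppppp", "ssccczzzzzzzz"),
  stringsAsFactors = FALSE
)
geo_codes <- c("s", "c", "p", "z")

for (code in geo_codes) {
  # mapply() walks the rows; the single code is recycled for every row
  vals <- mapply(extract_sub_geo, df$format, df$geoid, code, USE.NAMES = FALSE)
  # assumption: a code that is absent from a format should be NA, not ""
  vals[vals == ""] <- NA
  df[[code]] <- vals
}

print(df)
```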

【Discussion】:

【Solution 3】:

A tidyverse solution:

library(tidyverse)

create_new_code <- function(id, format, char) {
    format %>% 
        str_locate_all(paste0(char, "*", char)) %>% 
        unlist() %>% 
        {substr(id, .[1], .[2])}
}

create_new_codes <- function(id, format) {
    c("s", "p", "c", "z") %>% 
        set_names() %>% 
        map(create_new_code, id = id, format = format)
}

bind_cols(df, 
          with(df, map2_df(geoid, format, create_new_codes)))

#          geoid        format  s     p    c        z
#1       0224230       ssppppp 02 24230 <NA>     <NA>
#2 0202000000126 ssccczzzzzzzz 02  <NA>  020 00000126

【Discussion】: