【Question title】: Splitting strings into multiple fixed-width columns
【Posted】: 2019-03-02 16:22:44
【Question】:

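I am trying to use str_split to split the following observations into a specific format.

"00010943900008" "00010946803119" "00010946803219" "00010946803219" "00010946803219" "00010948700007"

I want to split each observation into separate columns.

So the first observation would look like this: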

Column x = 00

Column y = 01

Column z = 09439

Column w = 00008

where column x is always the first 2 digits of the observation, column y the next 2 digits, column z the next 5 digits, and column w the last 5 digits.

Data

string <- c("00010943900008", "00010946803119", "00010946803219", "00010946803219", 
"00010946803219", "00010948700007", "00010948700007", "00010948700007", 
"00010948700007", "00010948700007", "00010948700007", "00010948700007", 
"00010948700007", "00010948700007", "00010948700007", "00010948700007", 
"00010948700007", "00010948700007", "00010948700007", "00010948700007", 
"00010948700007", "00010948700007", "00010948700007", "00010948700007", 
"00010948700007", "00010948700007", "00010948700007", "00010948700007", 
"00010948700007", "00010948700007", "00010948700007", "00010948700007", 
"00010948700007", "00010948700007", "00010948700007", "00010948700007", 
"00010948700007", "00010948700007", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016", 
"00011820000016", "00011820000016", "00011820000016", "00011820000016"
)

【Comments】:

  • I would suggest writing it to a file and then reading it back with read.fwf. Otherwise, possibly use substr.

Tags: r regex string split
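
The file round trip suggested in this comment could look roughly like the sketch below. It is only an illustration: the temporary file is a hypothetical scratch location, only the first three values from the question are used, and colClasses = "character" is an added assumption so that the leading zeros survive.

```r
# Sketch: write the strings to a scratch file, then read them back as fixed width.
string <- c("00010943900008", "00010946803119", "00010946803219")

tmp <- tempfile(fileext = ".txt")   # hypothetical scratch file
writeLines(string, tmp)

result <- read.fwf(tmp,
                   widths = c(2, 2, 5, 5),
                   col.names = c("x", "y", "z", "w"),
                   colClasses = "character")   # keep leading zeros as text
unlink(tmp)                                    # clean up the scratch file
result
```

Without colClasses = "character" the columns would be read as numbers and "00" would become 0.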


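【Solution 1】:

You can either concatenate your data using \n as the separator or write it to a file, and then import it as fixed-width format with readr::read_fwf or read.fwf (from a file only). Here is the readr::read_fwf version, which does not write to disk: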

library(readr)
result = read_fwf(paste(string, collapse = "\n"),
                  col_positions = fwf_widths(c(2, 2, 5, 5), col_names = c("x", "y", "z", "w")))
head(result)
# # A tibble: 6 x 4
#   x     y     z     w
#   <chr> <chr> <chr> <chr>
# 1 00    01    09439 00008
# 2 00    01    09468 03119
# 3 00    01    09468 03219
# 4 00    01    09468 03219
# 5 00    01    09468 03219
# 6 00    01    09487 00007
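
If you would rather not rely on readr's type guessing to keep the leading zeros, the column types can be stated explicitly. This is a minor variation on the call above rather than something the original answer included, and it uses only the first three values from the question:

```r
library(readr)

string <- c("00010943900008", "00010946803119", "00010946803219")

# Same call as above, but every column is declared as character up front.
result <- read_fwf(
  paste(string, collapse = "\n"),
  col_positions = fwf_widths(c(2, 2, 5, 5), col_names = c("x", "y", "z", "w")),
  col_types = cols(.default = col_character())
)
result
```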

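【Comments】:

  • Using textConnection together with read.fwf works directly on the strings: read.fwf(textConnection(string), widths = c(2, 2, 5, 5), col.names = c("x", "y", "z", "w"), colClasses = "character"). Here colClasses = "character" keeps the leading zeros.
【Solution 2】: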

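With extract from tidyr: extract turns each regex capture group into its own column. The example below keeps the original column by setting remove = FALSE; if you do not want to keep it, use remove = TRUE (the default):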

library(dplyr)
library(tidyr)

string %>%
  data.frame(string = .) %>%
  extract(string, c("x","y","z","w"), "^(\\d{2})(\\d{2})(\\d{5})(\\d{5})", remove = FALSE)

Output:

            string  x  y     z     w
1   00010943900008 00 01 09439 00008
2   00010946803119 00 01 09468 03119
3   00010946803219 00 01 09468 03219
4   00010946803219 00 01 09468 03219
5   00010946803219 00 01 09468 03219
6   00010948700007 00 01 09487 00007
7   00010948700007 00 01 09487 00007
8   00010948700007 00 01 09487 00007
9   00010948700007 00 01 09487 00007
10  00010948700007 00 01 09487 00007
11  00010948700007 00 01 09487 00007
12  00010948700007 00 01 09487 00007
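
Since the question started from stringr, the same capture-group idea can also be sketched with stringr::str_match, which returns a character matrix whose first column is the full match. This is an alternative to the answer above, not part of it; the trailing $ in the pattern is an added assumption that each string is exactly 14 digits long.

```r
library(stringr)

string <- c("00010943900008", "00010946803119", "00010946803219")

# Column 1 of the matrix is the whole match; columns 2-5 are the capture groups.
m <- str_match(string, "^(\\d{2})(\\d{2})(\\d{5})(\\d{5})$")

result <- as.data.frame(m[, -1, drop = FALSE], stringsAsFactors = FALSE)
names(result) <- c("x", "y", "z", "w")
result
```

Strings that do not match the pattern come back as rows of NA, which makes malformed values easy to spot.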

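【Comments】:

【Solution 3】:

You can create a data frame from the strings and then use substr(), which returns part of a string based on start and end positions:

data <- as.data.frame(string)
data$x <- substr(string, 1, 2)    # characters 1-2
data$y <- substr(string, 3, 4)    # characters 3-4
data$z <- substr(string, 5, 9)    # characters 5-9
data$w <- substr(string, 10, 14)  # characters 10-14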
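
The start and end positions can also be derived from the column widths instead of being typed by hand. The sketch below is one way to do that in base R; the named widths vector is a hypothetical helper, and only the first three values from the question are used.

```r
string <- c("00010943900008", "00010946803119", "00010946803219")

widths <- c(x = 2, y = 2, z = 5, w = 5)
ends   <- cumsum(widths)        # 2, 4, 9, 14
starts <- ends - widths + 1     # 1, 3, 5, 10

# One substr() call per column; the names x, y, z, w carry over from `starts`.
pieces <- Map(function(s, e) substr(string, s, e), starts, ends)

data <- data.frame(string = string, pieces, stringsAsFactors = FALSE)
data
```

Changing the layout then only requires editing widths.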
    

【Comments】:

【Solution 4】:

We can use a regex together with read.table (this only works if every string follows the same pattern):

> read.table(text=gsub("(\\d{2})(\\d{2})(\\d{5})(\\d{5})", "\\1,\\2,\\3,\\4", string),
             colClasses="character", sep=",", stringsAsFactors = FALSE)
    V1 V2    V3    V4
1   00 01 09439 00008
2   00 01 09468 03119
3   00 01 09468 03219
4   00 01 09468 03219
5   00 01 09468 03219
6   00 01 09487 00007
7   00 01 09487 00007
8   00 01 09487 00007
9   00 01 09487 00007
10  00 01 09487 00007
...
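
Because this approach depends on every string having the same shape, it may be worth checking that first. The sketch below flags values that are not exactly 14 digits and converts only the valid ones. The third value is a deliberately shortened, made-up example, and the anchored pattern and col.names are additions for illustration.

```r
# The last value is deliberately one digit short (hypothetical bad input).
string <- c("00010943900008", "00010946803119", "0001094680321")

pattern <- "^(\\d{2})(\\d{2})(\\d{5})(\\d{5})$"
ok <- grepl(pattern, string)    # TRUE only for strings that fit the layout

result <- read.table(text = sub(pattern, "\\1,\\2,\\3,\\4", string[ok]),
                     sep = ",", colClasses = "character",
                     col.names = c("x", "y", "z", "w"))

string[!ok]   # values that were skipped
result
```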
      

【Comments】:

【Solution 5】:

Using tidyr::separate, where a numeric sep gives the character positions to split at:

library(tidyr)

data.frame(string = string[1:5]) %>% 
  separate(string, c("x", "y", "z", "w"),
           sep = c(2, 4, 9), remove = FALSE)

#           string  x  y     z     w
# 1 00010943900008 00 01 09439 00008
# 2 00010946803119 00 01 09468 03119
# 3 00010946803219 00 01 09468 03219
# 4 00010946803219 00 01 09468 03219
# 5 00010946803219 00 01 09468 03219
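
Newer tidyr releases also provide separate_wider_position, which takes the widths directly as a named vector. The sketch below assumes tidyr 1.3.0 or later and is an alternative to the answer above, not part of it; cols_remove = FALSE plays the role of remove = FALSE.

```r
library(tidyr)

df <- data.frame(
  string = c("00010943900008", "00010946803119", "00010946803219"),
  stringsAsFactors = FALSE
)

# The names of `widths` become the new column names.
result <- separate_wider_position(
  df, string,
  widths = c(x = 2, y = 2, z = 5, w = 5),
  cols_remove = FALSE   # keep the original column
)
result
```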
        

【Comments】:
