【问题标题】：splitting columns based on separator into an undefined number of columns基于分隔符将列拆分为未定义数量的列
【发布时间】：2020-04-23 15:28:00
【问题描述】：

我正在尝试根据| 将我的数据拆分为新列。例如我有这样的观察：

fdic : Federal Deposit Insurance Corp | unbco : United Bancorp Inc Ohio

我想根据| 分成两列。然而，有些观察没有分隔符，有些有超过 2 个分隔符，并且无法使用来自 tidyr 的 separate。我有以下行 as.data.frame(do.call(rbind, strsplit(xx$CO, "\\|"))) - 这几乎可以满足我的要求，但它在分离时会重复观察。

那是;

第一次观察。

evgnl : Evogene Limited | monsan : Monsanto Company

第 1 列和第 2 列正确拆分，但它重复第 1 列。

 evgnl : Evogene Limited     monsan : Monsanto Company       evgnl : Evogene Limited

我希望这些观察结果具有NA 值。

 evgnl : Evogene Limited     monsan : Monsanto Company       NA

数据：

structure(list(grp = c("10163", "8518", "2533", "6604", "7984", 
"10689", "1911", "8092", "3091", "10878", "2193", "102", "214", 
"4486", "8789", "8352", "10769", "10366", "6406", "8634"), WC = c(" 2,685 words    ", 
" 632 words    ", " 139 words    ", " 359 words    ", " 3,610 words    ", 
" 448 words    ", " 185 words    ", " 2,321 words    ", " 192 words    ", 
" 830 words    ", " 803 words    ", " 4,697 words    ", " 4,649 words    ", 
" 748 words    ", " 1,029 words    ", " 3,125 words    ", " 44 words    ", 
" 3,212 words    ", " 1,150 words    ", " 774 words    "), CO = c(" evgnl : Evogene Limited | monsan : Monsanto Company    ", 
" codvbc : Codorus Valley Bancorp Inc    ", " blycon : Blyth Inc    ", 
" icfcns : ICF International Inc.    ", " fossil : Fossil Group Inc    ", 
" jpmsi : JP Morgan Securities LLC | rganus : Reinsurance Group of America Inc | cnyc : JPMorgan Chase & Co.    ", 
" usxmar : US Steel Corp    ", "NULL", " toro : The Toro Company    ", 
" casms : CAS Medical Systems Inc    ", " fdic : Federal Deposit Insurance Corp | unbco : United Bancorp Inc Ohio    ", 
" crane : Crane Co    ", " pplres : PPL Corp    ", " unnatf : United Natural Foods Inc    ", 
" intgxc : IntelGenx Technologies Corp.    ", " gordmi : Gordmans Stores, Inc. | scp : Sun Capital Partners Inc    ", 
"NULL", " crginc : Cargill, Inc.    ", "NULL", " cytmxt : CytomX Therapeutics, Inc.    "
)), class = "data.frame", row.names = c(NA, -20L))

【问题讨论】：

标签： r

【解决方案1】：

data.table::tstrsplit 允许您使用参数fixed = FALSE 做到这一点：

library(data.table)
setDT(df)
df[,tstrsplit(CO, "\\|", fixed = FALSE)]
 V1                                          V2
 1:                   evgnl : Evogene Limited                monsan : Monsanto Company    
 2:    codvbc : Codorus Valley Bancorp Inc                                            <NA>
 3:                     blycon : Blyth Inc                                            <NA>
 4:        icfcns : ICF International Inc.                                            <NA>
 5:              fossil : Fossil Group Inc                                            <NA>
 6:          jpmsi : JP Morgan Securities LLC   rganus : Reinsurance Group of America Inc 
 7:                 usxmar : US Steel Corp                                            <NA>
 8:                                       NULL                                        <NA>
 9:                toro : The Toro Company                                            <NA>
10:        casms : CAS Medical Systems Inc                                            <NA>
11:     fdic : Federal Deposit Insurance Corp          unbco : United Bancorp Inc Ohio    
12:                       crane : Crane Co                                            <NA>
13:                      pplres : PPL Corp                                            <NA>
14:      unnatf : United Natural Foods Inc                                            <NA>
15:  intgxc : IntelGenx Technologies Corp.                                            <NA>
16:            gordmi : Gordmans Stores, Inc.           scp : Sun Capital Partners Inc    
17:                                       NULL                                        <NA>
18:                 crginc : Cargill, Inc.                                            <NA>
19:                                       NULL                                        <NA>
20:     cytmxt : CytomX Therapeutics, Inc.                                            <NA>
                                  V3
 1:                             <NA>
 2:                             <NA>
 3:                             <NA>
 4:                             <NA>
 5:                             <NA>
 6:  cnyc : JPMorgan Chase & Co.    
 7:                             <NA>
 8:                             <NA>
 9:                             <NA>
10:                             <NA>
11:                             <NA>
12:                             <NA>
13:                             <NA>
14:                             <NA>
15:                             <NA>
16:                             <NA>
17:                             <NA>
18:                             <NA>
19:                             <NA>
20:                             <NA>

你最终得到一个data.table 对象（增强的data.frame）

字符串

您还可以使用 stringr 和 endup 与矩阵：

 stringr::str_split(df$CO, "\\|", simplify = TRUE)
      [,1]                                         [,2]                                         
 [1,] " evgnl : Evogene Limited "                  " monsan : Monsanto Company    "             
 [2,] " codvbc : Codorus Valley Bancorp Inc    "   ""                                           
 [3,] " blycon : Blyth Inc    "                    ""                                           
 [4,] " icfcns : ICF International Inc.    "       ""                                           
 [5,] " fossil : Fossil Group Inc    "             ""                                           
 [6,] " jpmsi : JP Morgan Securities LLC "         " rganus : Reinsurance Group of America Inc "
 [7,] " usxmar : US Steel Corp    "                ""                                           
 [8,] "NULL"                                       ""                                           
 [9,] " toro : The Toro Company    "               ""                                           
[10,] " casms : CAS Medical Systems Inc    "       ""                                           
[11,] " fdic : Federal Deposit Insurance Corp "    " unbco : United Bancorp Inc Ohio    "       
[12,] " crane : Crane Co    "                      ""                                           
[13,] " pplres : PPL Corp    "                     ""                                           
[14,] " unnatf : United Natural Foods Inc    "     ""                                           
[15,] " intgxc : IntelGenx Technologies Corp.    " ""                                           
[16,] " gordmi : Gordmans Stores, Inc. "           " scp : Sun Capital Partners Inc    "        
[17,] "NULL"                                       ""                                           
[18,] " crginc : Cargill, Inc.    "                ""                                           
[19,] "NULL"                                       ""                                           
[20,] " cytmxt : CytomX Therapeutics, Inc.    "    ""                                           
      [,3]                              
 [1,] ""                                
 [2,] ""                                
 [3,] ""                                
 [4,] ""                                
 [5,] ""                                
 [6,] " cnyc : JPMorgan Chase & Co.    "
 [7,] ""                                
 [8,] ""                                
 [9,] ""                                
[10,] ""                                
[11,] ""                                
[12,] ""                                
[13,] ""                                
[14,] ""                                
[15,] ""                                
[16,] ""                                
[17,] ""                                
[18,] ""                                
[19,] ""                                
[20,] ""

【讨论】：

【解决方案2】：

这是一个使用 dplyr 和 tidyr 的单列。
关键是使用separate_rows函数分离成不确定的行数，然后到pivot_wider转换回所需的数据帧。

library(tidyr)
library(dplyr)
df %>% separate_rows(CO, sep="\\|") %>% 
                group_by(grp, WC) %>% 
                mutate(ColID=row_number()) %>%   
                pivot_wider(id_cols=c(grp, WC), names_from = ColID, values_from = CO)

【讨论】：