【Question title】: Splitting columns based on separator into an undefined number of columns
【Posted】: 2020-04-23 15:28:00
【Question description】:

I am trying to split my data into new columns based on |. For example, I have an observation like this:

fdic : Federal Deposit Insurance Corp | unbco : United Bancorp Inc Ohio

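I want to split this into two columns at the |. However, some observations have no separator and some have more than 2 separators, so I cannot use separate from tidyr. I have the following line, as.data.frame(do.call(rbind, strsplit(xx$CO, "\\|"))), which almost does what I want, but it repeats observations when separating.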

That is, for the first observation:

evgnl : Evogene Limited | monsan : Monsanto Company

Columns 1 and 2 are split correctly, but column 1 is then repeated in column 3:

 evgnl : Evogene Limited     monsan : Monsanto Company       evgnl : Evogene Limited 

I would like these observations to have NA values instead:

 evgnl : Evogene Limited     monsan : Monsanto Company       NA 

Data:

structure(list(grp = c("10163", "8518", "2533", "6604", "7984", 
"10689", "1911", "8092", "3091", "10878", "2193", "102", "214", 
"4486", "8789", "8352", "10769", "10366", "6406", "8634"), WC = c(" 2,685 words    ", 
" 632 words    ", " 139 words    ", " 359 words    ", " 3,610 words    ", 
" 448 words    ", " 185 words    ", " 2,321 words    ", " 192 words    ", 
" 830 words    ", " 803 words    ", " 4,697 words    ", " 4,649 words    ", 
" 748 words    ", " 1,029 words    ", " 3,125 words    ", " 44 words    ", 
" 3,212 words    ", " 1,150 words    ", " 774 words    "), CO = c(" evgnl : Evogene Limited | monsan : Monsanto Company    ", 
" codvbc : Codorus Valley Bancorp Inc    ", " blycon : Blyth Inc    ", 
" icfcns : ICF International Inc.    ", " fossil : Fossil Group Inc    ", 
" jpmsi : JP Morgan Securities LLC | rganus : Reinsurance Group of America Inc | cnyc : JPMorgan Chase & Co.    ", 
" usxmar : US Steel Corp    ", "NULL", " toro : The Toro Company    ", 
" casms : CAS Medical Systems Inc    ", " fdic : Federal Deposit Insurance Corp | unbco : United Bancorp Inc Ohio    ", 
" crane : Crane Co    ", " pplres : PPL Corp    ", " unnatf : United Natural Foods Inc    ", 
" intgxc : IntelGenx Technologies Corp.    ", " gordmi : Gordmans Stores, Inc. | scp : Sun Capital Partners Inc    ", 
"NULL", " crginc : Cargill, Inc.    ", "NULL", " cytmxt : CytomX Therapeutics, Inc.    "
)), class = "data.frame", row.names = c(NA, -20L))
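The repetition described above is what rbind does when it is handed vectors of different lengths: the shorter ones are recycled to the longest length. A minimal base R sketch of one way around this is to pad each split result with NA before binding. The vector `co` is a small stand-in built from three values in the data above, and `parts`, `n_max`, `padded` and `res` are illustrative names only.

```r
# Small stand-in for the CO column (values copied from the data above)
co <- c(" evgnl : Evogene Limited | monsan : Monsanto Company    ",
        " codvbc : Codorus Valley Bancorp Inc    ",
        " jpmsi : JP Morgan Securities LLC | rganus : Reinsurance Group of America Inc | cnyc : JPMorgan Chase & Co.    ")

# Split on a literal "|" and trim the whitespace around each piece
parts <- lapply(strsplit(co, "|", fixed = TRUE), trimws)

# Pad every piece vector to the longest length; `length<-` fills with NA
n_max  <- max(lengths(parts))
padded <- lapply(parts, `length<-`, n_max)

# All vectors now have equal length, so rbind has nothing to recycle
res <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
res
```

With equal-length inputs, rows that had fewer pieces keep their NA padding instead of repeating earlier values.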

【Question discussion】:

    Tags: r


    【Solution 1】:

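    data.table::tstrsplit lets you do this with the argument fixed = FALSE, so that "\\|" is read as a regular expression matching a literal |. Rows with fewer pieces are padded with NA: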

    library(data.table)
    setDT(df)
    df[,tstrsplit(CO, "\\|", fixed = FALSE)]
     V1                                          V2
     1:                   evgnl : Evogene Limited                monsan : Monsanto Company    
     2:    codvbc : Codorus Valley Bancorp Inc                                            <NA>
     3:                     blycon : Blyth Inc                                            <NA>
     4:        icfcns : ICF International Inc.                                            <NA>
     5:              fossil : Fossil Group Inc                                            <NA>
     6:          jpmsi : JP Morgan Securities LLC   rganus : Reinsurance Group of America Inc 
     7:                 usxmar : US Steel Corp                                            <NA>
     8:                                       NULL                                        <NA>
     9:                toro : The Toro Company                                            <NA>
    10:        casms : CAS Medical Systems Inc                                            <NA>
    11:     fdic : Federal Deposit Insurance Corp          unbco : United Bancorp Inc Ohio    
    12:                       crane : Crane Co                                            <NA>
    13:                      pplres : PPL Corp                                            <NA>
    14:      unnatf : United Natural Foods Inc                                            <NA>
    15:  intgxc : IntelGenx Technologies Corp.                                            <NA>
    16:            gordmi : Gordmans Stores, Inc.           scp : Sun Capital Partners Inc    
    17:                                       NULL                                        <NA>
    18:                 crginc : Cargill, Inc.                                            <NA>
    19:                                       NULL                                        <NA>
    20:     cytmxt : CytomX Therapeutics, Inc.                                            <NA>
                                      V3
     1:                             <NA>
     2:                             <NA>
     3:                             <NA>
     4:                             <NA>
     5:                             <NA>
     6:  cnyc : JPMorgan Chase & Co.    
     7:                             <NA>
     8:                             <NA>
     9:                             <NA>
    10:                             <NA>
    11:                             <NA>
    12:                             <NA>
    13:                             <NA>
    14:                             <NA>
    15:                             <NA>
    16:                             <NA>
    17:                             <NA>
    18:                             <NA>
    19:                             <NA>
    20:                             <NA>
    
    

    You end up with a data.table object (an enhanced data.frame).
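If the split pieces should sit next to the original columns rather than in a separate table, something along these lines may work. This is only a sketch on a three-row stand-in for the question's data; the table `dt` and the column names `CO_1`, `CO_2`, ... are hypothetical choices, not part of the original answer.

```r
library(data.table)

# Small stand-in for the question's data (three rows copied from it)
dt <- data.table(
  grp = c("10163", "8518", "10689"),
  CO  = c(" evgnl : Evogene Limited | monsan : Monsanto Company    ",
          " codvbc : Codorus Valley Bancorp Inc    ",
          " jpmsi : JP Morgan Securities LLC | rganus : Reinsurance Group of America Inc | cnyc : JPMorgan Chase & Co.    ")
)

# Split on a literal "|"; tstrsplit pads short rows with NA (fill = NA by default)
pieces <- tstrsplit(dt$CO, "|", fixed = TRUE)

# Trim stray whitespace from every piece (NA stays NA)
pieces <- lapply(pieces, trimws)

# Hypothetical column names: one per piece, however many there turn out to be
new_cols <- paste0("CO_", seq_along(pieces))
dt[, (new_cols) := pieces]
dt
```

Because the names are generated from the number of pieces actually found, the code does not need to know in advance how many separators the longest row contains.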

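    stringr

    You can also use stringr and end up with a matrix. Note that rows with fewer pieces are padded with empty strings here rather than NA: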

     stringr::str_split(df$CO, "\\|", simplify = TRUE)
          [,1]                                         [,2]                                         
     [1,] " evgnl : Evogene Limited "                  " monsan : Monsanto Company    "             
     [2,] " codvbc : Codorus Valley Bancorp Inc    "   ""                                           
     [3,] " blycon : Blyth Inc    "                    ""                                           
     [4,] " icfcns : ICF International Inc.    "       ""                                           
     [5,] " fossil : Fossil Group Inc    "             ""                                           
     [6,] " jpmsi : JP Morgan Securities LLC "         " rganus : Reinsurance Group of America Inc "
     [7,] " usxmar : US Steel Corp    "                ""                                           
     [8,] "NULL"                                       ""                                           
     [9,] " toro : The Toro Company    "               ""                                           
    [10,] " casms : CAS Medical Systems Inc    "       ""                                           
    [11,] " fdic : Federal Deposit Insurance Corp "    " unbco : United Bancorp Inc Ohio    "       
    [12,] " crane : Crane Co    "                      ""                                           
    [13,] " pplres : PPL Corp    "                     ""                                           
    [14,] " unnatf : United Natural Foods Inc    "     ""                                           
    [15,] " intgxc : IntelGenx Technologies Corp.    " ""                                           
    [16,] " gordmi : Gordmans Stores, Inc. "           " scp : Sun Capital Partners Inc    "        
    [17,] "NULL"                                       ""                                           
    [18,] " crginc : Cargill, Inc.    "                ""                                           
    [19,] "NULL"                                       ""                                           
    [20,] " cytmxt : CytomX Therapeutics, Inc.    "    ""                                           
          [,3]                              
     [1,] ""                                
     [2,] ""                                
     [3,] ""                                
     [4,] ""                                
     [5,] ""                                
     [6,] " cnyc : JPMorgan Chase & Co.    "
     [7,] ""                                
     [8,] ""                                
     [9,] ""                                
    [10,] ""                                
    [11,] ""                                
    [12,] ""                                
    [13,] ""                                
    [14,] ""                                
    [15,] ""                                
    [16,] ""                                
    [17,] ""                                
    [18,] ""                                
    [19,] ""                                
    [20,] ""
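Since the question asks for NA rather than empty strings, the matrix can be post-processed. The following is a minimal sketch on a two-row stand-in vector `co` (values copied from the question). It assumes no row contains a genuinely empty piece such as "a||b", because that would be turned into NA as well.

```r
library(stringr)

# Small stand-in for the CO column (values copied from the question)
co <- c(" evgnl : Evogene Limited | monsan : Monsanto Company    ",
        " codvbc : Codorus Valley Bancorp Inc    ")

# simplify = TRUE returns a character matrix, padded with "" for short rows
m <- str_split(co, fixed("|"), simplify = TRUE)

# Trim whitespace; assigning into m[] keeps the matrix dimensions
m[] <- str_trim(m)

# Replace the "" padding with NA
m[m == ""] <- NA
m
```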
    


      【Solution 2】:

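      Here is a one-liner using dplyr and tidyr.
      The key is to use the separate_rows function to split each value into an indeterminate number of rows, then pivot_wider to convert the result back into the desired data frame.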

      library(tidyr)
      library(dplyr)
      df %>%
        separate_rows(CO, sep = "\\|") %>%   # one row per "|"-separated piece
        group_by(grp, WC) %>%
        mutate(ColID = row_number()) %>%     # position of each piece within its original row
        pivot_wider(id_cols = c(grp, WC), names_from = ColID, values_from = CO)
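The same pipeline can be extended to clean the values along the way. The sketch below runs on a three-row stand-in called `toy` (rows copied from the question), treats the literal string "NULL" as missing, absorbs the whitespace around the separator, and prefixes the new columns with a hypothetical `CO_`. These cleaning steps are additions for illustration and not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Small stand-in for the question's data (rows copied from it)
toy <- data.frame(
  grp = c("10163", "8518", "8092"),
  CO  = c(" evgnl : Evogene Limited | monsan : Monsanto Company    ",
          " codvbc : Codorus Valley Bancorp Inc    ",
          "NULL"),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  mutate(CO = na_if(trimws(CO), "NULL")) %>%   # trim, then "NULL" string -> NA
  separate_rows(CO, sep = "\\s*\\|\\s*") %>%    # split and drop spaces around "|"
  group_by(grp) %>%
  mutate(ColID = row_number()) %>%              # position of each piece within grp
  ungroup() %>%
  pivot_wider(id_cols = grp, names_from = ColID,
              values_from = CO, names_prefix = "CO_")
wide
```

Grouping by `grp` alone relies on `grp` being unique per original row, as it is in the question's data; otherwise a separate row identifier would be needed.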
      
