【Question title】: Breaking up a character string into multiple character strings on different lines
【Posted】: 2011-12-05 19:41:06
【Question】:

I have a data frame that contains long character strings, each associated with a "Sample":

Sample  Data
  1     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
  2     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N

I would like a simple way to break each string into 5 pieces, laid out as follows:

Sample X
CCT6 - Characters 1-33
GAT1 - Characters 34-68
IMD3 - Characters 69-99
PDR3 - Characters 100-130
RIM15 - Characters 131-168

so that each sample produces output like this:

Sample 1
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N

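I have been able to split the long string into separate pieces with the substr function, but I would like to automate this so that I get all 5 pieces in a single output. Ideally that output would also be a data frame.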

【Question comments】:

    Tags: r character dataframe


    【Solution 1】:

    This is what ?read.fwf is for.

    First, some data that look like those in your question:

    x <- data.frame(Sample = c(1, 2), Data = c("000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N", 
    "000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"), 
    stringsAsFactors = FALSE)
    

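    Now use read.fwf, specifying the width of each field and its name, and that all fields should be read as character. We wrap the text column of the example data in textConnection so that it can be treated as a connection, which read.* and other functions universally understand.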

    (strs <- read.fwf(textConnection(x$Data), widths = c(33, 35, 31, 31, 38), colClasses = "character", col.names = c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")))
    
    
                                   CCT6                                GAT1                            IMD3                            PDR3                                  RIM15
    1 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N
    2 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N
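
    The widths passed to read.fwf follow directly from the character ranges in the question: each width is end - start + 1. A minimal sketch of that arithmetic, where the names starts, ends and widths are only illustrative and not part of the original answer:

```r
# Start and end positions of the five fields, as listed in the question
starts <- c(1, 34, 69, 100, 131)
ends   <- c(33, 68, 99, 130, 168)

# Width of each field = end - start + 1
widths <- ends - starts + 1
widths
# [1] 33 35 31 31 38

# read.fwf reads consecutive fields, so the ranges must be contiguous:
# every start should be one past the previous end
all(starts[-1] == ends[-length(ends)] + 1)
# [1] TRUE

# The widths therefore add up to the full string length
sum(widths)
# [1] 168
```

    If the ranges were not contiguous, read.fwf would need negative widths to skip the gaps, which is not required here.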
    

    Now loop over the rows and print each one in the format of your example:

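    # seq_len() is safe even when strs has zero rows (1:nrow(strs) would give 1, 0)
    for (i in seq_len(nrow(strs))) {
      # i is the row index; it matches the sample number only when samples are 1..n in order
      writeLines(paste("Sample", i))
      writeLines(paste(names(strs), strs[i, ], sep = " - "))
    }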
    

    This gives, for example (second sample shown):

    Sample 2
    CCT6 - 000000000000000000000000000N01000
    GAT1 - 000000000N0N000000000N00N0000NN00N0
    IMD3 - N000000100000N00N0N0000000NNNN0
    PDR3 - 1111111111111111111111111111111
    RIM15 - 0000000000000000000N000000N0000000000N
    

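    【Comments】:

    • This works great! I just don't know how to save the final data so that I can access it again later.
    • You can open a file connection and use writeLines with its con= argument, or you can use save(strs, file="strpieces.rda")
    • One problem I now have with this code is that it separates the original sample ID numbers from the data in the final structure. In my example the samples appear in order starting from 1, but that is not the case in my real data set. How can I keep the link, so that the final output attaches whatever sample ID was in the original data to the broken-up strings?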
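
    One possible way to address the last two comments, offered as a sketch rather than the answerer's own method: read.fwf returns rows in input order, so the original Sample column can be bound back on with cbind, and the result saved with saveRDS. The data below are a small made-up stand-in (IDs 7 and 3, three short fields named f1, f2, f3), chosen so the difference between row index and sample ID is visible; the object names out and rds_file are likewise illustrative.

```r
# Made-up stand-in for the real data: note the Sample IDs are not 1, 2
x <- data.frame(Sample = c(7, 3),
                Data = c("AAABBCCCC", "DDDEEFFFF"),
                stringsAsFactors = FALSE)

# Split each string into fixed-width fields, as in the answer above
con <- textConnection(x$Data)
strs <- read.fwf(con, widths = c(3, 2, 4), colClasses = "character",
                 col.names = c("f1", "f2", "f3"))
close(con)

# Rows come back in input order, so the IDs can simply be bound back on
out <- cbind(Sample = x$Sample, strs)

# Print using the real sample ID rather than the row index
for (i in seq_len(nrow(out))) {
  writeLines(paste("Sample", out$Sample[i]))
  fields <- unlist(out[i, -1])
  writeLines(paste(names(fields), fields, sep = " - "))
}

# Save the data frame and read it back later
rds_file <- tempfile(fileext = ".rds")
saveRDS(out, rds_file)
restored <- readRDS(rds_file)
```

    In real use, rds_file would be a path of your choosing instead of a temporary file.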
    【Solution 2】:
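    # Field definitions, parsed from the specification in the question
    SampX <- textConnection("CCT6 - Characters 1-33
    GAT1 - Characters 34-68
    IMD3 - Characters 69-99
    PDR3 - Characters 100-130
    RIM15 - Characters 131-168")
    # Splitting on "-" gives V1 = name, V2 = "Characters <start>", V3 = end position
    dfSampX <- read.table(SampX, sep="-")
    # V4 = numeric start position
    dfSampX$V4 <- as.numeric(sub("Characters ", "", dfSampX$V2))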
    
    sampdat <- read.table(textConnection("Sample  Data
      1     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
      2     000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
    "), header=TRUE,stringsAsFactors=FALSE)
    

    This code splits the strings into pieces:

     apply(dfSampX[,c(3,4)], 1, function(x) substr(sampdat[,2], x["V4"], x["V3"]) )
         [,1]                                [,2]                                 
    [1,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
    [2,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
         [,3]                              [,4]                             
    [1,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
    [2,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
         [,5]                                    
    [1,] "0000000000000000000N000000N0000000000N"
    [2,] "0000000000000000000N000000N0000000000N"
    

    This code delivers the pieces as a list:

    res <- lapply(sampdat$Data, function(x) 
               apply(dfSampX[,c(3,4)], 1, function(y) substr(x, y["V4"], y["V3"]) ))
    
    res2 <- lapply(res, function(x){ names(x) <- dfSampX$V1 ; return(x)} )
    res2
    
    [[1]]
                                       CCT6                                     GAT1  
         "000000000000000000000000000N01000"    "000000000N0N000000000N00N0000NN00N0" 
                                       IMD3                                     PDR3  
           "N000000100000N00N0N0000000NNNN0"        "1111111111111111111111111111111" 
                                      RIM15  
    "0000000000000000000N000000N0000000000N" 
    
    [[2]]
                                       CCT6                                     GAT1  
         "000000000000000000000000000N01000"    "000000000N0N000000000N00N0000NN00N0" 
                                       IMD3                                     PDR3  
           "N000000100000N00N0N0000000NNNN0"        "1111111111111111111111111111111" 
                                      RIM15  
    "0000000000000000000N000000N0000000000N" 
    

    And to get the requested output format:

    for (samp in seq_along(res2)) {
      cat("Sample ", samp, "\n")
      invisible(sapply(1:5, function(y)
        cat(as.character(dfSampX$V1[y]), " - ", res2[[samp]][y], "\n")))
    }
    Sample  1 
    CCT6   -  000000000000000000000000000N01000 
    GAT1   -  000000000N0N000000000N00N0000NN00N0 
    IMD3   -  N000000100000N00N0N0000000NNNN0 
    PDR3   -  1111111111111111111111111111111 
    RIM15   -  0000000000000000000N000000N0000000000N 
    Sample  2 
    CCT6   -  000000000000000000000000000N01000 
    GAT1   -  000000000N0N000000000N00N0000NN00N0 
    IMD3   -  N000000100000N00N0N0000000NNNN0 
    PDR3   -  1111111111111111111111111111111 
    RIM15   -  0000000000000000000N000000N0000000000N 
    

    The invisible() is there to suppress the list of NULLs that sapply returns (cat itself returns NULL).
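
    Since the question asks for a data frame as the final result, here is a hedged sketch of the same substr idea using substring(), which is vectorised over its start and end arguments. The field table uses the ranges from the question; the strings are synthetic stand-ins (each field filled with its own letter, so a wrong boundary would be easy to spot), and the names fields, make_string, dat, pieces and result are all illustrative.

```r
# Field boundaries as listed in the question
fields <- data.frame(name  = c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15"),
                     start = c(1, 34, 69, 100, 131),
                     end   = c(33, 68, 99, 130, 168),
                     stringsAsFactors = FALSE)

# Synthetic stand-in strings (not the real data): one letter per field
make_string <- function(letters5) {
  paste(rep(letters5, times = fields$end - fields$start + 1), collapse = "")
}
dat <- data.frame(Sample = c(12, 5),
                  Data = c(make_string(c("A", "B", "C", "D", "E")),
                           make_string(c("V", "W", "X", "Y", "Z"))),
                  stringsAsFactors = FALSE)

# One substring() call per string returns all five pieces;
# vapply() collects them into a 5 x n character matrix
pieces <- vapply(dat$Data,
                 function(s) substring(s, fields$start, fields$end),
                 character(nrow(fields)), USE.NAMES = FALSE)

# Transpose to one row per sample and keep the original Sample IDs
result <- data.frame(Sample = dat$Sample, t(pieces), stringsAsFactors = FALSE)
names(result)[-1] <- fields$name
str(result)
```

    Keeping the boundaries in a small table means a new field only needs a new row, and the Sample column travels with the pieces regardless of the order of the IDs.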

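    【Comments】:

    • Hmm... I don't think this is quite what I'm looking for. I want to be able to run a script on a data frame with many samples. In the code above, it looks as though you have typed the entire string in for each sample. I would also like my output to look like the example I gave above.
    • Have you looked at the "sampdat" object with str()? Is it different from your data? If so, please post dput() of your object.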