[Question title]: Converting a list of tab- and newline-delimited character vectors into a dataframe in R
[Posted]: 2021-10-20 01:29:59
[Question]:

I have an object that is a list of tab- and newline-delimited character vectors.

 >dput(res)
 list("# BLASTP 2.11.0+\n# Query: tr|A4I9M8|A4I9M8_LEIIN 
 Hypothetical_protein_-_conserved OS=Leishmania infantum OX=5671 
 GN=LINF_340011400 PE=4 SV=1\n# RID: HT7Z4V6H016\n# Database: nr\n# 
 Fields: subject id, evalue, % query coverage per subject, % identity\n# 
 35 hits found\nref|XP_001468447.1|\t8.15e- 
 82\t100\t100.000\nref|XP_001686149.1|\t1.33e- 
 73\t100\t90.984\ndbj|GET92147.1|\t4.31e- 
 67\t99\t82.645\ngb|KAG5468145.1|\t2.87e- 
 66\t100\t82.787\nref|XP_003878644.1|\t1.96e- 
 64\t100\t88.525\n# BLASTP 2.11.0+\n# Query: tr|A4HT95|A4HT95_LEIIN 
 Centrin_-_putative OS=Leishmania infantum OX=5671 GN=LINF_070012700 
 PE=3 SV=1\n# RID: HT7Z4V6H016\n# Database: nr\n# Fields: subject id, 
 evalue, % query coverage per subject, % identity\n# 512 hits 
 found\nref|XP_001463286.1|\t1.53e- 
 132\t100\t100.000\nref|XP_001680978.1|\t9.57e- 
 125\t100\t95.187\nref|XP_003872218.1|\t2.29e- 
 121\t100\t93.048\ndbj|GET86075.1|\t1.03e- 
 114\t97\t89.071\ngb|KAG5510937.1|\t1.50e- 
 103\t100\t80.214\ngb|KAG5486865.1|\t2.99e-103\t100\t84.492\n# BLAST 
 processed 2 queries\n", 
 "")

Here is the structure of res:

>str(res)
List of 2
$ : chr "# BLASTP 2.11.0+\n# Query: tr|A4I9M8|A4I9M8_LEIIN Hypothetical_protein_-_conserved 
OS=Leishmania infantum OX=56"| __truncated__
$ : chr ""

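How can I convert res into a data frame in R? The data frame should look like this (the original post attached an image of the expected result):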

[Comments]:

    Tags: r dataframe delimiter csv


    [Solution 1]:

    This is horrible, but...

    # remove the artificial newlines from sci notation
    res_clean <- gsub("e- \n", "e-", res[[1]])
    
    # split on newline -- each element is a row
    res_split <- strsplit(res_clean[[1]], split = "\n")[[1]]
    
    # get number of variables to add
    num_vars <- max(sapply(strsplit(res_split, "\t"), length))
    
    # create empty data frame with correct number of variables
    dat <- read.csv(text = "", col.names = paste0("v", 1:num_vars))
    
    # loop over rows to add
    for (i in seq_along(res_split)) {
        # get row vector
        new_row <- strsplit(res_split[i], split = "\t")[[1]]
        # make sure row vector is correct length by appending with NAs
        new_row <- append(new_row, values = rep(NA, num_vars - length(new_row)), after = length(new_row))
        # add to data
        dat[i, ] <- new_row
    }
    
    > head(dat, 20)
                                                                          v1        v2   v3      v4
    #1                                                       # BLASTP 2.11.0+      <NA> <NA>    <NA>
    #2                                       # Query: tr|A4I9M8|A4I9M8_LEIIN       <NA> <NA>    <NA>
    #3       Hypothetical_protein_-_conserved OS=Leishmania infantum OX=5671       <NA> <NA>    <NA>
    #4                                            GN=LINF_340011400 PE=4 SV=1      <NA> <NA>    <NA>
    #5                                                     # RID: HT7Z4V6H016      <NA> <NA>    <NA>
    #6                                                         # Database: nr      <NA> <NA>    <NA>
    #7                                                                     #       <NA> <NA>    <NA>
    #8   Fields: subject id, evalue, % query coverage per subject, % identity      <NA> <NA>    <NA>
    #9                                                                     #       <NA> <NA>    <NA>
    #10                                                         35 hits found      <NA> <NA>    <NA>
    #11                                                   ref|XP_001468447.1| 8.15e- 82  100 100.000
    #12                                                   ref|XP_001686149.1| 1.33e- 73  100  90.984
    #13                                                       dbj|GET92147.1| 4.31e- 67   99  82.645
    #14                                                      gb|KAG5468145.1| 2.87e- 66  100  82.787
    #15                                                   ref|XP_003878644.1| 1.96e- 64  100  88.525
    #16                                                      # BLASTP 2.11.0+      <NA> <NA>    <NA>
    #17                                      # Query: tr|A4HT95|A4HT95_LEIIN       <NA> <NA>    <NA>
    #18  Centrin_-_putative OS=Leishmania infantum OX=5671 GN=LINF_070012700       <NA> <NA>    <NA>
    #19                                                             PE=3 SV=1      <NA> <NA>    <NA>
    #20                                                    # RID: HT7Z4V6H016      <NA> <NA>    <NA>
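
If only the hit rows are needed, the row-by-row loop above can probably be avoided. The following is a minimal sketch, not the answerer's method: it assumes the string in `res[[1]]` contains no artificial line wraps, so every comment line starts with `#` and every hit line holds four tab-separated fields. `blast_txt` is a shortened stand-in for `res[[1]]`, and the column names are assumptions taken from the `# Fields:` line.

```r
# Shortened stand-in for res[[1]] (assumption: no artificial line wraps)
blast_txt <- paste0(
  "# BLASTP 2.11.0+\n",
  "# Query: tr|A4I9M8|A4I9M8_LEIIN Hypothetical_protein_-_conserved\n",
  "# Fields: subject id, evalue, % query coverage per subject, % identity\n",
  "# 35 hits found\n",
  "ref|XP_001468447.1|\t8.15e-82\t100\t100.000\n",
  "ref|XP_001686149.1|\t1.33e-73\t100\t90.984\n",
  "# BLAST processed 2 queries\n"
)

# comment.char = "#" makes read.table() ignore the BLAST comment lines;
# quote = "" stops quote characters in IDs from being interpreted.
hits <- read.table(
  text = blast_txt,
  sep = "\t",
  comment.char = "#",
  quote = "",
  col.names = c("subject_id", "evalue", "query_coverage", "identity"),
  stringsAsFactors = FALSE
)

str(hits)
```

With the real data this would be called as `read.table(text = res[[1]], ...)`. The numeric columns are parsed directly, so `evalue` can be sorted or filtered without further conversion.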
    

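    [Discussion]:

    • Thanks for the effort, but the data frame looks somewhat distorted.
    • Ah yes, there are some artificial newlines in there. Check the edit now; how does it look?
    • The rows with values in columns v2, v3, and v4 are fixed. But the comment lines are split across multiple rows. Check the image attached to the question; that is what the data frame should look like.
    • If you look at what is printed from res, there are newlines everywhere (for example after the hashes, which is why the comment lines are split across multiple rows). I'm not sure what you can do to deal with that, short of some selective gsub calls like the one I showed above.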
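
One way to sidestep the split comment lines is to treat them as metadata rather than rows. The sketch below is only an illustration under stated assumptions: the string has no artificial line wraps, each query block contains a `# Query:` line before its hits, and hit lines do not start with `#`. `blast_txt` is a shortened stand-in for `res[[1]]`, and the column names (including `query`) are hypothetical.

```r
# Shortened two-query stand-in for res[[1]] (assumption: no artificial line wraps)
blast_txt <- paste0(
  "# BLASTP 2.11.0+\n",
  "# Query: tr|A4I9M8|A4I9M8_LEIIN Hypothetical_protein_-_conserved\n",
  "# 35 hits found\n",
  "ref|XP_001468447.1|\t8.15e-82\t100\t100.000\n",
  "ref|XP_001686149.1|\t1.33e-73\t100\t90.984\n",
  "# BLASTP 2.11.0+\n",
  "# Query: tr|A4HT95|A4HT95_LEIIN Centrin_-_putative\n",
  "# 512 hits found\n",
  "ref|XP_001463286.1|\t1.53e-132\t100\t100.000\n",
  "# BLAST processed 2 queries\n"
)

lines <- strsplit(blast_txt, "\n", fixed = TRUE)[[1]]

is_query <- startsWith(lines, "# Query: ")
is_hit   <- nzchar(lines) & !startsWith(lines, "#")

# Text of each "# Query:" line, and a running block number for every line
queries <- sub("^# Query: ", "", lines[is_query])
block   <- cumsum(is_query)

# Split the hit lines on tabs and bind them into a four-column data frame
fields <- strsplit(lines[is_hit], "\t", fixed = TRUE)
hits <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(hits) <- c("subject_id", "evalue", "query_coverage", "identity")

# Convert the numeric fields and label each hit with the query it belongs to
num_cols <- c("evalue", "query_coverage", "identity")
hits[num_cols] <- lapply(hits[num_cols], as.numeric)
hits$query <- queries[block[is_hit]]

hits
```

Each hit row then carries its query label in one column, so the two result sets stay distinguishable after the comment lines are dropped.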