【问题标题】:Parse txt file to dataframe in R将txt文件解析为R中的数据框
【发布时间】:2020-12-16 17:45:40
【问题描述】:

我有一个包含目标预测信息的 txt 文件,我想将其解析为 R 中的数据框。文件中的信息已经采用最简单的方式。每行将成为未来数据框中的一行,只有 4 列,应该如下所示:

  MicroRNA                          Transcript                         Type   Energy
miR-981|LQNS02278082.1_33127_3p      TRINITY_GG_20135_c0_g1_i5.mrna1   7_A1   -0.70

但是,我在 R 中所做的事情不起作用。

a <- read_lines("results")
> head(a)
[1] "MicroRNA   = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType       = 7_A1\t\tEnergy     = -0.70 Kcal/mol"                        
[2] "MicroRNA   = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType       = 7_A1\t\tEnergy     = -5.77 Kcal/mol"                        
[3] "MicroRNA   = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType       = 7_A1\t\tEnergy     = -1.77 Kcal/mol"                                                        
[4] "MicroRNA   = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType       = 7_A1\t\tEnergy     = -5.20 Kcal/mol"                                                        
[5] "MicroRNA   = LQNS02278075.1_32377_3p\t\tTranscript = TRINITY_GG_143691_c0_g1_i3.mrna1 Dir=sense TAG=Acidic phospholipase A2 PA4\t\tType       = 7_A1\t\tEnergy     = -3.30 Kcal/mol"                                                       
[6] "MicroRNA   = miR-317|LQNS02000228.1_2413_3p\t\tTranscript = TRINITY_GG_4592_c2_g1_i10.mrna1 Dir=sense TAG=Serine/threonine-protein phosphatase 2A regulatory subunit B'' subunit gamma\t\tType       = 7_m8\t\tEnergy     = -6.35 Kcal/mol"


dput(head(a,4))
c("MicroRNA   = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType       = 7_A1\t\tEnergy     = -0.70 Kcal/mol", 
"MicroRNA   = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType       = 7_A1\t\tEnergy     = -5.77 Kcal/mol", 
"MicroRNA   = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType       = 7_A1\t\tEnergy     = -1.77 Kcal/mol", 
"MicroRNA   = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType       = 7_A1\t\tEnergy     = -5.20 Kcal/mol"
)
re <- rex(
    
    capture(name = "MicroRNA", alpha),
    "[",
    spaces,
    capture(name = "Transcript", alpha),
    "[",
    spaces,
    capture(name = "Type", alpha),
    "[",
    spaces,
    capture(name = "Energy", digits),
    "]:")

 re_matches(a, re)

    MicroRNA Transcript Type Energy
1       <NA>       <NA> <NA>   <NA>
2       <NA>       <NA> <NA>   <NA>
3       <NA>       <NA> <NA>   <NA>

知道如何在 R 或 shell 中执行此操作吗? 谢谢!

【问题讨论】:

    标签: r dataframe parsing text


    【解决方案1】:

    您可以尝试使用正则表达式

        library(stringr)
        
        #Example data
        data <- c("MicroRNA   = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType       = 7_A1\t\tEnergy     = -0.70 Kcal/mol", 
          "MicroRNA   = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType       = 7_A1\t\tEnergy     = -5.77 Kcal/mol", 
          "MicroRNA   = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType       = 7_A1\t\tEnergy     = -1.77 Kcal/mol", 
          "MicroRNA   = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType       = 7_A1\t\tEnergy     = -5.20 Kcal/mol"
        )
        
        #Split the data
        lines_split <- strsplit(data, split="\t\t", fixed=TRUE)
        #No of columns
        cols=1:4
        #Rbind rows
        df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", cols)))
        
        #Extract any info after =
        df[,-2] <- lapply(df[,-2],function(x) trimws(sub('.*=', '', x)))
        
        #Since this variable has two =, we extract info between = and Dir as per your 
        #Output
        df$V2 <- str_match(df$V2, "=\\s*(.*?)\\s*Dir")[,2]
        
        #Removing Kcal/mol
        df$V4 <- as.numeric(str_replace(df$V4,"Kcal/mol",""))
    

    【讨论】:

    • 谢谢,这就是完美的工作!
    【解决方案2】:

    使用read.table

    r <- read.table(text=a, sep="\t", colClasses=c(NA, "NULL"), header=TRUE)
    nn <- unname(sapply(r, function(x) trimws(unique(sapply(strsplit(x, "="), `[`, 1)))))
    res <- setNames(as.data.frame(sapply(r, function(x) sapply(strsplit(x, "="), `[`, 2))), nn)
    

    结果:

    res
    #                           MicroRNA                           Transcript  Type          Energy
    # 1  miR-981|LQNS02278082.1_33127_3p  TRINITY_GG_20135_c0_g1_i5.mrna1 Dir  7_A1  -5.77 Kcal/mol
    # 2          LQNS02278125.1_38470_3p  TRINITY_GG_22182_c1_g1_i2.mrna1 Dir  7_A1  -1.77 Kcal/mol
    # 3          LQNS02278125.1_38470_3p  TRINITY_GG_22182_c1_g1_i2.mrna1 Dir  7_A1  -5.20 Kcal/mol
    

    【讨论】:

    • 谢谢!它部分有效。不知何故,当我尝试输入我的数据(比示例中的大)时,一些数据已经丢失。 &gt; length(a) [1] 3577359 &gt; dim(res) [1] 1754993 4
    猜你喜欢
    • 2021-10-17
    • 1970-01-01
    • 2023-03-12
    • 1970-01-01
    • 1970-01-01
    • 2014-05-05
    • 2016-04-07
    • 2018-02-17
    • 2020-09-22
    相关资源
    最近更新 更多