【Question title】: Extracting values from a column using tidyr
【Posted】: 2023-03-30 11:08:01
【Question】:

I have a data.frame annot defined as:

annot <- structure(list(Name = c("dd_1", "dd_2", "dd_3","dd_4", "dd_5", "dd_6","dd_7"), GOs = 
c("C:extracellular space; C:cell body; P:cell migration process; P:NF/ß pathway", 
   "C:Signal transduction; C:nucleus; F:positive regulation; P:single organism; P:positive(+) regulation",
   "C:cardiomyceltes; C:intracellular pace; F:putative; F:magnesium ion binding; F:calcium ion binding; P:visual perception; P:blood coagulation",
   "F:poly(A) RNA binding; P:DNA-templated transcription, initiation",
    "C:ULK1-ATG13-FIP200 complex; F:histone-arginine N-methyltransferase activity; P:single-organism cellular process",
    "F:3'-5' DNA helicase activity; P:acetate-CoA ligase activity",
    "F:UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate-D-alanyl-D-alanine ligase activity; P:oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor"
)), .Names = c("Name", "GOs"), class = "data.frame", row.names = c(NA, 
-7L))

The data.frame looks like this:

Name     GOs
dd_1     C:extracellular space; C:cell body; P:cell migration process; P:NF/ß pathway 
dd_2     C:Signal transduction; C:nucleus; F:positive regulation; P:single organism; P:positive(+) regulation
dd_3     C:cardiomyceltes; C:intracellular pace; F:putative; F:magnesium ion binding; F:calcium ion binding; P:visual perception; P:blood coagulation
dd_4     F:poly(A) RNA binding; P:DNA-templated transcription, initiation
dd_5     C:ULK1-ATG13-FIP200 complex; F:histone-arginine N-methyltransferase activity; P:single-organism cellular process
dd_6     F:3'-5' DNA helicase activity; P:acetate-CoA ligase activity
dd_7     F:UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate-D-alanyl-D-alanine ligase activity; P:oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor

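Each entry contains C, F and P terms made up of words, special characters and alphanumeric characters. I want to split the values corresponding to C:xxx; F:yyy; P:zzz into separate columns, one column per category, like this: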

Name  Component                              Function                                         P
dd_1  C:extracellular space;C:cell body      F:transport carrier                              P:cell migration process;P:NF/ß pathway
dd_2  C:Signal transduction;C:nucleus        F:positive regulation                            P:single organism;P:positive regulation
dd_3  C:cardiomyceltes;C:intracellular pace  F:magnesium ion binding;F:calcium ion binding    P:visual perception;P:blood coagulation
dd_4                                         F:poly(A) RNA binding                            P:DNA-templated transcription, initiation
dd_5  C:ULK1-ATG13-FIP200 complex            F:histone-arginine N-methyltransferase activity  P:single-organism cellular process
dd_6                                         F:3'-5' DNA helicase activity                    P:acetate-CoA ligase activity
dd_7                                         F:UDP-N-acetylmuramoylalanyl-D-glutamyl-2,6-diaminopimelate-D-alanyl-D-alanine ligase activity  P:oxidoreductase activity, acting on the aldehyde or oxo group of donors, NAD or NADP as acceptor

I tried the following tidyr command in R:

separate(annot, GOs, into = c("P", "F", "C"), sep = "[a-z]+=")

but it returned the following error:

Error: Values not split into 3 pieces at 1, 2, 3,4

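【Comments】:

  • Please check whether the updated solution works.
  • @akrun Your earlier tidyr solution worked fine on the sample data I provided. But my original file contains many special characters such as (), /, 0-9 and commas, and some rows have only P, only F or only C. As a result tidyr throws an error saying the regex could not be matched for many entries, although it works fine for rows similar to the data I provided earlier. I have now updated the dataset to cover as many cases as possible. I will also share the file with you shortly.
  • Your new dataset works fine with the updated strsplit solution.

Tags: regex r gsub tidyr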


【Answer 1】:

I think you would be better off with a tidy (long) format like this:

library(tidyr)
library(dplyr)
annot %>%
  as_tibble() %>%  # tbl_df() is deprecated in current dplyr
  mutate(GOs = strsplit(GOs, "; ")) %>% # split each GO into a vector
  unnest(GOs) %>%  # unnest the vectors into multiple rows
  separate(GOs, c("type", "value"), ":") 
#> Source: local data frame [25 x 3]
#> 
#>    Name type                  value
#> 1  dd_1    C    extracellular space
#> 2  dd_1    C              cell body
#> 3  dd_1    P cell migration process
#> 4  dd_1    P           NF/ß pathway
#> 5  dd_2    C    Signal transduction
#> 6  dd_2    C                nucleus
#> 7  dd_2    F    positive regulation
#> 8  dd_2    P        single organism
#> 9  dd_2    P positive(+) regulation
#> 10 dd_3    C         cardiomyceltes
#> ..  ...  ...                    ...
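
If the one-column-per-category layout from the question is still needed, the long format can be collapsed and spread back out. The following is only a minimal sketch, assuming tidyr >= 1.0 (for `pivot_wider()`) and dplyr >= 1.0 (for the `.groups` argument). `annot_demo` is a hypothetical two-row stand-in for `annot`, and the `type` column is derived with `sub()` instead of `separate()` so that the `C:`/`F:`/`P:` prefixes are kept in the values:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `annot`: two rows taken from the question's data
annot_demo <- data.frame(
  Name = c("dd_2", "dd_6"),
  GOs = c(
    "C:Signal transduction; C:nucleus; F:positive regulation; P:single organism; P:positive(+) regulation",
    "F:3'-5' DNA helicase activity; P:acetate-CoA ligase activity"
  ),
  stringsAsFactors = FALSE
)

wide <- annot_demo %>%
  mutate(GOs = strsplit(GOs, "; ")) %>%   # one vector of terms per row
  unnest(GOs) %>%                         # one term per row
  mutate(type = sub(":.*", "", GOs)) %>%  # "C", "F" or "P"
  group_by(Name, type) %>%
  summarise(GOs = paste(GOs, collapse = ";"), .groups = "drop") %>%
  pivot_wider(names_from = type, values_from = GOs)  # absent categories become NA

wide
```

Rows that lack a category (here `dd_6` has no `C` term) simply get `NA` in that column, so no row-specific regex is required.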

【Comments】:

    【Answer 2】:

    You can try strsplit:

    res <- do.call(rbind.data.frame,lapply(strsplit(annot$GOs, ";"), 
          function(x) tapply(x, sub(':.*', '', x), FUN=paste, collapse=";")))
    
    res1 <-  data.frame(Name=annot[,1], setNames(res, c('Component',
         'Function', 'P')), stringsAsFactors=FALSE)
    
    res1
    #   Name                             Component
    #1 dd_1     C:extracellular space;C:cell body
    #2 dd_2       C:Signal transduction;C:nucleus
    #3 dd_3 C:cardiomyceltes;C:intracellular pace
    #                                                 Function
    #1                                      F:transport carrier
    #2                                    F:positive regulation
    #3 F:putative;F:magnesium ion binding;F:calcium ion binding
    #                                       P
    #1 P:cell migration process;P:NF/ß pathway
    #2 P:single organism;P:positive regulation
    #3 P:visual perception;P:blood coagulation
    

    Or you can try extract from tidyr:

    library(tidyr)
    extract(annot, GOs, c('C', 'F', 'P'), '(C:[^F]+);(F:[^P]+);(P:.*)')
    # Name                                      C
    #1 dd_1     C:extracellular space;C:cell body
    #2 dd_2       C:Signal transduction;C:nucleus
    #3 dd_3 C:cardiomyceltes;C:intracellular pace
    #                                                        F
    #1                                      F:transport carrier
    #2                                    F:positive regulation
    #3 F:putative;F:magnesium ion binding;F:calcium ion binding
    #                                       P
    #1 P:cell migration process;P:NF/ß pathway
    #2 P:single organism;P:positive regulation
    #3 P:visual perception;P:blood coagulation
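
The regex above expects all three groups to be present, in the order C, F, P, with no space after each semicolon. As documented for `tidyr::extract()`, a row where the groups do not match yields `NA` in every output column. A small sketch on made-up data (`demo` and its values are hypothetical, not from the question) illustrates this:

```r
library(tidyr)

# Hypothetical toy data: row "a" has all three categories, row "b" has no C term
demo <- data.frame(
  Name = c("a", "b"),
  GOs = c("C:x;C:y;F:z;P:w", "F:z;P:w"),
  stringsAsFactors = FALSE
)

res_demo <- tidyr::extract(
  demo, GOs, c("C", "F", "P"),
  "(C:[^F]+);(F:[^P]+);(P:.*)"
)

res_demo
```

Row `b` comes back with `NA` in `C`, `F` and `P` even though it does contain `F` and `P` terms, which is why this approach is only suitable when every row has all three categories. Note also that `[^F]+` and `[^P]+` stop at any capital F or P inside a term, so this pattern is fragile for free-text values.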
    

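    Update

    In the updated dataset, some rows lack one or more of the categories (i.e. 'C', 'F', etc.). You can modify the first solution so that missing categories become NA: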

    res <- do.call(rbind.data.frame, lapply(strsplit(annot$GOs, "; "), function(x) {
          # collapse the terms of each category ("C", "F", "P") into one string
          x1 <- tapply(x, sub(':.*', '', x), FUN=paste, collapse=";")
          # reorder to C, F, P; categories absent from this row become NA
          x1[match(c('C', 'F', 'P'), names(x1))]}))
    res1 <- data.frame(Name=annot[,1], setNames(res, c('Component',
             'Function', 'P')), stringsAsFactors=FALSE)
    head(res1,2)
     #  Name                         Component              Function
     #1 dd_1 C:extracellular space;C:cell body                  <NA>
     #2 dd_2   C:Signal transduction;C:nucleus F:positive regulation
     #                                          P
     #1    P:cell migration process;P:NF/ß pathway
     #2 P:single organism;P:positive(+) regulation
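
The same idea can be written without `rbind.data.frame`, which may be easier to adapt. The sketch below is an illustration only, not the code from this answer: `collapse_by_type` is a hypothetical helper name, it assumes terms are separated by a semicolon with optional whitespace, and it calls `as.character()` first so that a factor column does not trigger the "non-character argument" error from `strsplit`:

```r
# Hypothetical helper: one column per category, NA where a category is absent
collapse_by_type <- function(gos, types = c("C", "F", "P")) {
  # as.character() guards against factor input; split on ";" plus optional spaces
  parts <- strsplit(as.character(gos), ";\\s*")
  out <- lapply(types, function(tp) {
    vapply(parts, function(x) {
      # keep only the terms that start with e.g. "C:"
      hit <- x[!is.na(x) & startsWith(x, paste0(tp, ":"))]
      if (length(hit) == 0) NA_character_ else paste(hit, collapse = ";")
    }, character(1))
  })
  names(out) <- types
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Made-up input, deliberately stored as a factor
gos_demo <- factor(c("C:a; C:b; P:c", "F:d"))
by_type <- collapse_by_type(gos_demo)
by_type
```

With the question's data this could be used as `data.frame(Name = annot$Name, collapse_by_type(annot$GOs))`, after which the columns can be renamed to Component, Function and P.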
    

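    【Comments】:

    • I tried your strsplit function, but when I run the command it gives me an error: "Error in strsplit(annot$GOs, ";"): non-character argument"
    • Sorry for the confusion - I added the dput to the question and assumed the column would be character.
    • @docendodiscimus That's fine.
    • @docendodiscimus I tried the tidyr code on my actual dataset. It gave me the error "Error in UseMethod("extract_"): no applicable method for 'extract_' applied to an object of class "factor"". How can I fix it?
    • @akrun I tried it, but it gives the following error: Error in UseMethod("extract_"): no applicable method for 'extract_' applied to an object of class "character"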