【Question title】:R: can extracted strings be saved into one column as separate character values?
【Posted】:2020-03-25 21:57:40
【Question】:

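Suppose I need to assign courses to people based on sentences in a comment field. (The real data is more complicated than this; I have simplified it.) I used regular expressions with regmatches(), gsub() and gregexpr() to extract strings from the comment sentences in my data. I then saved the resulting lists into columns and combined them as character columns, as shown below.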

>cbind.data.frame(level,software,month,stringsAsFactors = FALSE) 

   level                         software             month
1  c("beginner1","beginner2")    c++                  Dec       
2                      NA        Java                 Jan       
3             "beginner3"        NA                   May   
4         "intermediate2"        NA                   NA      
5                      NA        Matlab               Mar    
6             "advanced1"        c("java","c++")      Jul     

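I want to:

- put all the strings into a single column
- break the list c("beginner1","beginner2") up into "beginner1","beginner2"
- drop the NAs
- keep the result as character values, as shown below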

  newcol
 "beginner1","beginner2","c++","Dec" 
 "Java","Jan" 
 "beginner3", "May"
 "intermediate2" 
 "Matlab", "Mar"    
 "advanced1","java","c++","Jul"  

However, when I unite the columns, each row is merged into a single string.

> newcol<-unite(combined, newcol, 1:ncol(combined), remove=TRUE, sep = ",")

 "beginner1,beginner2,c++,Dec"  
 "Java,Jan" 
 "beginner3, May"
 "intermediate2" 
 "Matlab, Mar"    
 "advanced1,java,c++,Jul"  

Is it possible to store multiple strings in one column as separate character values?
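
For reference, a data frame column can itself be a list, in which case each cell can hold a character vector of any length. The following is a minimal sketch with made-up values that mirror the desired output above; the name `demo` is hypothetical and not part of the original data.

```r
# Hypothetical demo data frame; the values mirror the desired output above.
demo <- data.frame(id = 1:3)

# Assigning a list with one element per row creates a list column.
demo$newcol <- list(
  c("beginner1", "beginner2", "c++", "Dec"),
  c("Java", "Jan"),
  "intermediate2"
)

str(demo$newcol)       # a list of 3 character vectors
lengths(demo$newcol)   # 4 2 1
demo$newcol[[1]][2]    # "beginner2"
```

Each cell is then accessed with `[[`, so the individual strings stay separate rather than being pasted together.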

【Comments】:

    Tags: r string extract


    【Solution 1】:

    Here is a base R solution:

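    # Extract every double-quoted token from a cell such as c("a","b");
    # a cell without quoted tokens is returned unchanged.
    f <- Vectorize(function(u) {
      z <- unlist(regmatches(u, gregexpr('\".*?\"', u, perl = TRUE)))
      if (length(z) > 0) {
        r <- gsub('\"', "", z)  # strip the surrounding quotes
      } else {
        r <- u
      }
      r
    })
    
    # For each row, drop the NAs and apply f to the remaining cells.
    df$newcol <- apply(df, 1, function(x) f(na.omit(x)))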
    

    so that

    > df
                           level        software month                         newcol
    1 c("beginner1","beginner2")             c++   Dec beginner1, beginner2, c++, Dec
    2                       <NA>            Java   Jan                      Java, Jan
    3                  beginner3            <NA>   May                 beginner3, May
    4              intermediate2            <NA>  <NA>                  intermediate2
    5                       <NA>          Matlab   Mar                    Matlab, Mar
    6                  advanced1 c("java","c++")   Jul      advanced1, java, c++, Jul
    

    where

    > df$newcol
    $`1`
    $`1`$level
    [1] "beginner1" "beginner2"
    
    $`1`$software
    [1] "c++"
    
    $`1`$month
    [1] "Dec"
    
    
    $`2`
    $`2`$software
    [1] "Java"
    
    $`2`$month
    [1] "Jan"
    
    
    $`3`
    $`3`$level
    [1] "beginner3"
    
    $`3`$month
    [1] "May"
    
    
    $`4`
    $`4`$level
    [1] "intermediate2"
    
    
    $`5`
    $`5`$software
    [1] "Matlab"
    
    $`5`$month
    [1] "Mar"
    
    
    $`6`
    $`6`$level
    [1] "advanced1"
    
    $`6`$software
    [1] "java" "c++" 
    
    $`6`$month
    [1] "Jul"
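
    The nested output above keeps the column names inside each cell. If a flat, unnamed character vector per row is preferred, something along the following lines might work. This is only a sketch on a three-row subset of the data; the helper name `extract_values` and the simplified pattern `"[^"]*"` are assumptions for illustration, not part of the answer's code.

```r
# Three-row subset of the example data, rebuilt here so the block is self-contained.
df <- data.frame(
  level    = c('c("beginner1","beginner2")', NA, "beginner3"),
  software = c("c++", "Java", NA),
  month    = c("Dec", "Jan", "May"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the quoted tokens of a cell, or the cell itself
# when it contains no quoted tokens.
extract_values <- function(u) {
  z <- unlist(regmatches(u, gregexpr('"[^"]*"', u)))
  if (length(z) > 0) gsub('"', "", z, fixed = TRUE) else u
}

# Build one flat character vector per row and store the result as a list column.
df$newcol <- lapply(seq_len(nrow(df)), function(i) {
  row <- unlist(df[i, c("level", "software", "month")], use.names = FALSE)
  row <- row[!is.na(row)]
  unlist(lapply(row, extract_values), use.names = FALSE)
})

df$newcol[[1]]   # "beginner1" "beginner2" "c++" "Dec"
```

    With this shape, `lengths(df$newcol)` gives the number of values stored in each row.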
    

    Data

    df <- structure(list(level = c("c(\"beginner1\",\"beginner2\")", NA, 
    "beginner3", "intermediate2", NA, "advanced1"), software = c("c++", 
    "Java", NA, NA, "Matlab", "c(\"java\",\"c++\")"), month = c("Dec", 
    "Jan", "May", NA, "Mar", "Jul")), class = "data.frame", row.names = c("1", 
    "2", "3", "4", "5", "6"))
    

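    【Comments】:

    • This is great, thank you so much, but > str(df$newcol) gives chr [1:6] "beginner1,beginner2,c++,Dec" ... so each row is still read as a single string rather than "beginner1", "beginner2", "c++", "Dec". So separate strings cannot be stored in a column?
    • @rocknRrr Could you dput() your data? If they can be stored in a column, I may give it another try.
    • I wish I could share my data, but it is confidential... My data has the same structure as the df you created, though. My understanding so far is that it is not possible to store multiple comma-separated strings in one variable (or cell)..... Thank you very much!
    • @rocknRrr I think it is possible to store things in a single cell. Please see my update.
    • @rocknRrr I found that as.list is not needed in the function when using apply, which simplifies the code a bit. Please see my update.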
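
    For the case raised in the comments above, where the column is already a plain character vector of comma-joined strings, one possible way to recover separate values is `strsplit()`. This is a minimal sketch, assuming no individual value contains a comma; the names `united` and `parts` are made up for illustration.

```r
# Comma-joined strings, similar to what unite() produced in the question.
united <- c("beginner1,beginner2,c++,Dec", "beginner3, May", "intermediate2")

# Split on literal commas, then trim the stray spaces around each piece.
parts <- strsplit(united, ",", fixed = TRUE)
parts <- lapply(parts, trimws)

lengths(parts)   # 4 2 1
parts[[2]]       # "beginner3" "May"
```

    The resulting list has one character vector per row, so it can be assigned to a data frame as a list column.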
    【Solution 2】:

    Does this help?

    A<-data.frame(a=c("a","b","c"),b=c("a","b","c"),c=c("a","b","c"))
    
    apply(A,2,paste,collapse=",")
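
    The second argument of `apply()` decides whether the strings are pasted down each column or across each row. The sketch below contrasts the two; it uses distinct letters per column (as in the follow-up comment) so the difference is visible, and the names `by_col` and `by_row` are made up for illustration.

```r
A <- data.frame(a = c("a", "b", "c"),
                b = c("d", "e", "f"),
                c = c("g", "h", "i"))

# MARGIN = 2 pastes down each column: one string per column.
by_col <- apply(A, 2, paste, collapse = ",")

# MARGIN = 1 pastes across each row: one string per row.
by_row <- apply(A, 1, paste, collapse = ",")

unname(by_col)   # "a,b,c" "d,e,f" "g,h,i"
unname(by_row)   # "a,d,g" "b,e,h" "c,f,i"
```

    Either way the result is one pasted string per column or row, not separate values within a cell.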
    

    【Comments】:

    • apply creates 3 columns, but I want to put them into one column...
    • Try A<-data.frame(a=c("a","b","c"),b=c("d","e","f"),c=c("g","h","i")) followed by apply(A,1,paste,collapse=","), with collapse=","