【Title】: Subset specific rows by first two strings
【Posted】: 2020-11-07 14:14:10
【Question】:

I have a data frame:

ID        value
1    package ‘glue’ was built under R version 3.6.2
2    Attaching package: ‘data.table’
3    Attaching package: ‘purrr’
4    Attaching package: ‘dplyr’
5    The following object is masked from ‘package:dplyr’
6    package ‘dplyr’ was built under R version 3.6.2 

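If a value starts with "Attaching package:" (rows 2, 3 and 4), I want to cut off everything after the second word and keep only unique values. So the desired result is: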

ID        value
1    package ‘glue’ was built under R version 3.6.2
2    Attaching package:
5    The following object is masked from ‘package:dplyr’
6    package ‘dplyr’ was built under R version 3.6.2 

How can I do this? I know I should use a regular expression:

\\S+\\s+\\S+

But how do I do the rest?

【Comments】:

    Tags: r string dataframe


    【Solution 1】:
    df$value <- sub("(Attaching package:)(.*)", "\\1", df$value)
    

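    Define the part you want to keep as the first capture group and refer to it as \\1 in the replacement argument of sub. The second group, (.*), matches the rest of the string; because the replacement does not reference it, that part is effectively removed.

    To remove the duplicates, use duplicated: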

    df <- df[!duplicated(df$value),] 
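
The two steps above can be put together as a self-contained sketch. The data frame is rebuilt from the question, with straight quotes in place of the curly ones for simplicity. The `^` anchor is an addition that is not in the answer's pattern; it is assumed here so that only values that begin with the prefix are truncated.

```r
# Sample data reconstructed from the question (straight quotes are an assumption)
df <- data.frame(
  ID = 1:6,
  value = c(
    "package 'glue' was built under R version 3.6.2",
    "Attaching package: 'data.table'",
    "Attaching package: 'purrr'",
    "Attaching package: 'dplyr'",
    "The following object is masked from 'package:dplyr'",
    "package 'dplyr' was built under R version 3.6.2"
  ),
  stringsAsFactors = FALSE
)

# Keep only group 1 (the prefix); "^" limits the match to the start of the string
df$value <- sub("^(Attaching package:).*", "\\1", df$value)

# Keep the first occurrence of each value and drop later repeats
df <- df[!duplicated(df$value), ]
print(df)
```

With this input, rows 3 and 4 become duplicates of row 2 after the substitution, so rows 1, 2, 5 and 6 should remain.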
    

    【Comments】:

      【Solution 2】:

      Here is a solution using the rebus package:

      library(dplyr)
      library(rebus)
      library(stringi)
      
      df <- dplyr::tibble(ID = c(1:6),
                          value =c("package ‘glue’ was built under R version 3.6.2","Attaching package: ‘data.table’","Attaching package: ‘purrr’","Attaching package: ‘dplyr’","The following object is masked from ‘package:dplyr’","package ‘dplyr’ was built under R version 3.6.2" ))
      
      pattern <- START %R% "Attaching package:"
      
      df %>% 
        dplyr::mutate(NEW = ifelse(stringi::stri_detect_regex(value, pattern), "Attaching package", value)) %>% 
        dplyr::distinct(NEW)
      
        NEW                                                
        <chr>                                              
      1 package ‘glue’ was built under R version 3.6.2     
      2 Attaching package                                  
      3 The following object is masked from ‘package:dplyr’
      4 package ‘dplyr’ was built under R version 3.6.2 
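
The output above drops the ID column and the trailing colon, so it differs slightly from the result the question asks for. A possible variant, sketched here without rebus, keeps both. It assumes that base R's `startsWith` is an acceptable stand-in for the rebus pattern and that `distinct(..., .keep_all = TRUE)` is used to retain the first row of each group; the name `res` is only illustrative.

```r
library(dplyr)

# Sample data from the question (straight quotes are an assumption)
df <- tibble(
  ID = 1:6,
  value = c(
    "package 'glue' was built under R version 3.6.2",
    "Attaching package: 'data.table'",
    "Attaching package: 'purrr'",
    "Attaching package: 'dplyr'",
    "The following object is masked from 'package:dplyr'",
    "package 'dplyr' was built under R version 3.6.2"
  )
)

res <- df %>%
  # Replace matching values with the bare prefix, colon included
  mutate(value = ifelse(startsWith(value, "Attaching package:"),
                        "Attaching package:", value)) %>%
  # Keep the first row for each distinct value, along with its ID
  distinct(value, .keep_all = TRUE)

print(res)
```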
      

      【Comments】:

        【Solution 3】:

        Does this work:

        library(dplyr)
        df %>% mutate(value = gsub('(^Attaching package:).*', '\\1',value)) %>% 
        group_by(value) %>% filter(ID == min(ID))
        # A tibble: 4 x 2
        # Groups:   value [4]
             ID value                                              
          <dbl> <chr>                                              
        1     1 package ‘glue’ was built under R version 3.6.2     
        2     2 Attaching package:                                 
        3     5 The following object is masked from ‘package:dplyr’
        4     6 package ‘dplyr’ was built under R version 3.6.2    
         
        

        Using stringr, either str_remove with a lookbehind or str_replace with a capture group:

        library(stringr)
        df %>% mutate(value = str_remove(value, '(?<=Attaching package:).*')) %>% group_by(value) %>% filter(ID == min(ID))
        # A tibble: 4 x 2
        # Groups:   value [4]
             ID value                                              
          <dbl> <chr>                                              
        1     1 package ‘glue’ was built under R version 3.6.2     
        2     2 Attaching package:                                 
        3     5 The following object is masked from ‘package:dplyr’
        4     6 package ‘dplyr’ was built under R version 3.6.2  
        
        df %>% mutate(value = str_replace(value, '(^Attaching package:).*', '\\1')) %>% group_by(value) %>% filter(ID == min(ID))
        # A tibble: 4 x 2
        # Groups:   value [4]
             ID value                                              
          <dbl> <chr>                                              
        1     1 package ‘glue’ was built under R version 3.6.2     
        2     2 Attaching package:                                 
        3     5 The following object is masked from ‘package:dplyr’
        4     6 package ‘dplyr’ was built under R version 3.6.2    
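
The question also mentions the generic pattern \\S+\\s+\\S+ for "the first two words". As a rough sketch of how that pattern could be used instead of hard-coding the prefix in the replacement, the base R example below truncates only the matching rows and then deduplicates. The vector `value` and the names `is_attach` and `result` are illustrative, and the sample strings use straight quotes.

```r
value <- c(
  "package 'glue' was built under R version 3.6.2",
  "Attaching package: 'data.table'",
  "Attaching package: 'purrr'",
  "The following object is masked from 'package:dplyr'"
)

# Rows to truncate: those that begin with the fixed prefix
is_attach <- startsWith(value, "Attaching package:")

# Capture the first two whitespace-separated words and drop the rest
value[is_attach] <- sub("^(\\S+\\s+\\S+).*", "\\1", value[is_attach])

# Keep each distinct string once, in order of first appearance
result <- unique(value)
print(result)
```

Because \\S+ matches any run of non-whitespace characters, the colon stays attached to the second word.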
        

        【Comments】:

        • @french_fries, I have added a stringr solution; please let me know if it works for you.