【Title】: Extracting strings from links using regex in R
【Posted】: 2020-12-16 23:26:35
【Question】:

I have a list of URLs, and I want to extract parts of each string and save them in separate variables. The sample data are as follows:

sample <- c("http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
            "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf",
            "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf",
            "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf",
            "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf",
            "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf",
            "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf",
            "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf",
            "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf",
            "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf")

sample

 [1] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf"
 [2] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr02f2001.pdf"
 [3] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr03f2002.pdf"
 [4] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr04f2004.pdf"
 [5] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr05f2005.pdf"
 [6] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr06f2018.pdf"
 [7] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr07f2016.pdf"
 [8] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr08f2015.pdf"
 [9] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr09f2020.pdf"
[10] "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"

I want to extract the week and the year from each URL using a regular expression:

     week year
1     1 2009
2     2 2001
3     3 2002
4     4 2004
5     5 2005
6     6 2018
7     7 2016
8     8 2015
9     9 2020
10   10 2014

【Comments】:

    Tags: r regex string split stringi


    【Solution 1】:

    You can use str_match to capture the digits that follow 'owgr' and 'f':

    library(stringr)
    str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]
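
For readers without stringr, the same capture-group idea can be sketched in base R with regexec() and regmatches(). This is only an illustration: `urls` is a hypothetical two-element stand-in for the question's `sample` vector, and the sketch assumes every URL matches the pattern.

```r
# `urls` is a hypothetical stand-in for the question's `sample` vector.
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
)

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into one character vector per URL.
hits <- regmatches(urls, regexec("owgr([0-9]+)f([0-9]+)", urls))

# Stack the per-URL vectors into a matrix and drop column 1 (the full match),
# mirroring the `[, -1]` step of the str_match() call.
groups <- do.call(rbind, hits)[, -1, drop = FALSE]
groups
```

The "owgr" directory earlier in the path is not matched because it is followed by "/" rather than a digit. If some URL did not match at all, its entry in `hits` would be empty and the rbind() step would no longer line up, so that case would need separate handling.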
    

    You can then convert the result to a data frame, change the column classes to numeric, and assign column names.

    # The first capture group is the week, the second is the year
    setNames(type.convert(data.frame(
              str_match(sample, 'owgr(\\d+)f(\\d+)')[, -1]), as.is = TRUE),
             c('week', 'year'))
    
    #   week year
    #1     1 2009
    #2     2 2001
    #3     3 2002
    #4     4 2004
    #5     5 2005
    #6     6 2018
    #7     7 2016
    #8     8 2015
    #9     9 2020
    #10   10 2014
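
The three steps above (capture, convert to numeric, name the columns) can also be done in one call with strcapture() from the utils package, which ships with R. The sketch below is an alternative to the answer's code rather than a replacement for it; `urls` and `parsed` are hypothetical names chosen for illustration.

```r
# `urls` is a hypothetical stand-in for the question's `sample` vector.
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
)

# `proto` is an empty data frame whose column names and types describe the
# desired output: one column per capture group, in the order of the groups.
parsed <- strcapture(
  "owgr([0-9]+)f([0-9]+)",
  urls,
  proto = data.frame(week = integer(), year = integer())
)
parsed
```

Because the prototype columns are integers, the leading zero in "01" disappears during conversion. Inputs that do not match the pattern come back as NA rows instead of raising an error.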
    

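    Another approach is to extract all the numbers from the last part of each element of sample. basename returns that last part (the file name), so the 2009 in the directory path is not picked up.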

    str_extract_all(basename(sample), '\\d+', simplify = TRUE)
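
To see why basename() matters here, the following base R sketch compares extracting digits from a full URL with extracting them from the file name only. It uses gregexpr() and regmatches() instead of str_extract_all(), and `urls` is a hypothetical stand-in for `sample`.

```r
# `urls` is a hypothetical stand-in for the question's `sample` vector.
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
)

# On the full URL, "[0-9]+" also matches the "2009" directory in the path.
all_digits <- regmatches(urls[1], gregexpr("[0-9]+", urls[1]))[[1]]

# basename() strips the directories and keeps only the file name.
files <- basename(urls)

# On the file names, only the week and the year remain.
file_digits <- regmatches(files, gregexpr("[0-9]+", files))
```

Here `all_digits` has three elements for the first URL, while each element of `file_digits` has exactly two, which is what makes the simplify = TRUE matrix in the stringr call come out with two columns.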
    

    【Comments】:

      【Solution 2】:

      You can also try another approach, using lookarounds:

      library(dplyr)
      library(stringr)
      df <- data.frame(sample)
      # Lookarounds match only the digits; as.integer() drops the leading zeros
      df2 <- df %>%
        transmute(week = as.integer(str_extract(sample, "(?<=wgr)\\d{1,2}(?=f)")),
                  year = as.integer(str_extract(sample, "(?<=f)\\d{4}(?=\\.pdf)")))
      df2
      
      #    week year
      # 1     1 2009
      # 2     2 2001
      # 3     3 2002
      # 4     4 2004
      # 5     5 2005
      # 6     6 2018
      # 7     7 2016
      # 8     8 2015
      # 9     9 2020
      # 10   10 2014
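
The lookbehind/lookahead patterns also work in base R, provided perl = TRUE is passed so that the PCRE engine is used. A minimal sketch, assuming every URL matches both patterns; `urls` and `result` are hypothetical names.

```r
# `urls` is a hypothetical stand-in for the question's `sample` vector.
urls <- c(
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr01f2009.pdf",
  "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr10f2014.pdf"
)

# (?<=owgr) requires "owgr" just before the match and (?=f) requires "f" just
# after it; neither is part of the returned text.
week <- regmatches(urls, regexpr("(?<=owgr)[0-9]{1,2}(?=f)", urls, perl = TRUE))

# Same idea for the year: four digits between "f" and ".pdf".
year <- regmatches(urls, regexpr("(?<=f)[0-9]{4}(?=\\.pdf)", urls, perl = TRUE))

result <- data.frame(week = as.integer(week), year = as.integer(year))
result
```

One caveat: regmatches() silently drops elements that do not match, so with irregular input `week` and `year` could end up with different lengths. str_extract() returns NA in that case, which keeps the rows aligned.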
      

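      【Comments】:

        【Solution 3】:

        You can use {unglue}. Note that the values are returned as character strings, so the weeks keep their leading zeros: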

        library(unglue)
        
        unglue_data(
          sample,
          "http://dps.endavadigital.net/owgr/doc/content/archive/2009/owgr{week}f{year}.pdf")
        #>    week year
        #> 1    01 2009
        #> 2    02 2001
        #> 3    03 2002
        #> 4    04 2004
        #> 5    05 2005
        #> 6    06 2018
        #> 7    07 2016
        #> 8    08 2015
        #> 9    09 2020
        #> 10   10 2014
        

        【Comments】:
