【Question title】: Extracting words of a specific length in R using regular expressions
【Posted】: 2012-12-10 08:25:12
【Question】:

I have code like this (I got it from here):

m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

x<- gsub("\\<[a-z]\\{4,10\\}\\>","",m)
x

I also tried other approaches, such as:

m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

x<- gsub("[^(\\b.{4,10}\\b)]","",m)
x

I need to remove words whose length is less than 4 or greater than 10. Where am I going wrong?

【Comments】:

    Tags: regex string r


    【Solution 1】:
      gsub("\\b[a-zA-Z0-9]{4,10}\\b", "", m) 
     "! # is gr8. I  likewhatishappening ! The  of   is ! the aforementioned  is ! #Wow"
    

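    Let me explain the parts of the regular expression:

    1. \b matches at a position called a "word boundary". The match itself is zero-length.
    2. [a-zA-Z0-9]: alphanumeric characters
    3. {4,10}: {min,max} number of repetitions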
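The three pieces above can be seen working together in a small extraction sketch. It reuses the string `m` from the question; `perl = TRUE` and the variable name `hits` are assumptions made for this illustration, not part of the original answer.

```r
m <- "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

# Extract (rather than delete) every run of 4 to 10 alphanumerics
# that has a word boundary on both sides.
# perl = TRUE is an assumption made here to get PCRE semantics for \\b.
hits <- regmatches(m, gregexpr("\\b[a-zA-Z0-9]{4,10}\\b", m, perl = TRUE))[[1]]
print(hits)
```

Words such as "likewhatishappening" are skipped entirely, because after 10 characters the engine is still in the middle of a word and the closing \b cannot match.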

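    If you want the negation of this, you put the pattern between () and use \\1 as the replacement. Note that this puts every match back as it was, so the string is returned unchanged:

    gsub("(\\b[a-zA-Z0-9]{4,10}\\b)", "\\1", m) 
    

    "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

    Interestingly, 4-letter words are matched by both regular expressions.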

    【Comments】:

    • @jackStinger You said: "I need to remove words whose length is less than 4 or greater than 10. Where am I going wrong?"
    【Solution 2】:
    # starting string
    m <- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")
    
    # remove punctuation (optional)
    v <- gsub("[[:punct:]]", " ", m)
    
    # split into distinct words
    w <- strsplit( v , " " )
    
    # calculate the length of each word
    x <- nchar( w[[1]] )
    
    # keep only words with length 4, 5, 6, 7, 8, 9, or 10
    y <- w[[1]][ x %in% 4:10 ]
    
    # string 'em back together
    z <- paste( unlist( y ), collapse = " " )
    
    # voila
    z
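
The same split-and-filter steps can be wrapped in a small helper. This is only a sketch: the function name `keep_words_by_length` and its parameters are hypothetical, and it splits on runs of whitespace so that no empty strings are produced.

```r
# Hypothetical helper: keep only words whose length lies in [min_len, max_len].
keep_words_by_length <- function(text, min_len = 4, max_len = 10) {
  # replace punctuation with spaces, as in the steps above
  cleaned <- gsub("[[:punct:]]", " ", text)
  # split on one or more whitespace characters
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]
  # keep words inside the requested length range
  kept <- words[nchar(words) >= min_len & nchar(words) <= max_len]
  paste(kept, collapse = " ")
}

m <- "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"
result <- keep_words_by_length(m)
print(result)
```

Because the punctuation is stripped first, "#London" is counted as the 6-letter word "London" and "gr8." as the 3-character word "gr8".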
    

    【Comments】:

      【Solution 3】:
      gsub(" [^ ]{1,3} | [^ ]{11,} "," ",m)
      [1] "Hello! #London gr8. really here! alcomb Mount Everest excellent! aforementioned
           place amazing! #Wow"
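
In the output above "aforementioned" survives because every match consumes the space on both sides: once " the " has been replaced, the space needed in front of "aforementioned" is already gone. One possible workaround, sketched here under the assumption that PCRE lookarounds (`perl = TRUE`) are acceptable, is to test for the neighbouring spaces without consuming them.

```r
m <- "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

# (?<![^ ]) : preceded by a space or the start of the string
# (?![^ ])  : followed by a space or the end of the string
# Neither lookaround consumes the space, so adjacent tokens can both match.
x <- gsub("(?<![^ ])(?:[^ ]{1,3}|[^ ]{11,})(?![^ ])", "", m, perl = TRUE)

# Collapse the leftover runs of spaces and trim both ends.
x <- gsub("^ +| +$", "", gsub(" +", " ", x))
print(x)
```

As in the original pattern, a token here is anything between spaces, so punctuation counts toward the length ("gr8." has 4 characters and is kept).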
      

      【Comments】:

        【Solution 4】:

        This may get you started:

        m <- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")
        y <- gsub("\\b[a-zA-Z0-9]{1,3}\\b", "", m) # replace words shorter than 4
        y <- gsub("\\b[a-zA-Z0-9]{11,}\\b", "", y) # replace words longer than 10
        y <- gsub("\\s+\\.\\s+ ", ". ", y) # replace stray dots, eg "Foo  .  Bar" -> "Foo. Bar"
        y <- gsub("\\s+", " ", y) # replace multiple spaces with one space
        y <- gsub("#\\b+", "", y) # remove leftover hash characters from hashtags
        y <- gsub("^\\s+|\\s+$", "", y) # remove leading and trailing whitespaces
        y
        # [1] "Hello! London. really here! alcomb Mount Everest excellent! place amazing!"
        

        【Comments】:

          【Solution 5】:

          Based on the answers from Alaxender & agstudy:

          x<- gsub("\\b[a-zA-Z0-9]{1,3}\\b|\\b[a-zA-Z0-9]{11,}\\b", "", m)
          

          It works now!

          Thanks a lot, guys!
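
A quick way to check the length limits of such an alternation is to run it on words of known length. The sketch below assumes the goal stated in the question (drop words shorter than 4 or longer than 10), so the upper branch uses {11,}; a {10,} quantifier would also delete 10-character words. The test string and `perl = TRUE` are assumptions made for this illustration.

```r
# Hypothetical test string with words of length 3, 4, 10 and 11.
s <- "abc abcd abcdefghij abcdefghijk"

# Remove words of 1 to 3 alphanumerics, or of 11 and more.
# perl = TRUE is an assumption made here for PCRE word-boundary semantics.
x <- gsub("\\b[a-zA-Z0-9]{1,3}\\b|\\b[a-zA-Z0-9]{11,}\\b", "", s, perl = TRUE)

# Collapse the leftover spaces and trim both ends.
x <- gsub("^ +| +$", "", gsub(" +", " ", x))
print(x)
```

Only the 4-character and 10-character words remain, which confirms that both ends of the 4 to 10 range are inclusive.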

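          【Comments】:

            【Solution 6】:

            I'm not familiar with R and don't know which character classes or other features it supports in regex patterns. Without them, the pattern would look like this: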

            [^a-zA-Z0-9]([a-zA-Z0-9]{1,3}|[a-zA-Z0-9]{11,})[^a-zA-Z0-9]
            

            【Comments】:

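            • a-Z is an invalid character range, because a comes after Z in the ASCII table. You could swap the order, but A-z also covers the punctuation between Z and a; the correct form is a-zA-Z, or use [:alpha:] (written [[:alpha:]] inside an R pattern) to match alphabetic characters.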
            • Still not right. The result is [1] "Hello! #London gr8. really here! alcomb Mount Everest excellent! aforementioned place amazing! #Wow". What is wrong with "aforementioned"?