【Question title】: Remove the string before a certain word with R
【Posted】: 2018-08-29 22:55:04
【Question】:

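I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes". Note that the number uses a comma as the thousands separator, so it is easier to treat it as a string.

I know that gsub(".* Votes", "", text) will remove everything, but how do I remove just the number? Also, how do I collapse the repeated spaces into a single space?

Thanks for any help you can provide!

Sample data: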

text <- "STATE QUESTION NO. 1                       Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee?                    558,586 Votes"

【Comments】:

    Tags: r regex gsub


    【Solution 1】:

    You can use

    text <- "STATE QUESTION NO. 1                       Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee?                    558,586 Votes"
    trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
    # => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"
    

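    See the online R demo and the online regex demo.

    Details

    • (\\s){2,} - matches 2 or more whitespace characters, capturing the last one in Group 1, which is reinserted with the \1 placeholder in the replacement pattern
    • | - or
    • \\d - a digit
    • [0-9,]* - 0 or more digits or commas
    • \\s* - 0 or more whitespace characters
    • (Votes) - Group 2 (restored in the output with the \2 placeholder): the Votes substring.

    Note that trimws removes all leading and trailing whitespace.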
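The two alternatives in this pattern can also be run as separate passes, which may be easier to read and debug. The following is a minimal sketch rather than the answer's own code; `x` is a shortened, made-up stand-in for the question's `text`, and the names `no_count` and `clean` are illustrative only.

```r
# Shortened stand-in for the question's `text` (hypothetical sample)
x <- "STATE QUESTION NO. 1     Amendment to Title 15?       558,586 Votes"

# Pass 1: drop the number (a digit followed by digits or commas) and any
# whitespace directly before "Votes", keeping the word itself via group 1
no_count <- sub("\\d[0-9,]*\\s*(Votes)", "\\1", x)

# Pass 2: collapse each run of 2+ whitespace characters into one space,
# then trim leading/trailing whitespace
clean <- trimws(gsub("\\s{2,}", " ", no_count))

print(clean)
# [1] "STATE QUESTION NO. 1 Amendment to Title 15? Votes"
```

Splitting the work this way avoids relying on an empty backreference for whichever alternative did not match, at the cost of a second pass over the string.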

    【Comments】:

    • This is almost perfect, but how would I remove "Votes" as well? Just drop the parentheses, right?
    • Use gsub("(\\s){2,}|\\d[0-9,]*\\s*Votes", "\\1", text)
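One detail worth checking with the pattern from this comment: the run of spaces in front of the number is still replaced by a single space, so the result ends with a trailing space unless it is trimmed. A small sketch, assuming a shortened, made-up stand-in `x` for the question's `text`:

```r
# Shortened stand-in for the question's `text` (hypothetical sample)
x <- "STATE QUESTION NO. 1     Amendment to Title 15?       558,586 Votes"

# Pattern from the comment: "Votes" is matched but no longer captured,
# so it is removed together with the number
raw <- gsub("(\\s){2,}|\\d[0-9,]*\\s*Votes", "\\1", x)

# The whitespace run before the number was replaced by one space,
# which is now at the end of the string; trimws() removes it
cleaned <- trimws(raw)

print(cleaned)
# [1] "STATE QUESTION NO. 1 Amendment to Title 15?"
```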
    【Solution 2】:

    The simplest way is with stringr:

    > library(stringr)
    > regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
    > str_extract(text,regexp)
    [1] "558,586 Votes"
    

    To do the same thing but extract only the number, wrap it in gsub:

    > gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
    [1] "558,586"
    

    The following base R version extracts every number that precedes the word "Votes", even if it contains commas or periods:

    > gsub('\\s+[[:alpha:]]+', '', unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) )) )
    [1] "558,586"
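If only the number is wanted, a lookahead can check for "Votes" without including it in the match, which removes the need for the outer gsub. This is a sketch of an alternative technique, not the answer's method; it assumes base R with `perl = TRUE`, and `x` is a shortened, made-up stand-in for the question's `text`.

```r
# Shortened stand-in for the question's `text` (hypothetical sample)
x <- "STATE QUESTION NO. 1     Amendment to Title 15?       558,586 Votes"

# perl = TRUE enables the lookahead (?=...), which requires optional
# whitespace and "Votes" after the number but does not consume them
m <- regexpr("\\d[0-9,]*(?=\\s*Votes)", x, perl = TRUE)

# regmatches() returns the matched substring as a character vector
count_chr <- regmatches(x, m)

print(count_chr)
# [1] "558,586"
```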
    

    If you want the label as well, drop the gsub part:

    > unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) )) 
    [1] "558,586 Votes"
    

    If you want to extract all of the numbers:

    > unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*",text) ))
    [1] "1"       "15"      "202"     "558,586"
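Since the extracted values are still strings, a follow-up step is usually needed before doing arithmetic on them. The sketch below uses a tighter pattern of my own (digits followed by optional comma-separated groups of three) rather than the answer's pattern, and it assumes the thousands separator is always a comma; `x` is a shortened, made-up stand-in for the question's `text`.

```r
# Shortened stand-in for the question's `text` (hypothetical sample)
x <- "STATE QUESTION NO. 1     Amendment to Title 15?       558,586 Votes"

# Match a run of digits, optionally followed by ",ddd" groups
nums_chr <- regmatches(x, gregexpr("\\d+(,\\d{3})*", x))[[1]]

# Strip the thousands separators, then convert to numeric
nums <- as.numeric(gsub(",", "", nums_chr, fixed = TRUE))

print(nums_chr)
# [1] "1"       "15"      "558,586"
print(nums)
```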
    

    【Comments】:
