【Question title】: Clean string using gsub and multiple conditions
【Posted】: 2020-11-17 20:20:51
【Question description】:

I have already looked at this, but it is not what I need:


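Situation: I want to clean strings using gsub. These are my conditions:

  1. Keep only words (no digits or "weird" symbols)
  2. Keep words that are joined by exactly one of ' - _ $ . For example: don't, re-loading, come_home, something$col
  3. Keep specific names such as package::function or package::function()

So I have the following: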

  1. [^A-Za-z]
  2. ([a-z]+)(-|'|_|$)([a-z]+)
  3. ([a-z]+(_*)[a-z]+)(::)([a-z]+(_*)[a-z]+)(\(\))*

Example:

If I have the following:

# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't
# Needs to handle NA for desc::desc_get()
# Update href of toc anchors , use "-" instead "."
# Keep something$col or here_you::must_stay

I want

Re-loading pkgdown while it's running causes weird behaviour with the context cache don't
Needs to handle NA for desc::desc_get()
Update href of toc anchors use instead
Keep something$col or here_you::must_stay

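Problems: I have a few:

A. The second expression does not work properly. At the moment it only works for - and '

B. How can I combine all of this into a single gsub in R? I would like to do something like gsub(myPatterns, myText), but I don't know how to fix and combine all of these.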

【Comments on the question】:

  • Try trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)). See the regex demo
  • This works like a charm! Could you post it as an answer?

Tags: r regex gsub


【Solution 1】:

You can use

trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))

See the regex demo. Alternatively, to also collapse runs of whitespace into a single space, use

trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

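Details

  • (?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F): matches either of two patterns and then discards the match:
    • \w+::\w+(?:\(\))? - one or more word characters, ::, one or more word characters, and an optional () substring
    • | - or
    • \p{L}+ - one or more Unicode letters
    • (?:[-'_$]\p{L}+)* - zero or more repetitions of -, ', _ or $ followed by one or more Unicode letters
  • (*SKIP)(*F) - discards the text matched so far and resumes the search from the position where the failure occurred
  • | - or
  • [^\p{L}\s] - any character other than a Unicode letter or whitespace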
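
To see what the (*SKIP)(*F) branch contributes, here is a minimal sketch that compares the bare negated class with the protected version on a made-up string. The names x, plain and protected are illustrative and not part of the original answer; \p{L} needs perl = TRUE.

```r
# Made-up input: one hyphenated word plus a token containing a digit
x <- "re-loading v2"

# Without the protecting branch, every non-letter, non-space character is removed,
# including the hyphen inside the word
plain <- gsub("[^\\p{L}\\s]", "", x, perl = TRUE)
plain
# [1] "reloading v"

# With the protecting branch, letter runs joined by one of - ' _ $ are matched first,
# then thrown away by (*SKIP)(*F), so the hyphen inside them is never offered
# to the second alternative
protected <- gsub("\\p{L}+(?:[-'_$]\\p{L}+)*(*SKIP)(*F)|[^\\p{L}\\s]", "", x, perl = TRUE)
protected
# [1] "re-loading v"
```

The digit is still removed in both cases because it is never part of a protected match.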

R demo

myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

Output:

[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"                                                  
[3] "Update href of toc anchors use instead"                                                   
[4] "Keep something$col or here_you::must_stay"    
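
Regarding question A, a likely reason the second expression from the question misses something$col is that an unescaped $ inside an alternation is the end-of-string anchor rather than a literal dollar sign, and . is not listed at all. Inside a character class both are literal. A small sketch, assuming the question's pattern is used exactly as written (x, alt_pattern and class_pattern are illustrative names):

```r
x <- c("something$col", "re-loading", "come_home", "don't")

# Pattern 2 from the question: here `$` acts as an end-of-string anchor
alt_pattern <- "([a-z]+)(-|'|_|$)([a-z]+)"

# Same idea with a character class: `$` and `.` are literal inside [...]
class_pattern <- "([a-z]+)([-'_$.])([a-z]+)"

alt_hits <- grepl(alt_pattern, x)
class_hits <- grepl(class_pattern, x)

alt_hits
# [1] FALSE  TRUE  TRUE  TRUE
class_hits
# [1] TRUE TRUE TRUE TRUE
```

This only explains the separator problem. Combining the three conditions into one call still needs something like the (*SKIP)(*F) pattern above.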

【Comments】:

  • Now that is a pattern ++
【Solution 2】:

Alternatively,

txt <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't", 
         "# Needs to handle NA for desc::desc_get()",
         "# Update href of toc anchors , use \"-\" instead \".\"", 
         "# Keep something$col or here_you::must_stay")
expect <- c("Re-loading pkgdown while it's running causes weird behaviour with the context cache don't",
            "Needs to handle NA for desc::desc_get()",
            "Update href of toc anchors use instead",
            "Keep something$col or here_you::must_stay")

leadspace <- grepl("^ ", txt)
gre <- gregexpr("\\b(\\s?[[:alpha:]]*(::|[-'_$.])?[[:alpha:]]*(\\(\\))?)\\b", txt)
regmatches(txt, gre, invert = TRUE) <- ""
txt[!leadspace] <- gsub("^ ", "", txt[!leadspace])
identical(expect, txt)
# [1] TRUE

【Comments】:
