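Regex for preserving case pattern, capitalization: answers

【Question title】: Regex for preserving case pattern, capitalization
【Posted】: 2014-11-28 01:28:50
【Question】:

Is there a regex that preserves the case pattern, in the vein of \U and \L?

In the example below, I want to convert "date" to "month" while keeping the capitalization used in the input: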

   from        to
  "date" ~~> "month"
  "Date" ~~> "Month"
  "DATE" ~~> "MONTH"

I currently do this with three nested calls to sub.

input <- c("date", "Date", "DATE")
expected.out <- c("month", "Month", "MONTH")

sub("date", "month", 
  sub("Date", "Month", 
    sub("DATE", "MONTH", input)
  )
)

The goal is to have a single pattern and a single replacement, such as

gsub("(date)", "\\Umonth", input, perl=TRUE)

that would produce the desired output.

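【Comments】:

Why not use a map via a named vector: map <- setNames(expected.out, input). Then do month <- map[date].
@flodel - clever idea - there is really no need for any regex here.
@flodel -- I suspect Ricardo also wants a solution that works on inputs like input <- "Here are a date, a Date, and a DATE"
Yes, exactly, @JoshO'Brien. Flodel, I tried to simplify the example for the sake of the question. Perhaps I oversimplified.
My gut says you can't do it with a single regex; use a for loop or Reduce.

Tags: regex r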

【Solution 1】:

This is one of those cases where I think a for loop is justified:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")

for(i in seq_along(pat)) { input <- gsub(pat[i],ret[i],input) }
input
#[1] "Here are a month, a Month, and a MONTH" 
#[2] "Here are a month, a Month, and a MONTH"

And an alternative, courtesy of @flodel, that implements the same logic as the loop through Reduce:

Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input)

For some benchmarks of these options, see @TylerRinker's answer.
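
The loop above needs pat and ret typed out by hand. As a minimal sketch, the three case variants could instead be derived from a single word pair. The helper names case_variants and replace_keep_case are hypothetical (they are not part of the original answer), and the sketch only covers the all-lower, capitalized and all-upper forms shown in the question:

```r
# Hypothetical helper: build the lower, Capitalized and UPPER forms of a word.
case_variants <- function(word) {
  c(tolower(word),
    paste0(toupper(substring(word, 1, 1)), tolower(substring(word, 2))),
    toupper(word))
}

# Hypothetical helper: same loop as in the answer, with derived patterns.
# fixed = TRUE treats each pattern as a literal string, not a regex.
replace_keep_case <- function(x, from, to) {
  pat <- case_variants(from)
  ret <- case_variants(to)
  for (i in seq_along(pat)) {
    x <- gsub(pat[i], ret[i], x, fixed = TRUE)
  }
  x
}

result <- replace_keep_case("Here are a date, a Date, and a DATE", "date", "month")
print(result)
```

Mixed forms such as "dAtE" are left untouched by this sketch, since they match none of the three variants.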

【Solution 2】:

Using the gsubfn package, you can avoid nested sub calls and do this in a single call.

library(gsubfn)
x <- 'Here we have a date, a different Date, and a DATE'
gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), x, ignore.case=TRUE)
# [1] "Here we have a month, a different Month, and a MONTH"

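【Discussion】:

The replacement argument in the gsubfn() call is a list of three replacements that depend on the case of "date". But can you explain why list(...) is understood by R as a function that does the replacing? Sorry if this is unclear. Maybe you could explain what that call is doing. Thanks.
@lawyeR - because of ignore.case=TRUE, the function matches the pattern date against any of date, Date or DATE, and then looks up the matched text in the replacement list. So if it matched Date, it grabs list(..)[["Date"]], which is Month in this case.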
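
The lookup described in the comment above can be imitated in base R, which may make the mechanism easier to see. This is only a sketch of the idea and not how gsubfn is implemented: it finds every case-insensitive match with gregexpr, then replaces each match with the entry of a named vector that has the same name. The names lookup, m and out are illustrative.

```r
x <- "Here we have a date, a different Date, and a DATE"

# Named vector playing the role of the replacement list: name = matched text.
lookup <- c(date = "month", Date = "Month", DATE = "MONTH")

# Locate every match of "date", ignoring case.
m <- gregexpr("date", x, ignore.case = TRUE)

# Work on a copy, and swap each matched string for its lookup entry.
out <- x
regmatches(out, m) <- lapply(regmatches(out, m),
                             function(hit) unname(lookup[hit]))
print(out)
```

A match that has no entry in lookup (for example "dAtE") would yield NA here, so a real implementation would need a fallback for such cases.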

【Solution 3】:

Here is a qdap approach. It is very simple, but not the fastest:

input <- rep("Here are a date, a Date, and a DATE",2)
pat <- c("date", "Date", "DATE")
ret <- c("month", "Month", "MONTH")


library(qdap)
mgsub(pat, ret, input)

## [1] "Here are a month, a Month, and a MONTH"
## [2] "Here are a month, a Month, and a MONTH"

Benchmarks:

input <- rep("Here are a date, a Date, and a DATE",1000)

library(gsubfn)
library(microbenchmark)

(op <- microbenchmark( 
    GSUBFN = gsubfn('date', list('date'='month','Date'='Month','DATE'='MONTH'), 
             input, ignore.case=TRUE),
    QDAP = mgsub(pat, ret, input),
    REDUCE = Reduce(function(str, args) gsub(args[1], args[2], str), 
       Map(c, pat, ret), init = input),
    # a braced expression rather than a function definition, so the loop is
    # actually run; it works on a copy so `input` is not overwritten between runs
    FOR = {
       out <- input
       for(i in seq_along(pat)) { 
          out <- gsub(pat[i],ret[i],out) 
       }
       out
    },
    times=100L))

## Unit: milliseconds
##    expr        min         lq     median         uq        max neval
##  GSUBFN 682.549812 815.908385 847.361883 925.385557 1186.66743   100
##    QDAP  10.499195  12.217805  13.059149  13.912157   25.77868   100
##  REDUCE   4.267602   5.184986   5.482151   5.679251   28.57819   100
##     FOR   4.244743   5.148132   5.434801   5.870518   10.28833   100

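【Discussion】:

I want to select this as the answer just for the benchmarks :)
The qdap approach is slower because it reorders the patterns so that the longer (more characters) patterns are substituted first, which makes them less likely to be clobbered by a shorter substitution that runs earlier. If that doesn't make sense, just be aware that there are safeguards built in.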
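
To illustrate why the substitution order can matter, here is a small base R sketch with two overlapping patterns. It is not qdap's actual implementation, only the general idea of applying longer patterns first. The helper replace_in_order and the example words are made up for the illustration.

```r
input <- "a date and two dates"
pat <- c("date", "dates")     # "date" is a prefix of "dates"
ret <- c("month", "periods")

# Hypothetical helper: apply literal substitutions in the given order.
replace_in_order <- function(x, pat, ret) {
  for (i in seq_along(pat)) {
    x <- gsub(pat[i], ret[i], x, fixed = TRUE)
  }
  x
}

# Given order: "date" runs first and eats the start of "dates".
naive <- replace_in_order(input, pat, ret)

# Longest pattern first: "dates" is handled before "date" can touch it.
ord <- order(nchar(pat), decreasing = TRUE)
longest_first <- replace_in_order(input, pat[ord], ret[ord])

print(naive)
print(longest_first)
```

In the given order the result is "a month and two months", while the longest-first order gives "a month and two periods".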