Splitting on specific characters after regex: answers

【Question title】: Splitting on specific characters after regex
【Posted】: 2019-02-16 01:54:00
【Question description】:

I have a string that looks like this:

s = "discount rates of 5% to 10%, and growth rates of 2% to 3%"

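I want to split the string on the character that follows the first range, so in this case that is the comma after "10%". The output would look like this: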

s = c("discount rates of 5% to 10%", " and growth rates of 2% to 3%")

The regular expression I use to extract the ranges is:

(\\$*\\d*\\.\\d+[%x] (to|and) \\$*\\d*\\.\\d+[%x])

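So far it has worked well (some ranges end in "x" rather than "%"), but I don't want to split on that regex itself: I need to split on the character after it. If it's easier, I could also split on the nearest space, so the output would look like this:

s = c("discount rates of 5% to 10%,", "and growth rates of 2% to 3%")

The reason I want to split on whatever follows the regex is that I want to keep both matches (here "5% to 10%" and "2% to 3%"), but put them in separate strings.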

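【Comments】:

How are you currently splitting on this regex?
@doviod I've been using strsplit(s, "(\\$*\\d*\\.\\d+[%x] (to|and) \\$*\\d*\\.\\d+[%x])")
That doesn't work for me at all...
Not sure why, but I think I found a way to solve the problem: x = regmatches(s, gregexpr("(\\$*\\d*\\.\\d+[%x]( to | and | - |-)\\$*\\d*\\.\\d+[%x])", s)), and then I unlist x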
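A minimal sketch of the extraction described in the last comment. One caveat: `\\d*\\.\\d+` requires a literal decimal point, so the pattern as posted cannot match whole-number ranges such as "5% to 10%", which may be why it did not work for the other commenter. The variant below makes the decimal point optional with `\\.?`; that change, and the name `range_re`, are assumptions made for illustration and are not from the thread.

```r
s <- "discount rates of 5% to 10%, and growth rates of 2% to 3%"

# Pattern as posted: the mandatory "\\." means integers never match.
strict_re <- "\\$*\\d*\\.\\d+[%x]( to | and | - |-)\\$*\\d*\\.\\d+[%x]"
strict_hits <- unlist(regmatches(s, gregexpr(strict_re, s)))

# Assumed variant: "\\.?" makes the decimal point optional.
range_re <- "\\$*\\d*\\.?\\d+[%x]( to | and | - |-)\\$*\\d*\\.?\\d+[%x]"
ranges <- unlist(regmatches(s, gregexpr(range_re, s)))

print(strict_hits)  # character(0)
print(ranges)       # "5% to 10%" "2% to 3%"
```

This only extracts the ranges themselves; it drops the surrounding text, so it does not yet give the split the question asks for.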

Tags: r regex string split

【Solution 1】:

How about this:

s1 <- "discount rates of 5% to 10%, and growth rates of 2% to 3%"
s2 <- "discount rates of 5% to 10x, and growth rates of 2% to 3%"
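sub("\\s*,.*", "", s1) # text before the first comma (contains the first range)
sub(sub("\\s*,.*", "", s1), "", s1) # remainder, starting at the comma (the inner result is used as a regex pattern)
substring(sub(sub("\\s*,.*", "", s1), "", s1), 1, 1) # first character of the remainder, i.e. the delimiter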
### solution:
unlist(strsplit(s1, substring(sub(sub("\\s*,.*","", s1), "", s1), 1, 1))) # case 1
#[1] "discount rates of 5% to 10%"   " and growth rates of 2% to 3%"
unlist(strsplit(s2, substring(sub(sub("\\s*,.*","", s2), "", s2), 1, 1))) # case 2
#[1] "discount rates of 5% to 10x"   " and growth rates of 2% to 3%"

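【Comments】:

I should have been more specific: while this technically works, the strings I'm working with are much larger, and I don't want to split on every comma/space, only on those that appear right after a range.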
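One way to split only on the character that follows a range is to locate the ranges with `gregexpr` and cut the string at the position just past each match. This is a rough sketch, not a tested solution from the thread: the helper `split_after_ranges` is hypothetical, and the pattern uses `\\.?` (optional decimal point) as an assumption so that whole-number ranges match.

```r
# Hypothetical helper: split `s` on the single character that follows
# each match of `range_re`, leaving all other commas/spaces alone.
split_after_ranges <- function(s, range_re) {
  m <- gregexpr(range_re, s, perl = TRUE)[[1]]
  if (m[1] == -1L) return(s)                    # no range found
  ends <- as.integer(m) + attr(m, "match.length") - 1L
  cuts <- ends[ends < nchar(s)] + 1L            # position of the delimiter
  if (length(cuts) == 0L) return(s)
  starts <- c(1L, cuts + 1L)
  stops  <- c(cuts - 1L, nchar(s))
  pieces <- substring(s, starts, stops)
  pieces[nzchar(pieces)]                        # drop empty pieces
}

# Assumed pattern: decimal point made optional with "\\.?".
range_re <- "\\$*\\d*\\.?\\d+[%x] (to|and) \\$*\\d*\\.?\\d+[%x]"

s <- "discount rates of 5% to 10%, and growth rates of 2% to 3%"
res <- split_after_ranges(s, range_re)
print(res)

# Commas that do not follow a range are left untouched.
long_s <- "fees, net of tax, imply discount rates of 5% to 10%, and growth rates of 2% to 3%"
long_res <- split_after_ranges(long_s, range_re)
print(long_res)
```

Because the cut positions come from the match ends, the delimiter can be any character (comma, space, semicolon) without changing the code.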

【Solution 2】:

My solution may be roundabout, but it might be good enough:

ss <- gsub("(\\d+[%x],)", "\\1XX", s)
s <- unlist(strsplit(ss, split = "XX"))

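This assumes that "XX" does not actually appear anywhere in the text, so replace it with a string that is unlikely to occur. (I also simplified the regex, assuming that a number followed by a percent sign or x, followed by a comma, always marks a split point.)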
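If the placeholder string is a concern, the same simplified rule can probably be expressed without one by using a PCRE lookbehind in `strsplit(..., perl = TRUE)`. This is a sketch under the same assumption as above (a digit, then "%" or "x", then a comma, always marks a split point); the variable names are made up for illustration.

```r
s <- "discount rates of 5% to 10%, and growth rates of 2% to 3%"

# Split on a comma that directly follows a digit plus "%" or "x".
# The comma is consumed as the delimiter.
drop_comma <- unlist(strsplit(s, "(?<=\\d[%x]),", perl = TRUE))
print(drop_comma)

# Split on the space that follows such a comma, so the comma stays
# attached to the first piece.
keep_comma <- unlist(strsplit(s, "(?<=\\d[%x],) ", perl = TRUE))
print(keep_comma)
```

Both patterns match a real character (the comma or the space) rather than an empty position, which sidesteps the odd behaviour `strsplit` can show with zero-length matches.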

【Comments】: