You should be able to debug the regular expressions you write.
> as.regex(pattern2)
<regex> ([\d]+)\.\s((?:[\w]+|[\w]+\s[\w]+))\s(\d\.[\d]+)
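Plug it in at regex101 and you will see that it does not always match your strings. The explanation on the right tells you that only 1 or 2 whitespace-separated words are allowed between the dot and the number. Also, WRD (the [\w]+ pattern) does not match dots or any other characters that are not letters, digits, or _. Now you know that you need to match your strings with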
^(\d+)\.(.*?)\s*(\d\.\d{2})$
See this regex demo. Translated to Rebus:
pattern2 <- START %R% # ^ - start of string
capture(one_or_more(DGT)) %R% # (\d+) - Group 1: one or more digits
DOT %R% # \. - a dot
"(.*?)" %R% # (.*?) - Group 2: any 0+ chars as few as possible
zero_or_more(SPC) %R% # \s* - 0+ whitespaces
capture(DGT %R% DOT %R% repeated(DGT, 2)) %R% # (\d\.\d{2}) - Group 3: #.## number
END # $ - end of string
Check:
> pattern2
<regex> ^([\d]+)\.(.*?)[\s]*(\d\.[\d]{2})$
> companies <- c("612. Grt. Am. Mgt. & Inv. 7.33","77. Wickes 4.61","265. Wang Labs 8.75","9. CrossLand Savings 6.32","228. JPS Textile Group 2.00")
> str_match(companies, pattern = pattern2)
[,1] [,2] [,3] [,4]
[1,] "612. Grt. Am. Mgt. & Inv. 7.33" "612" " Grt. Am. Mgt. & Inv." "7.33"
[2,] "77. Wickes 4.61" "77" " Wickes" "4.61"
[3,] "265. Wang Labs 8.75" "265" " Wang Labs" "8.75"
[4,] "9. CrossLand Savings 6.32" "9" " CrossLand Savings" "6.32"
[5,] "228. JPS Textile Group 2.00" "228" " JPS Textile Group" "2.00"
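The same check can be reproduced without rebus or stringr. The following is a minimal base R sketch, assuming `perl = TRUE` (PCRE) behaves like the engine behind `str_match` for these two patterns; the variable names `old_pattern`, `new_pattern`, `old_hits`, `new_hits`, and `groups` are made up for illustration.

```r
# Base R only; perl = TRUE selects PCRE so \d, \w, \s and lazy quantifiers work.
companies <- c("612. Grt. Am. Mgt. & Inv. 7.33", "77. Wickes 4.61",
               "265. Wang Labs 8.75", "9. CrossLand Savings 6.32",
               "228. JPS Textile Group 2.00")

# Original pattern as printed by as.regex(): only 1 or 2 words between dot and number
old_pattern <- "([\\d]+)\\.\\s((?:[\\w]+|[\\w]+\\s[\\w]+))\\s(\\d\\.[\\d]+)"
# Revised pattern from the answer
new_pattern <- "^(\\d+)\\.(.*?)\\s*(\\d\\.\\d{2})$"

# Which strings does each pattern match?
old_hits <- grepl(old_pattern, companies, perl = TRUE)
new_hits <- grepl(new_pattern, companies, perl = TRUE)
print(data.frame(companies, old_hits, new_hits))

# One row per string: the full match followed by the three capture groups
groups <- do.call(rbind, regmatches(companies, regexec(new_pattern, companies, perl = TRUE)))
print(groups)
```

With the original pattern, the first entry (dots and `&` in the name) and the last entry (three words) are the ones that fail, which is what the regex101 explanation predicts.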
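Warning: capture(lazy(zero_or_more(ANY_CHAR))) returns the ([.]*?) pattern, which matches 0 or more dots, as few as possible, rather than any 0+ characters. This is because rebus has a bug: it wraps every repeated (one_or_more or zero_or_more) character in [ and ], a character class, and inside a character class the dot is a literal dot. That is why (.*?) is added "manually".
This can be worked around with common constructs such as [\w\W], [\s\S] or [\d\D]: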
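The effect of the bracketed dot can be seen in plain base R. This is a small sketch, not rebus output: the `buggy` string is assembled by hand on the assumption that `([.]*?)` simply takes the place of `(.*?)` in the pattern printed earlier, and all variable names are illustrative.

```r
# Hand-assembled patterns (assumption: only Group 2 differs between them)
buggy <- "^([\\d]+)\\.([.]*?)[\\s]*(\\d\\.[\\d]{2})$"  # [.] = literal dot only
fixed <- "^([\\d]+)\\.(.*?)[\\s]*(\\d\\.[\\d]{2})$"    # .   = any char except newline

name_with_text <- "265. Wang Labs 8.75"
name_with_dots <- "265... 8.75"

# [.]*? cannot consume " Wang Labs", so the buggy pattern fails on real names
buggy_on_text <- grepl(buggy, name_with_text, perl = TRUE)
fixed_on_text <- grepl(fixed, name_with_text, perl = TRUE)

# It only succeeds when nothing but dots sits between the first dot and the number
buggy_on_dots <- grepl(buggy, name_with_dots, perl = TRUE)

print(c(buggy_on_text = buggy_on_text,
        fixed_on_text = fixed_on_text,
        buggy_on_dots = buggy_on_dots))
```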
pattern2 <- START %R% # ^ - start of string
capture(one_or_more(DGT)) %R% # (\d+) - Group 1: one or more digits
DOT %R% # \. - a dot
capture( # Group 2 start:
lazy(zero_or_more(char_class(WRD, NOT_WRD))) # - [\w\W] - any 0+ chars as few as possible
) %R% # End of Group 2
zero_or_more(SPC) %R% # \s* - 0+ whitespaces
capture(DGT %R% DOT %R% repeated(DGT, 2)) %R% # (\d\.\d{2}) - Group 3: #.## number
END
Check:
> as.regex(pattern2)
<regex> ^([\d]+)\.([\w\W]*?)[\s]*(\d\.[\d]{2})$
See the regex demo.
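One difference between the two versions is worth keeping in mind: `.` does not match a line break by default, while `[\w\W]` matches any character, line breaks included. The sketch below checks this with base R and `perl = TRUE` (PCRE) only; the sample string and variable names are invented for illustration, and it does not test the engine used by stringr.

```r
# A hypothetical value that contains a line break
two_lines <- "Wang\nLabs"

# .*? stops at the newline, so the anchored pattern cannot reach the end of the string
dot_version <- grepl("^(.*?)$", two_lines, perl = TRUE)

# [\w\W]*? accepts every character, so the match runs across the line break
class_version <- grepl("^([\\w\\W]*?)$", two_lines, perl = TRUE)

print(c(dot_version = dot_version, class_version = class_version))
```

For single-line input such as the `companies` vector, both versions should capture the same groups.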