df <- structure(list(t = structure(1:2, .Label = c("v1", "v2"), class = "factor"),
d = structure(1:2, .Label = c("something[123,894]", "something[456,4834]"
), class = "factor")), .Names = c("t", "d"), row.names = c(NA,
-2L), class = "data.frame")
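This matches any character (.*) any number of times up to a literal [, then captures one or more digits (\\d+) into group \\1, closes the capture group, and matches whatever characters follow: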
df$r <- gsub('.*\\[(\\d+).*', '\\1', df$d)
# t d r
# 1 v1 something[123,894] 123
# 2 v2 something[456,4834] 456
Alternatively, this form is more useful if you also want to capture the second string of digits after the comma:
gsub('.*\\[(\\d+),(\\d+).*', '\\1', df$d)
# [1] "123" "456"
gsub('.*\\[(\\d+),(\\d+).*', '\\2', df$d)
# [1] "894" "4834"
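One thing to watch with this approach: when the pattern does not match an element at all, the replacement function returns that element unchanged rather than NA. A minimal sketch of a guard, assuming you would prefer NA for non-matching rows (the second input value is made up for illustration and is not part of the original data):

```r
# First value comes from the example data; the second is a hypothetical
# non-matching value added only to show the behaviour
x <- c("something[123,894]", "no brackets here")
pat <- ".*\\[(\\d+),(\\d+).*"

# Without a guard, the unmatched string would come back as-is
res <- sub(pat, "\\2", x)

# Keep the replacement only where the pattern actually matched
res[!grepl(pat, x)] <- NA_character_
res
```

Since the pattern is anchored by the leading and trailing .* it can match at most once per string, so sub and gsub behave the same here.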
Or if you want to do both at once:
cbind(df, do.call('rbind', lapply(strsplit(as.character(df$d), ','),
function(x) gsub('\\D', '', x))))
# t d 1 2
# 1 v1 something[123,894] 123 894
# 2 v2 something[456,4834] 456 4834
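Another way to get both numbers in one pass is base R's regexec together with regmatches, which return the capture groups directly instead of stripping non-digits. This is only a sketch on a copy of the sample values; the column names first and second are hypothetical:

```r
# Sample values mirroring the df$d column above
d <- c("something[123,894]", "something[456,4834]")

# regexec() locates the full match and each capture group;
# regmatches() turns those positions into the matched substrings
m <- regmatches(d, regexec("\\[(\\d+),(\\d+)\\]", d))

# Each list element is c(full match, group 1, group 2); drop the full match
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("first", "second")  # hypothetical column names

parts
```

The result is a character matrix with one row per input string, so it can be passed to cbind(df, parts) in the same way as the strsplit version.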
This explains it better than I could:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  \[                       '['
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  ,                        ','
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
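Note that the table shows the regex as the engine sees it, with single backslashes, while the R code above doubles them because a backslash must itself be escaped inside an R string literal. A small sketch to confirm the two forms are the same pattern:

```r
pattern <- ".*\\[(\\d+),(\\d+).*"

# cat() prints the string contents, i.e. the regex with single backslashes,
# which is the form used in the NODE column of the table
cat(pattern, "\n")

# "\\d" in R source is two characters: one backslash followed by d
n <- nchar("\\d")
n
```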