Extract all matches to a new column using regex in R: answers

【Question title】: Extract all matches to a new column using regex in R
【Posted】: 2020-04-27 11:58:46
【Question】:

In my data, I have a column of open-text field data that looks like the following example:

d <- tribble(
  ~x,
  "i am 10 and she is 50",
  "he is 32 and i am 22",
  "he may be 70 and she may be 99",
)

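I would like to use regex to extract all of the two-digit numbers into a new column called y. I have the following code, which works fine for extracting the first match: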

d %>%
  mutate(y = str_extract(x, "([0-9]{2})"))

# A tibble: 3 x 2
  x                              y    
  <chr>                          <chr>
1 i am 10 and she is 50          10   
2 he is 32 and i am 22           32   
3 he may be 70 and she may be 99 70

However, is there a way to extract both two-digit numbers into the same column, separated by some standard delimiter (e.g. a comma)?

【Comments】:

This post should help: stackoverflow.com/q/57059625/5325862. To clarify, do you only want to extract two-digit numbers?

Tags: r regex stringr dplyr

【Solution 1】:

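We can use str_extract_all instead of str_extract. str_extract returns only the first match, whereas the _all variant matches globally and returns every match in a list column. That list column can then be spread into two columns with unnest_wider: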

library(dplyr)
library(tidyr)
library(stringr)
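d %>%
    mutate(out = str_extract_all(x, "\\d{2}")) %>%
    # names_sep auto-names the unnamed matches (out_1, out_2);
    # recent tidyr releases require it for unnamed list elements
    unnest_wider(out, names_sep = "_") %>%
    rename_at(-1, ~ c('y', 'z')) %>%
    type.convert(as.is = TRUE)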
# A tibble: 3 x 3
# x                                  y     z
#  <chr>                          <int> <int>
#1 i am 10 and she is 50             10    50
#2 he is 32 and i am 22              32    22
#3 he may be 70 and she may be 99    70    99
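
The first-match versus all-matches distinction is not specific to stringr. As a minimal sketch in base R (the vector `x` and the result names are made up for illustration), `regexpr` locates only the first match per string, while `gregexpr` locates every match and gives back one list element per input string:

```r
# Hypothetical input, modelled on the question's data
x <- c("i am 10 and she is 50", "he is 32 and i am 22")

# regexpr() finds only the first match in each string
first_only <- regmatches(x, regexpr("[0-9]{2}", x))

# gregexpr() finds all matches; regmatches() returns a list,
# with one character vector per input string
all_matches <- regmatches(x, gregexpr("[0-9]{2}", x))

print(first_only)
print(all_matches)
```

Note that `regmatches` combined with `regexpr` silently drops strings that have no match, so the list form is usually the safer one to attach to a data frame.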

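If we instead need a single string column with ", " as the delimiter, then after extracting into a list we loop over the list with map and use toString (a wrapper for paste(., collapse = ", ")) to join the matches for each row into one string: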

library(purrr)
d %>%
   mutate(y = str_extract_all(x, "\\b\\d{2}\\b") %>%
                 map_chr(toString))
# A tibble: 3 x 2
#  x                              y     
#  <chr>                          <chr> 
#1 i am 10 and she is 50          10, 50
#2 he is 32 and i am 22           32, 22
#3 he may be 70 and she may be 99 70, 99
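
The second snippet wraps the pattern in `\\b` word boundaries, which matters once the text contains numbers that are not exactly two digits long. A rough base R sketch of the difference follows; the input vector and variable names are assumptions for illustration, and `perl = TRUE` is used so that `\\d` and `\\b` are interpreted by PCRE:

```r
# Hypothetical input: the second string has a 3-digit and a 1-digit number
x <- c("i am 10 and she is 50", "she is 105 and he is 7")

# Without boundaries, "\\d{2}" also matches the "10" inside "105"
loose <- regmatches(x, gregexpr("\\d{2}", x, perl = TRUE))

# With boundaries, only standalone two-digit numbers match
strict <- regmatches(x, gregexpr("\\b\\d{2}\\b", x, perl = TRUE))

# Collapse each element to one comma-separated string;
# an element with no matches becomes an empty string
y <- vapply(strict, toString, character(1))

print(loose)
print(y)
```

Rows without any match end up as "" here; whether that or NA is preferable depends on how the column is used later.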

【Comments】:

I just tried your code, and it says could not find function "unnest_wider".

【Solution 2】:

We can also use extract and unite from tidyr:

library(dplyr)
library(tidyr)

d %>%
  extract(x, c('y', 'z'), regex = "(\\d+)[^\\d]+(\\d+)", remove = FALSE)

Output:

# A tibble: 3 x 3
  x                              y     z    
  <chr>                          <chr> <chr>
1 i am 10 and she is 50          10    50   
2 he is 32 and i am 22           32    22   
3 he may be 70 and she may be 99 70    99

To return a single column:

d %>%
  extract(x, c('y', 'z'), regex = "(\\d+)[^\\d]+(\\d+)", remove = FALSE) %>%
  unite('y', y, z, sep = ', ')

Output:

# A tibble: 3 x 2
  x                              y     
  <chr>                          <chr> 
1 i am 10 and she is 50          10, 50
2 he is 32 and i am 22           32, 22
3 he may be 70 and she may be 99 70, 99
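
One thing to keep in mind with this pattern is that its two capture groups assume exactly two numbers per row. The following base R sketch (the input vector, `pattern`, and `groups` are hypothetical names, and `regexec` with `perl = TRUE` stands in for tidyr's matching) shows what the same regex captures when a row has one number or three:

```r
# Hypothetical input: rows with two, one, and three numbers
x <- c("i am 10 and she is 50", "only 42 here", "ages 11, 12 and 13")

pattern <- "(\\d+)[^\\d]+(\\d+)"

# regexec() returns the full match followed by each capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Keep the two groups, or NA when the pattern did not match at all
groups <- t(vapply(m, function(hit) {
  if (length(hit) == 3) hit[2:3] else c(NA_character_, NA_character_)
}, character(2)))
colnames(groups) <- c("y", "z")

print(groups)
```

The row with a single number does not match at all, and the row with three numbers loses its third value, so the list-based approach from the first solution is the better fit when the number of matches varies.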
