【Question title】: Extract all matches to a new column using regex in R
【Posted】: 2020-04-27 11:58:46
【Question】:

My data has a column of open-text field data similar to the following example:

library(tibble)  # provides tribble()

d <- tribble(
  ~x,
  "i am 10 and she is 50",
  "he is 32 and i am 22",
  "he may be 70 and she may be 99",
)

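I want to use regex to extract all two-digit numbers into a new column called y. I have the following code, which works well for extracting the first match: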

library(dplyr)    # provides %>% and mutate()
library(stringr)  # provides str_extract()

d %>%
  mutate(y = str_extract(x, "([0-9]{2})"))

# A tibble: 3 x 2
  x                              y    
  <chr>                          <chr>
1 i am 10 and she is 50          10   
2 he is 32 and i am 22           32   
3 he may be 70 and she may be 99 70 

However, is there a way to extract both two-digit numbers into the same column, separated by some standard delimiter such as a comma?

【Question comments】:

Tags: r regex stringr dplyr


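【Solution 1】:

We can use str_extract_all instead of str_extract. str_extract returns only the first match in each string, whereas the _all variant is global: it extracts every match and returns them in a list. That list column can then be spread back into two columns with unnest_wider.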

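library(dplyr)
library(tidyr)    # unnest_wider() requires tidyr >= 1.0.0
library(stringr)
d %>%  
    mutate(out =  str_extract_all(x, "\\d{2}")) %>% 
    # names_sep is needed in recent tidyr because the extracted vectors are unnamed
    unnest_wider(out, names_sep = "_") %>%
    rename_at(-1, ~ c('y', 'z')) %>%
    type.convert(as.is = TRUE)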
# A tibble: 3 x 3
# x                                  y     z
#  <chr>                          <int> <int>
#1 i am 10 and she is 50             10    50
#2 he is 32 and i am 22              32    22
#3 he may be 70 and she may be 99    70    99
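
To make the difference between the two functions concrete, here is a minimal sketch on a plain character vector. The names first, all_matches, and as_matrix are only illustrative and are not part of the answer above.

```r
library(stringr)

x <- c("i am 10 and she is 50", "he is 32 and i am 22")

# str_extract() returns a character vector with the first match per string
first <- str_extract(x, "\\d{2}")

# str_extract_all() returns a list with one character vector per string
all_matches <- str_extract_all(x, "\\d{2}")

# simplify = TRUE returns a character matrix instead of a list
as_matrix <- str_extract_all(x, "\\d{2}", simplify = TRUE)
```

Because the list holds one vector per input row, it can be stored directly as a list column inside mutate(), which is what makes the later unnesting step possible.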

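If we instead need a single string column with ", " as the separator, then after extracting into a list we can loop over the list with map and use toString to join all elements of each entry into one string (toString is a wrapper for paste(., collapse=", ")).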

library(purrr)
d %>%
   mutate(y = str_extract_all(x, "\\b\\d{2}\\b") %>%
                 map_chr(toString))
# A tibble: 3 x 2
#  x                              y     
#  <chr>                          <chr> 
#1 i am 10 and she is 50          10, 50
#2 he is 32 and i am 22           32, 22
#3 he may be 70 and she may be 99 70, 99
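
Two details of this variant may be worth checking in isolation: what toString does to one extracted vector, and what the added \\b word boundaries change. The sketch below uses a made-up input string containing a three-digit number, which does not occur in the question's data.

```r
library(stringr)

# toString() collapses a vector into one comma-separated string
nums <- c("10", "50")
joined <- toString(nums)
same_as_paste <- identical(joined, paste(nums, collapse = ", "))

# Hypothetical input with a three-digit number
s <- "i am 100 and she is 50"

# Without boundaries, the first two digits of "100" are matched
no_boundary <- str_extract_all(s, "\\d{2}")[[1]]

# With \\b on both sides, only standalone two-digit numbers are matched
with_boundary <- str_extract_all(s, "\\b\\d{2}\\b")[[1]]
```

On the sample data both patterns give the same result, so the choice only matters if the text can contain longer numbers.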

【Comments】:

  • I just tried your code and it says it could not find function "unnest_wider".
【Solution 2】:

We can also use extract and unite from tidyr:

library(dplyr)
library(tidyr)

d %>%
  extract(x, c('y', 'z'), regex = "(\\d+)[^\\d]+(\\d+)", remove = FALSE) 

Output:

# A tibble: 3 x 3
  x                              y     z    
  <chr>                          <chr> <chr>
1 i am 10 and she is 50          10    50   
2 he is 32 and i am 22           32    22   
3 he may be 70 and she may be 99 70    99 
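
The pattern relies on two capture groups, one per output column. As a rough illustration of what each group captures, the sketch below uses stringr's str_match rather than extract, and writes [^\\d] as the equivalent \\D; the variable names are illustrative only.

```r
library(stringr)

pattern <- "(\\d+)\\D+(\\d+)"

# str_match() returns a character matrix:
# column 1 is the full match, columns 2 and 3 are the capture groups
m <- str_match("he may be 70 and she may be 99", pattern)
full_match <- m[1, 1]
first_num  <- m[1, 2]
second_num <- m[1, 3]

# A string with only one number does not match, so the groups are NA
one_number <- str_match("she is 5", pattern)
```

This also shows the limit of the approach: the pattern expects exactly two numbers per string, so a third number would be ignored and a string with a single number yields NA.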

To return a single column:

d %>%
  extract(x, c('y', 'z'), regex = "(\\d+)[^\\d]+(\\d+)", remove = FALSE) %>%
  unite('y', y, z, sep = ', ')

Output:

# A tibble: 3 x 2
  x                              y     
  <chr>                          <chr> 
1 i am 10 and she is 50          10, 50
2 he is 32 and i am 22           32, 22
3 he may be 70 and she may be 99 70, 99

【Comments】:
