【Question Title】: (Extract/Separate/Match) Groups in Any Order
【Posted】: 2019-01-20 13:22:08
【Question】:
# Sample Data Frame
df  <- data.frame(Column_A 
                  =c("1011 Red Cat", 
                     "Mouse 2011 is in the House 3001", "Yellow on Blue Dog walked around Park"))

I have a column of manually entered data that I am trying to clean up.

  Column_A 
1|1011 Red Cat                         |
2|Mouse 2011 is in the House 3001      |
3|Yellow on Blue Dog walked around Park|

I want to separate each feature into its own column, while still keeping Column_A so that other features can be extracted from it later.

  Colour               Code           Column_A
1|Red                 |1011          |Cat
2|NA                  |2011 3001     |Mouse is in the House
3|Yellow on Blue      |NA            |Dog walked around Park

So far I have been reordering them with gsub and capture groups, then separating them with tidyr::extract.

library(dplyr)
library(tidyr)
library(stringr)

df1 <- df %>% 

  # Reorders the Colours
  mutate(Column_A = gsub("(.*?)?(Yellow|Blue|Red)(.*)?", "\\2 \\1\\3", 
                         Column_A, perl = TRUE)) %>%
  # Removes Whitespaces 
  mutate(Column_A =str_squish(Column_A)) %>%
  # Extracts the Colours 
  extract(Column_A, c("Colour", "Column_A"), "(Red|Yellow|Blue)?(.*)") %>%

  # Repeats the preceding steps for codes
  mutate(Column_A = gsub("(.*?)?(\\b\\d{1,}\\b)(.*)?", "\\2 \\1\\3", 
                         Column_A, perl = TRUE)) %>%
  mutate(Column_A =str_squish(Column_A)) %>%
  extract(Column_A, c("Code", "Column_A"), "(\\b\\d{1,}\\b)?(.*)") %>%
  mutate(Column_A = str_squish(Column_A))

The result looks like this:

Colour      Code    Column_A
|Red        |1011   |Cat
|Yellow     |NA     |on Blue Dog walked around Park
|NA         |2011   |Mouse is in the House 3001

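This works for the first row, but not for rows where the groups are preceded by whitespace or separated by other words; I have been extracting and merging those afterwards. Is there a more elegant way to do this?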

【Comments】:

  • For the codes you can do a = trimws(gsub("\\s+"," ",gsub("\\D"," ",df$Column_A)))
  • For the colours you can do b = sub("(.*(Red|Yellow|Blue)).*","\\1",sub("^((?!(Blue|Red|Yellow)).)*","",as.matrix(df),perl = TRUE))
  • Thanks, but I also need to remove that information from Column_A. The tidyverse has something similar too: mutate(Colour= sapply(str_extract_all(Column_A,"Red|Yellow|Blue"),paste, collapse=" "))

Tags: r regex tidyverse tidyr stringr


【Solution 1】:

Here is a solution that combines stringr and gsub, using the list of colours that ships with R:

library(dplyr)
library(stringr)

# list of colours from R colors()
cols <- as.character(colors())

apply(df,
      1,
      function(x)

        tibble(
          # Extract CSV of colours
          Color = cols[cols %in% str_split(tolower(x), " ", simplify = T)] %>%
            paste0(collapse = ","),

          # Extract CSV of sequential lists of digits
          Code = str_extract_all(x, regex("\\d+"), simplify = T) %>%
            paste0(collapse = ","),

          # Remove colours and digits from Column_A
          Column_A = gsub(paste0("(\\d+|",
                                 paste0(cols, collapse = "|"),
                                 ")"), "", x, ignore.case = T) %>% trimws())) %>%
  bind_rows()

# A tibble: 3 x 3
  Color       Code      Column_A                  
  <chr>       <chr>     <chr>                     
1 red         1011      Cat                       
2 ""          2011,3001 Mouse  is in the House    
3 blue,yellow ""        on  Dog walked around Park
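
The output above still contains doubled spaces (for example "Mouse  is in the House"), and because the removal pattern has no word boundaries, a colour name that appears inside a longer word would be removed as well. The following is a minimal base R sketch of one way to tidy this. It assumes a short, hypothetical colour list (cols) in place of colors(), and it is not part of the original answer.

```r
# Sample data from the question
x <- c("1011 Red Cat",
       "Mouse 2011 is in the House 3001",
       "Yellow on Blue Dog walked around Park")

# Hypothetical short colour list; the answer above uses colors() instead
cols <- c("red", "yellow", "blue")

# Whole-word alternation: \b(red|yellow|blue)\b
col_pattern <- paste0("\\b(", paste(cols, collapse = "|"), ")\\b")

# Remove colours (case-insensitively), then whole-word digit runs
leftover <- gsub(col_pattern, "", x, ignore.case = TRUE, perl = TRUE)
leftover <- gsub("\\b\\d+\\b", "", leftover, perl = TRUE)

# Collapse repeated whitespace and trim both ends
leftover <- trimws(gsub("\\s+", " ", leftover))

leftover
```

With the sample data this should leave "Cat", "Mouse is in the House" and "on Dog walked around Park". The stray "on" remains because this approach removes individual colour words, not the phrase between them.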

【Comments】:

    【Solution 2】:

    Using the tidyverse, we can do

    library(tidyverse)
    
    colors <- paste0(c("Red", "Yellow", "Blue"), collapse = "|")
    
    df %>%
       mutate(Color = str_extract(Column_A,
                       paste0("(", colors, ").*(", colors, ")|(", colors, ")")),
               Code = str_extract_all(Column_A, "\\d+"), 
               Column_A = pmap_chr(list(Color, Code, Column_A), function(x, y, z) 
                  trimws(gsub(paste0("\\b", c(x,  y), "\\b", collapse = "|"), "", z))), 
               Code = map_chr(Code, paste, collapse = " "))
    
    
    #                 Column_A         Color      Code
    #1                    Cat            Red      1011
    #2 Mouse  is in the House           <NA> 2011 3001
    #3 Dog walked around Park Yellow on Blue      
    

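    We first use str_extract to extract the text between two colours. You can include in colors every colour that may occur in your data. We use paste0 to build the regular expression. For this example it would be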

    paste0("(", colors, ").*(", colors, ")|(", colors, ")")
    #[1] "(Red|Yellow|Blue).*(Red|Yellow|Blue)|(Red|Yellow|Blue)"
    

    which means: extract the text between two colours, or just a single colour.
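
To see how the two alternatives of this pattern behave, here is a small base R sketch, with regexpr and regmatches standing in for str_extract. The last input string is a hypothetical extra case that is not in the question's data.

```r
pattern <- "(Red|Yellow|Blue).*(Red|Yellow|Blue)|(Red|Yellow|Blue)"

x <- c("1011 Red Cat",
       "Mouse 2011 is in the House 3001",
       "Yellow on Blue Dog walked around Park",
       "Red Cat chased Blue Dog")   # hypothetical extra case

# First match per string; strings without a match stay NA
found  <- grepl(pattern, x, perl = TRUE)
colour <- rep(NA_character_, length(x))
colour[found] <- regmatches(x, regexpr(pattern, x, perl = TRUE))

colour
```

The first string only satisfies the single-colour alternative, the second has no colour at all, and the third matches the whole span "Yellow on Blue". Because `.*` is greedy, the hypothetical fourth string yields "Red Cat chased Blue": everything between the first and last colour is treated as part of the colour description, which may or may not be what you want for your data.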

    For the Code part, since there can be multiple Code values, we use str_extract_all to get every run of digits from the column. This part is initially stored as a list.

    For the Column_A values, we remove everything that was selected in Code and Color, using gsub with word boundaries added, and keep what remains.
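
One detail worth knowing about this step: paste0 turns a missing value into the string "NA", so when Color is NA the generated pattern contains \bNA\b and would also delete a literal word "NA" from the text. The sketch below shows a possible guard. remove_pieces is a hypothetical helper, not something from the answer, and it assumes the extracted pieces contain no regex metacharacters.

```r
# Hypothetical helper: remove the given pieces from one string as whole words
remove_pieces <- function(text, pieces) {
  # Drop missing or empty pieces so they never reach the pattern
  pieces <- pieces[!is.na(pieces) & nzchar(pieces)]
  if (length(pieces) == 0) return(trimws(text))

  # \bpiece1\b|\bpiece2\b|...
  pattern <- paste0("\\b", pieces, "\\b", collapse = "|")
  out <- gsub(pattern, "", text, perl = TRUE)

  # Collapse the gaps left behind and trim both ends
  trimws(gsub("\\s+", " ", out))
}

r1 <- remove_pieces("1011 Red Cat", c("Red", "1011"))
r2 <- remove_pieces("Mouse 2011 is in the House 3001", c(NA, "2011", "3001"))
r3 <- remove_pieces("Yellow on Blue Dog walked around Park", "Yellow on Blue")

c(r1, r2, r3)
```

For the sample rows this should give "Cat", "Mouse is in the House" and "Dog walked around Park", without the doubled space that appears in the output above.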

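    Since we stored Code as a list earlier, we convert it to a single string per row by collapsing the values. This returns an empty string for rows with no match. If needed, you can turn those back into NA by adding Code = replace(Code, Code == "", NA) to the chain.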
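
The list-then-collapse step can be tried in isolation. This is a base R sketch in which regmatches with gregexpr plays the role of str_extract_all and vapply plays the role of map_chr; the variable names are illustrative only.

```r
x <- c("1011 Red Cat",
       "Mouse 2011 is in the House 3001",
       "Yellow on Blue Dog walked around Park")

# One list element per row, holding zero or more digit runs
codes <- regmatches(x, gregexpr("[0-9]+", x))

# Collapse each element to a single string; no matches gives ""
code_chr <- vapply(codes, paste, character(1), collapse = " ",
                   USE.NAMES = FALSE)

# Optionally turn the empty strings back into NA
code_chr <- replace(code_chr, code_chr == "", NA)

code_chr
```

The third row has no digits, so its list element is an empty character vector, which collapses to "" and is then replaced by NA.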

    【Comments】:

    • Works perfectly; you just need to change "colors" in the list to Color.