【Posted】:2019-01-20 13:22:08
【Question】:
# Sample Data Frame
df <- data.frame(Column_A = c("1011 Red Cat",
                              "Mouse 2011 is in the House 3001",
                              "Yellow on Blue Dog walked around Park"))
I have a column of manually entered data that I am trying to clean up.
  Column_A
1|1011 Red Cat                         |
2|Mouse 2011 is in the House 3001      |
3|Yellow on Blue Dog walked around Park|
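I would like to separate each feature into its own column, while still keeping Column A so that I can extract other features from it later.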
  Colour         |Code      |Column_A
1|Red            |1011      |Cat
2|NA             |2011 3001 |Mouse is in the House
3|Yellow on Blue |NA        |Dog walked around Park
So far I have been using gsub with capture groups to reorder the values, and then tidyr::extract to separate them.
library(dplyr)
library(tidyr)
library(stringr)
df1 <- df %>%
  # Moves the first colour to the front of the string
  mutate(Column_A = gsub("(.*?)?(Yellow|Blue|Red)(.*)?", "\\2 \\1\\3",
                         Column_A, perl = TRUE)) %>%
  # Removes extra whitespace
  mutate(Column_A = str_squish(Column_A)) %>%
  # Extracts the leading colour into its own column
  extract(Column_A, c("Colour", "Column_A"), "(Red|Yellow|Blue)?(.*)") %>%
  # Repeats the preceding steps for the codes
  mutate(Column_A = gsub("(.*?)?(\\b\\d{1,}\\b)(.*)?", "\\2 \\1\\3",
                         Column_A, perl = TRUE)) %>%
  mutate(Column_A = str_squish(Column_A)) %>%
  extract(Column_A, c("Code", "Column_A"), "(\\b\\d{1,}\\b)?(.*)") %>%
  mutate(Column_A = str_squish(Column_A))
The result is as follows:
  Colour |Code |Column_A
 |Red    |1011 |Cat
 |Yellow |NA   |on Blue Dog walked around Park
 |NA     |2011 |Mouse is in the House 3001
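This works for the first row, but not for the other rows, where the values are separated by spaces and other words; for those I have been extracting the remaining values afterwards and merging them back in. Is there a more elegant way to do this?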
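The behaviour on the later rows can be reproduced in isolation. The sketch below uses only base R and one of the sample strings; the variable names `x` and `moved` are made up for illustration. It suggests that the trailing greedy group consumes the rest of the string on the first match, so gsub has nothing left to match and only the first code is moved to the front.

```r
# Minimal reproduction of the reordering step on one problem row.
x <- "Mouse 2011 is in the House 3001"

# The lazy first group stops at the first code, then the greedy last group
# swallows everything after it, including the second code. gsub() therefore
# performs a single replacement and 3001 stays at the end of the string.
moved <- gsub("(.*?)(\\b\\d+\\b)(.*)", "\\2 \\1\\3", x, perl = TRUE)
print(moved)
```

Since the following extract step only captures a code at the very start of the string, any later code remains in Column_A. The same reasoning applies to the second colour in "Yellow on Blue".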
【Comments】:
- For the codes, you can do
a = trimws(gsub("\\s+"," ",gsub("\\D"," ",df$Column_A)))
- For the colours, you can do
b = sub("(.*(Red|Yellow|Blue)).*","\\1",sub("^((?!(Blue|Red|Yellow)).)*","",as.matrix(df),perl = TRUE))
- Thanks, but I do also need to remove that information from Column A. There is something similar in the tidyverse as well:
mutate(Colour= sapply(str_extract_all(Column_A,"Red|Yellow|Blue"),paste, collapse=" "))
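Building on the str_extract_all idea above, one possible way to both collect the features and remove them from Column A is sketched below. It is only a sketch under some assumptions: the colour list is limited to Red, Yellow and Blue, the word "on" between two colours is treated as part of the colour (to match the desired "Yellow on Blue"), and `collapse_matches`, `colour_pat` and `code_pat` are hypothetical names introduced here.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Column_A = c("1011 Red Cat",
               "Mouse 2011 is in the House 3001",
               "Yellow on Blue Dog walked around Park"),
  stringsAsFactors = FALSE
)

# Assumption: colours may be chained with "on", e.g. "Yellow on Blue".
colour_pat <- "\\b(?:Red|Yellow|Blue)(?:\\s+on\\s+(?:Red|Yellow|Blue))*\\b"
# A code is any standalone run of digits.
code_pat <- "\\b\\d+\\b"

# Hypothetical helper: join all matches with a space, or NA when there are none.
collapse_matches <- function(x, pattern) {
  vapply(str_extract_all(x, pattern), function(m) {
    if (length(m) == 0) NA_character_ else paste(m, collapse = " ")
  }, character(1))
}

df2 <- df %>%
  mutate(
    Colour = collapse_matches(Column_A, colour_pat),
    Code = collapse_matches(Column_A, code_pat),
    # Column_A is overwritten last, so the two columns above still see the
    # original text; every colour and code is then stripped from it.
    Column_A = str_squish(
      str_remove_all(Column_A, paste(colour_pat, code_pat, sep = "|"))
    )
  ) %>%
  select(Colour, Code, Column_A)

print(df2)
```

On the sample data this should give "Red", NA and "Yellow on Blue" for Colour, and "1011", "2011 3001" and NA for Code, with the remaining words left in Column_A. Extracting all matches first and removing them afterwards avoids the reordering step, so the position of a colour or code within the string no longer matters.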
Tags: r regex tidyverse tidyr stringr