【Question title】:Replace multiple values using a reference table
【Posted】:2018-06-08 16:57:31
【Question】:

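I am cleaning a database in which one of the fields is "country", but the country names in the database do not match the output I need.

I thought of using the str_replace function, but I have more than 50 countries to fix, so that is not the most efficient way. I have already prepared a CSV file that lists each original country input together with the output I need.

This is what I have so far: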

library(stringr)
library(dplyr)
library(tidyr)
library(readxl)
database1<- read_excel("database.xlsx") 
database1$country <- str_replace(database1$country, "USA", "United States")
database1$country <- str_replace(database1$country, "UK", "United Kingdom")
database1$country <- str_replace(database1$country, "Bolivia", "Bolivia,Plurinational State of")
write.csv(database1, "test.csv", row.names=FALSE, fileEncoding = 'UTF-8', na="")

【Comments】:

Tags: r data-cleaning stringr countries


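【Solution 1】:

Note: the levels and labels passed to factor() must be unique, that is, they must not contain duplicates. Any value of country that is not listed in old_names becomes NA.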

# database1 <- read_excel("database.xlsx")  ## read database excel book
old_names <- c("USA", "UGA", "CHL") ## country abbreviations
new_names <- c("United States", "Uganda", "Chile")  ## country full form

Base R

database1 <- within( database1, country <- factor( country, levels = old_names, labels = new_names ))

data.table

library('data.table')
setDT(database1)
database1[, country := factor(country, levels = old_names, labels = new_names)]

database1
#          country
# 1: United States
# 2:        Uganda
# 3:         Chile
# 4: United States
# 5:        Uganda
# 6:         Chile
# 7: United States
# 8:        Uganda
# 9:         Chile

Data

database1 <- data.frame(country = c("USA", "UGA", "CHL", "USA", "UGA", "CHL", "USA", "UGA", "CHL"))
#    country
# 1     USA
# 2     UGA
# 3     CHL
# 4     USA
# 5     UGA
# 6     CHL
# 7     USA
# 8     UGA
# 9     CHL

Edit: Instead of two variables such as old_names and new_names, you can create a single named vector, countries.

countries <- c("USA", "UGA", "CHL")
names(countries) <- c("United States", "Uganda", "Chile")
within( database1, country <- factor( country, levels = countries, labels = names(countries) ))
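One caveat with the factor approach is that any value missing from `levels` is turned into NA, which matters when only some of the country names need fixing. The sketch below is a minimal illustration on toy data, not part of the original answer: the vector name `lookup` and the unmatched value "Peru" are assumptions made for the example. It keys the named vector by the old name and falls back to the original value when no replacement exists.

```r
# Toy data; "Peru" deliberately has no entry in the lookup.
database1 <- data.frame(country = c("USA", "UGA", "Peru"),
                        stringsAsFactors = FALSE)

# Named vector keyed by the old name (hypothetical name: lookup).
lookup <- c(USA = "United States", UGA = "Uganda", CHL = "Chile")

# factor() maps values that are not in `levels` to NA.
as_factor <- factor(database1$country, levels = names(lookup), labels = lookup)
as.character(as_factor)
# "United States" "Uganda" NA

# Index the lookup by the old names; unmatched values come back as NA.
mapped <- unname(lookup[database1$country])

# Keep the original value wherever no replacement was found.
database1$country <- ifelse(is.na(mapped), database1$country, mapped)
database1$country
# "United States" "Uganda" "Peru"
```

The result is a plain character column rather than a factor, which is usually what is wanted before writing the data back to CSV.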

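【Comments】:

    【Solution 2】:

    I have solved this kind of problem in the past with a similar approach, using a .csv file for bulk replacements.

    Example of the .csv file format: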

    library(data.table)
    
    ## Generate example replacements csv file to see the format used
    Replacements <- data.table(Old = c("USA","UGA","CHL"),
                               New = c("United States", "Uganda", "Chile"))
    
    fwrite(Replacements,"Replacements.csv")
    

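    Once you have "Replacements.csv", you can use it to replace all the names in one go with stringi::stri_replace_all_regex(). (For what it's worth, nearly the entire stringr package is essentially a wrapper around calls to stringi. Since stringi runs slightly faster and has more functionality, I prefer to stick with stringi.) See the stringi vs stringr blog post by HRBRMSTR.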

    library(data.table)
    library(readxl)
    library(stringi)
    
    ## Read in list of replacements
    Replacements <- fread("Replacements.csv")
    
    ## Read in file to be cleaned
    database1<- read_excel("database.xlsx")
    
    ## Perform Replacements
    database1$country <- stringi::stri_replace_all_regex(database1$country,
                                                  "^"%s+%Replacements$Old%s+%"$",
                                                  Replacements$New,
                                                  vectorize_all = FALSE)
    
    ## Write CSV
    write.csv(database1, "test.csv", row.names=FALSE, fileEncoding = 'UTF-8', na="")
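
Two details of the call above do most of the work: `vectorize_all = FALSE` applies every pattern/replacement pair in turn to every string (with the default `TRUE`, the i-th pattern is paired with the i-th string instead), and the `^` and `$` anchors restrict each replacement to whole-string matches. The following is a small sketch on made-up input; the vectors `x`, `old`, and `new` are assumptions for illustration only.

```r
library(stringi)

# Toy input; "USAF" is included to show why the anchors matter.
x   <- c("USA", "UGA", "USAF")
old <- c("USA", "UGA")
new <- c("United States", "Uganda")

# Unanchored: the pattern "USA" also matches inside "USAF".
unanchored <- stri_replace_all_regex(x, old, new, vectorize_all = FALSE)
unanchored
# "United States" "Uganda" "United StatesF"

# Anchored with ^ and $: only strings that match in full are replaced.
anchored <- stri_replace_all_regex(x, "^" %s+% old %s+% "$", new,
                                   vectorize_all = FALSE)
anchored
# "United States" "Uganda" "USAF"
```

Strings that match none of the patterns are returned unchanged, so countries without an entry in the replacement table are left as they are.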
    

    I tried to use base R data.frame syntax above where possible to avoid any confusion, but if I were doing this for myself I would stick with full data.table syntax, as follows:

    library(data.table)
    library(readxl)
    library(stringi)
    
    ## Read in list of replacements
    Replacements <- fread("Replacements.csv")
    
    ## Read in file to be cleaned (read_excel returns a tibble, so convert it for := to work)
    database1 <- as.data.table(read_excel("database.xlsx"))
    
    ## Perform Replacements
    database1[, country := stri_replace_all_regex(country,"^"%s+%Replacements[,Old]%s+%"$",
                                                  Replacements[,New],
                                                  vectorize_all = FALSE)]
    ## Write CSV
    fwrite(database1,"test.csv")
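
Because the patterns are built by pasting `^` and `$` around the old names, those names are interpreted as regular expressions, so an entry containing parentheses or dots would need escaping. When only whole-value replacements are needed, an exact lookup with base R's `match()` sidesteps that. This is a sketch under assumptions, not the answer's method: the table contents, including the "Korea (South)" row, are invented for the example.

```r
# Hypothetical replacement table in the same Old/New layout as Replacements.csv.
Replacements <- data.frame(Old = c("USA", "Korea (South)"),
                           New = c("United States", "Korea, Republic of"),
                           stringsAsFactors = FALSE)

# Toy data with one unmatched value and one missing value.
database1 <- data.frame(country = c("USA", "Peru", "Korea (South)", NA),
                        stringsAsFactors = FALSE)

# Position of each country in the Old column; NA where there is no entry.
idx <- match(database1$country, Replacements$Old)
hit <- !is.na(idx)

# Overwrite only the rows that have a replacement; everything else is kept.
database1$country[hit] <- Replacements$New[idx[hit]]
database1$country
# "United States" "Peru" "Korea, Republic of" NA
```

The regex version remains the better fit when the old values should be matched as patterns rather than as literal strings.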
    

    【Comments】:
