【Question title】: How to replace vector of regexps with one single expression?
【Posted】: 2019-10-18 07:33:20
【Question】:

I have 100+ different villages in my material. To make my visualizations meaningful, I need to group them into 22 municipalities, as follows:

TROLLHÄTTAN<-toupper(c("Trollhättan","Sjuntorp","Velanda","Åsaka","Upphärad"))
UDDEVALLA<-toupper(c("UDDEVALLA","KURVERÖD","AMMENÄS","FAGERHULT","LANESUND OCH ÖVERBY",
"LANESUND","ÖVERBY","RESTENÄS OCH ULVESUND","RESTENÄS","ULVESUND","STRAND","UTBY","HOGSTORP","SUND","SMEDSERÖD"))
VÄNERSBORG<-toupper(c("Vänersborg","Vargön","Brålanda","Frändefors","Nordkroken","Katrinedal"))
LYSEKIL<-toupper(c("Lysekil", "Brastad", "Grundsund", "Fiskebäckskil"))
FÄRGELANDA<-toupper(c("Färgelanda","Högsäter","Ödeborg","Stigen"))
MELLERUD<-toupper(c("Mellerud","Dals Rostock","Åsensbruk"))
ED<-toupper(c("Ed"))
BENGTSFORS<-toupper(c("Bengtsfors","Dals Långed","Billingsfors","Bäckefors","Skåpafors"))
ÅMÅL<-toupper(c("Åmål","Tösse","Fengersfors"))
STRÖMSTAD<-toupper(c("Strömstad","Skee","Kebal","Stare"))
TANUM<-toupper(c("Grebbestad","Tanumshede","Fjällbacka","Hamburgsund","Rabbalshede"))
SOTENÄS<-toupper(c("Hunnebostrand","Kungshamn","Smögen","Malmön","Bovallstrand"))
MUNKEDAL<-toupper(c("Munkedal","Dingle","Hällevadsholm","Hedekas","Torreby"))
ORUST<-toupper(c("Svanesund","Ellös","Hälleviksstrand","Mollösund","Henån","Höggeröd","Vindön och Töllås","Varekil","Vindön","Töllås"))
LILLA_EDET<-toupper(c("Lilla Edet","Lödöse","Lilla Edet västra","Göta","Nygård","Hjärtum"))
ALE<-toupper(c("Ale","Nödinge-Nol","Surte","Älvängen","Skepplanda","Alvhem"))
STENUNGSUND<-toupper(c("Jörlanda","Stora Höga","Timmervik","Spekeröd","Stenungsund","Stenungsön","Svartehallen","Svenshögen","Ucklum","Ödsmål"))
TJÖRN<-toupper(c("Bleket","Djupvik och Fagerfjäll","Höviksnäs","Klövedal","Kyrkesund och Bö","Kållekärr","Myggenäs","Rönnäng","Skärhamn","Stora Dyrön",
"Djupvik","Fagerfjäll","Kyrkesund","Bö"))
KUNGÄLV<-toupper(c("Aröd och Timmervik","Diseröd","Duvesjön","Harestad och Nereby","Kareby","Kode","Kovikshamn","Kungälv","Kärna",
"Lundby","Marstrand","Marstrand", "Arvidsvik","Risby","Rishammar","Signehög och Norrmannebo","Solberga","Tjuvkil","Ödsmål och Åseby",
"Ödsmåls mosse och Rörtången","Aröd","Timmervik","Harestad","Nereby","Signehög","Norrmannebo","Ödsmål","Åseby","Ödsmåls mosse","Rörtången"))
ALINGSÅS<-toupper(c("Alingsås","Ingared","Sollebrunn","Västra Bodarna","Gräfsnäs","Hemsjö","Stora Mellby","Hjälmared","Långared","Svanvik",
"Ryd","Magra"))
VARA<-toupper(c("Vara","Kvänum","Tråvad","Jung","Vedum","Larv","Stora Levene","Emtunga","Arentorp"))
ESSUNGA<-toupper(c("Nossebro","Främmestad","Jonslund"))
VÅRGÅRDA<-toupper(c("Vårgårda","Östadkulle","Horla"))
GRÄSTORP<-toupper(c("GRÄSTORP"))
LIDKÖPING<-toupper(c("Lidköping","Lidköping norra","Vinninga","Järpås","Filsbäck","Örslösa","Saleby"))
GÖTEBORG<-toupper(c("Göteborg","Gunnared och Hammarkullen","Torslanda","Billdal","Olofstorp","Donsö","Nolvik","Styrsö","Angered",
"Brännö","Säve","Helgered","Tumlehed","Asperö","Stenared","Vrångö","Gundal och Högås","Gunnared","Hammarkullen","Gundal","Högås"))

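I tried to replace the vectors of village names with the municipality names via mgsub (the textclean version), but ran into a problem. For example, "strand" (a village name) is also a common suffix. This means "HUNNEBOSTRAND" gets converted to "HUNNEBOUDDEVALLA", which is of course not what I want.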
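The effect described above can be reproduced with base R alone. This is a minimal sketch that uses plain `gsub()` instead of `mgsub`; the variable name `x` is only illustrative.

```r
# Two names: one is exactly "STRAND", the other merely ends in it
x <- c("STRAND", "HUNNEBOSTRAND")

# Unanchored pattern: "STRAND" is also replaced where it is only a suffix
unanchored <- gsub("STRAND", "UDDEVALLA", x)
# "UDDEVALLA", "HUNNEBOUDDEVALLA"

# Anchored pattern: only a string that is exactly "STRAND" is replaced
anchored <- gsub("^STRAND$", "UDDEVALLA", x)
# "UDDEVALLA", "HUNNEBOSTRAND"
```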

I tried writing my vectors as regular expressions:

LYSEKIL<-toupper(c("^Lysekil$", "^Brastad$", "^Grundsund$", "^Fiskebäckskil$"))

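I found that the textclean version of mgsub cannot handle regular expressions. I switched to the mgsub package, which wants my vectors to be of the same length, and that is not what I want. The qdap version of mgsub seems to behave in a similar way.

Is there a way to solve this?

Raw data (sensitive parts removed)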

structure(list(CITY = c("HENÅN", NA, "HENÅN", "ÄLVÄNGEN", NA, "TROLLHÄTTAN"), ZIPCODE = c(47395L, NA, 47332L, 44636L, NA, 46157L), COURSEOFFERING_ID = c(97113L, 97113L, 97113L, 97113L, 97113L, 97113L)), row.names = c(1L, 5L, 9L, 12L, 15L, 18L), class = "data.frame")

【Question comments】:

    Tags: r replace gsub


    【Solution 1】:

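    To avoid the problem of village names that are also suffixes, you can anchor the start and end of those names with ^ and $ (you had the right idea). But to replace the village names with the corresponding municipality name, you need a function that accepts regular expressions, such as gsub() (or stringr::str_replace_all()). To be safe, and to avoid having to work out which names would cause trouble, simply anchor every village name with ^ and $.

    Here is one option:

    1. Create a vector with your 100+ village names (I use your first 2 vectors as an example):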
    all_village_names <- c(
      "Trollhättan", "Sjuntorp", "Velanda", "Åsaka", "Upphärad",  "UDDEVALLA",
      "KURVERÖD", "AMMENÄS", "FAGERHULT", "LANESUND OCH ÖVERBY", "LANESUND",
      "ÖVERBY", "RESTENÄS OCH ULVESUND", "RESTENÄS", "ULVESUND", "STRAND",
      "UTBY", "HOGSTORP", "SUND", "SMEDSERÖD"
    )
    
    2. Use a series of gsub() calls to replace them with the corresponding municipality names (here again I use your first two lines of code):
    library(dplyr)
    
    all_village_names %>%
      gsub("^Trollhättan$|^Sjuntorp$|^Velanda$|^Åsaka$|^Upphärad$", "TROLLHÄTTAN", .) %>%
      gsub("^UDDEVALLA$|^KURVERÖD$|^AMMENÄS$|^FAGERHULT$|^LANESUND OCH ÖVERBY$|^LANESUND$|^ÖVERBY$|^RESTENÄS OCH ULVESUND$|^RESTENÄS$|^ULVESUND$|^STRAND$|^UTBY$|^HOGSTORP$|^SUND$|^SMEDSERÖD$", "UDDEVALLA", .)
    

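    You end up with a vector of 100+ elements (one for each initial village), but that vector consists only of your 22 municipality names.

    With the sample data I used, you get: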

     [1] "TROLLHÄTTAN" "TROLLHÄTTAN" "TROLLHÄTTAN" "TROLLHÄTTAN" "TROLLHÄTTAN"
     [6] "UDDEVALLA"   "UDDEVALLA"   "UDDEVALLA"   "UDDEVALLA"   "UDDEVALLA"  
    [11] "UDDEVALLA"   "UDDEVALLA"   "UDDEVALLA"   "UDDEVALLA"   "UDDEVALLA"  
    [16] "UDDEVALLA"   "UDDEVALLA"   "UDDEVALLA"   "UDDEVALLA"   "UDDEVALLA"  
    

    Without anchoring, LANESUND would become LANEUDDEVALLA (because SUND is replaced by UDDEVALLA). Anchoring prevents this.
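Writing each alternation by hand gets tedious for 22 municipalities. As a rough sketch that is not part of the original answer, the anchored patterns can be generated from a named list with `paste0()` and applied in a loop. The names `municipalities`, `to_pattern`, `recode_villages` and `city` are hypothetical. The lookup holds only a small subset of the villages from the question and is assumed to be upper case already. The sketch also assumes that village names contain no regex metacharacters and that each village belongs to a single municipality (if one is listed twice, the last match wins).

```r
# Hypothetical lookup: municipality name -> its villages (already upper case)
municipalities <- list(
  ORUST     = c("SVANESUND", "HENÅN", "VAREKIL"),
  ALE       = c("ALE", "SURTE", "ÄLVÄNGEN"),
  UDDEVALLA = c("UDDEVALLA", "STRAND", "SUND", "LANESUND")
)

# Build one anchored alternation per municipality,
# e.g. "^ALE$|^SURTE$|^ÄLVÄNGEN$"
to_pattern <- function(villages) {
  paste0("^", villages, "$", collapse = "|")
}

recode_villages <- function(x, lookup) {
  out <- x
  for (m in names(lookup)) {
    # Match against the original input, so a value that has already
    # been replaced is never matched again by a later pattern
    hit <- grepl(to_pattern(lookup[[m]]), x)
    out[hit] <- m
  }
  out
}

# CITY values from the question, plus "HUNNEBOSTRAND" as an illustrative extra
city <- c("HENÅN", NA, "HENÅN", "ÄLVÄNGEN", NA, "TROLLHÄTTAN", "HUNNEBOSTRAND")
recoded <- recode_villages(city, municipalities)
# "ORUST", NA, "ORUST", "ALE", NA, "TROLLHÄTTAN", "HUNNEBOSTRAND"
```

Names that are not in the lookup, and NA values, are left unchanged. Because every pattern is fully anchored, this amounts to exact matching, so a lookup based on `match()` or `%in%` would give the same result without any regular expressions.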

    【Comments】:
