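【Question title】: Regex to extract county name from string
【Posted】: 2017-04-29 16:50:39
【Question description】:

I'm trying to write a regular expression in R that extracts the county name from a string. Of course, you can't just grab the first word before "County", because some county names are two or three words long. This particular dataset also contains a few other tricky expressions that need handling. Here is my first attempt: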

library(data.table)

foo <- data.table(foo=c("Unemployment Rate in Southampton County, VA"
                        ,"Personal Income in Southampton County + Franklin City, VA"
                        ,"Mean Commuting Time for Workers in Southampton County, VA"
                        ,"Estimate of People Age 0-17 in Poverty for Southampton County, VA"))

foo[,county:=trimws(regmatches(foo,gregexpr("(?<=\\bfor|in\\b).*?(?=(City|Municipality|County|Borough|Census Area|Parish),)",foo,perl=T)),"both")]

Any help would be greatly appreciated!

【Question discussion】:

    Tags: r regex data.table


    【Solution 1】:

    Another strategy: match against a list of possible county names:

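    library(maps)     # county map data; names look like "state,county[:part]"
    library(stringi)
    # Keep only the county part of each "state,county" name
    counties <- sapply(strsplit(map("county", plot = FALSE)$names, ",", fixed = TRUE), "[", 2)
    # Drop sub-region suffixes after ":" and de-duplicate
    counties <- unique(sub("(.*?):.*", "\\1", counties))
    # Allow one optional character (e.g. a period) after a leading "st"
    counties <- sub("^st", "st.?", counties)
    foo <- c("Unemployment Rate in Southampton County, VA",
             "Personal Income in Southampton County + Franklin City, VA",
             "Mean Commuting Time for Workers in Southampton County, VA",
             "Estimate of People Age 0-17 in Poverty for Southampton County, VA")
    # Match any known county name that is not followed by "city"
    stri_extract_all_regex(
      foo, paste0("\\b(", paste(counties, collapse = "|"), ")\\b(?!\\s*city)"), case_insensitive = TRUE
    )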
    # [[1]]
    # [1] "Southampton"
    # 
    # [[2]]
    # [1] "Southampton"
    # 
    # [[3]]
    # [1] "Southampton"
    # 
    # [[4]]
    # [1] "Southampton"
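
If depending on the `maps` county list is not an option, a regex-only variant is possible. The following is a minimal sketch, not part of the original answer, and it rests on assumptions: the county name is a run of capitalized words sitting directly before one of the designators from the question (with "City" left out on purpose), and the word in front of that run is lowercase, such as "in" or "for". The helper name `extract_county` and the extra "Prince George" test string are made up for illustration. Names that contain lowercase words (for example "Isle of Wight") would be cut short by this pattern.

```r
foo <- c("Unemployment Rate in Southampton County, VA",
         "Personal Income in Southampton County + Franklin City, VA",
         "Mean Commuting Time for Workers in Southampton County, VA",
         "Estimate of People Age 0-17 in Poverty for Southampton County, VA",
         "Unemployment Rate in Prince George County, VA")  # hypothetical multi-word case

# One or more capitalized words, followed (lookahead) by a county-like designator.
# "City" is deliberately not in the list, so "Franklin City" is never returned.
pattern <- paste0(
  "\\b[A-Z][A-Za-z.'-]*(?: [A-Z][A-Za-z.'-]*)*",
  "(?= (?:County|Parish|Borough|Census Area|Municipality)\\b)"
)

# Hypothetical helper: first match per string, NA when nothing matches.
extract_county <- function(x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

county <- extract_county(foo)
```

Because `regexpr` returns only the first match, a string that lists several counties would yield just the first one; `gregexpr` would be the starting point if all of them are needed.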
    

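    【Discussion】:

    • How would you strip off the title and drop the trailing county names? Sometimes there is other stuff mixed in.
    • Add an example and the expected output to your post.
    • For example, "Personal Income in Southampton County + Franklin City, VA". I don't want the Franklin City part, but I also need the state. For the state, I figure I'll just cut off the last two letters. That means I can use your approach to extract the county.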
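
The state step described in the last comment could look roughly like the sketch below. It assumes every title ends in a two-letter state abbreviation, which holds for the sample strings but is not guaranteed for the full dataset. The check against `state.abb` (the 50 abbreviations shipped with base R) is an added safeguard, not something the commenter mentioned; note that it would also turn "DC" into `NA`.

```r
foo <- c("Unemployment Rate in Southampton County, VA",
         "Personal Income in Southampton County + Franklin City, VA")

# Assumption: the last two characters are the state abbreviation.
n <- nchar(foo)
state <- substr(foo, n - 1, n)

# Anything that is not one of the 50 abbreviations in state.abb becomes NA,
# so malformed titles are flagged instead of silently producing junk.
state[!state %in% state.abb] <- NA_character_
```

The county itself can then be extracted separately with the county-list approach from the answer.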