Regex to extract county name from string: answers

【Question title】: Regex to extract county name from string
【Posted】: 2017-04-29 16:50:39
【Question】:

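I'm trying to create a regex in R to extract the county name from a string. Of course, you can't just grab the first word before "County", because some counties have two- or three-word names. This particular dataset also contains a few other tricky expressions that need handling. Here is my first attempt: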

library(data.table)

foo <- data.table(foo=c("Unemployment Rate in Southampton County, VA"
                        ,"Personal Income in Southampton County + Franklin City, VA"
                        ,"Mean Commuting Time for Workers in Southampton County, VA"
                        ,"Estimate of People Age 0-17 in Poverty for Southampton County, VA"))

foo[,county:=trimws(regmatches(foo,gregexpr("(?<=\\bfor|in\\b).*?(?=(City|Municipality|County|Borough|Census Area|Parish),)",foo,perl=T)),"both")]

Any help would be greatly appreciated!

【Comments】:

Tags: r regex data.table

【Solution 1】:

Another strategy: match against a list of possible county names:

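library(maps)
library(stringi)

# Names in the maps county database look like "state,county"; keep the county part
counties <- sapply(strsplit(map("county", plot = FALSE)$names, ",", fixed = TRUE), "[", 2)
# Drop any ":subregion" suffix and de-duplicate
counties <- unique(sub("(.*?):.*", "\\1", counties))
# Allow one optional character (e.g. a period) after a leading "st", as in "St. Louis"
counties <- sub("^st", "st.?", counties)

foo <- c("Unemployment Rate in Southampton County, VA",
         "Personal Income in Southampton County + Franklin City, VA",
         "Mean Commuting Time for Workers in Southampton County, VA",
         "Estimate of People Age 0-17 in Poverty for Southampton County, VA")

# Match any known county name as a whole word, unless it is followed by "city"
stri_extract_all_regex(
  foo,
  paste0("\\b(", paste(counties, collapse = "|"), ")\\b(?!\\s*city)"),
  case_insensitive = TRUE
)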
# [[1]]
# [1] "Southampton"
# 
# [[2]]
# [1] "Southampton"
# 
# [[3]]
# [1] "Southampton"
# 
# [[4]]
# [1] "Southampton"
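
To see what the pattern does without installing maps or stringi, here is a minimal base R sketch. The four-entry `counties` vector and the sample strings in `x` are made up for illustration and merely stand in for the full list built above; only the pattern itself is taken from the answer.

```r
# Hypothetical mini-list standing in for the full county list (assumption)
counties <- c("southampton", "franklin", "isle of wight", "st.? louis")

# Alternation of all names, anchored on word boundaries, with a negative
# lookahead that rejects a name followed by optional whitespace and "city"
pattern <- paste0("\\b(", paste(counties, collapse = "|"), ")\\b(?!\\s*city)")

x <- c("Personal Income in Southampton County + Franklin City, VA",
       "Unemployment Rate in Isle of Wight County, VA",
       "Unemployment Rate in St. Louis County, MO")

# perl = TRUE is required for the lookahead; ignore.case handles capitalisation
found <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
print(found)
```

"Franklin" is rejected because "City" follows it, while multi-word names such as "Isle of Wight" match as a unit because each list entry is tried as a whole alternative.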

【Discussion】:

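How would you go about stripping the title and removing the trailing county name? Sometimes there is other stuff mixed in.
Add an example + expected output to your post.
For example, "Personal Income in Southampton County + Franklin City, VA". I don't want the Franklin City part, but I also need the state. For the state, I think I'll just cut off the last two letters. That means I can use your approach to extract the county.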
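
The plan in the last comment could be sketched roughly as follows, in base R. This is an illustration only: it assumes every string ends in a two-letter state code, and the two-entry `counties` vector is a hypothetical stand-in for the full list from the answer.

```r
x <- c("Unemployment Rate in Southampton County, VA",
       "Personal Income in Southampton County + Franklin City, VA")

# State: the last two characters (assumes each string ends in a state code)
state <- substr(x, nchar(x) - 1, nchar(x))

# County: first listed name that is not followed by "city"
counties <- c("southampton", "franklin")  # hypothetical short list
pattern <- paste0("\\b(", paste(counties, collapse = "|"), ")\\b(?!\\s*city)")
m <- regexpr(pattern, x, perl = TRUE, ignore.case = TRUE)

# regmatches() drops non-matching elements, so fill a vector of NAs instead
hit <- m > 0
county <- rep(NA_character_, length(x))
county[hit] <- regmatches(x, m)

result <- data.frame(county = county, state = state)
print(result)
```

Filling a pre-allocated vector keeps `county` aligned with `x` even when a string contains no known county name.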