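You seem to want to add a column to your tweets that identifies the state, based on the city mentioned in each tweet. There are a couple of problems with this. First, cities are not unique: different states can have cities with the same name, so the city does not uniquely identify the state. Second, a city can be referred to in more than one way. For example, there are four different São Paulos in Brazil, all of which might be referred to the same way, especially in a tweet: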
São Paulo de Olivença
São Paulo do Potengi
São Paulo das Missões
São Paulo
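To get a feel for how much ambiguity is involved, it may help to count the city names that occur in more than one state. The sketch below uses a small hand-made table, `muni_dup`, which is a hypothetical stand-in with the same `NAME_1` (state) and `NAME_2` (city) columns as the attribute table used further down; its rows are illustrative only.

```r
# Hypothetical toy table: NAME_2 = city, NAME_1 = state (illustrative rows only)
muni_dup <- data.frame(
  NAME_2 = c("Bom Jesus", "Bom Jesus", "Piau"),
  NAME_1 = c("Piauí", "Rio Grande do Sul", "Minas Gerais"),
  stringsAsFactors = FALSE
)

# Number of distinct states in which each city name occurs
states_per_city <- tapply(muni_dup$NAME_1, muni_dup$NAME_2,
                          function(s) length(unique(s)))

# City names that cannot identify a state on their own
ambiguous <- names(states_per_city)[states_per_city > 1]
print(ambiguous)
```

Running the same two statements on the real table would show how many names are affected before you rely on a name-only lookup.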
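With all those reservations, here is one way to append the city and state names. This code also deals with the possibility that no city is mentioned in the tweet.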
library(raster)
# this generates sample data - you have this already (??)
br <- getData(country="BR",level=2) # Brazil shapefile, admin level 2
# muni$NAME_1 has the state names; muni$NAME_2 has the city names
muni <- br@data # ~5500 municipalities in Brazil
set.seed(1) # for reproducible example
cities <- muni[sample(1:nrow(muni),90),]$NAME_2 # 90 random cities in brazil
cities <- c(cities,rep("",10)) # last 10% have no city mentioned
tweets <- sapply(1:100,function(i) paste("#random text",cities[i],"more random text"))
# you start here
result <- do.call(rbind,lapply(tweets,function(tweet) {
  indx <- sapply(muni$NAME_2, grepl, tweet, fixed=TRUE) # all matching cities
  # use only first match!! With no match, min() returns Inf (with a warning),
  # and indexing by Inf yields a row of NAs
  indx <- min(which(indx))
  muni[indx,c("NAME_2","NAME_1")] # NAME_1 contains the state
}))
tweets <- data.frame(tweets,result)
head(tweets)
#                                                        tweets    NAME_2       NAME_1
# 1462                       #random text Piau more random text      Piau Minas Gerais
# 2048                     #random text Estiva more random text    Estiva Minas Gerais
# 1474 #random text Nova Esperança do Sudoeste more random text Esperança      Paraíba
# 4997                    #random text Monções more random text   Monções    São Paulo
# 1110                      #random text Goiás more random text     Goiás        Goiás
# 4941                    #random text Jumirim more random text   Jumirim    São Paulo
tail(tweets)
#                             tweets NAME_2 NAME_1
# NA4 #random text  more random text   <NA>   <NA>
# NA5 #random text  more random text   <NA>   <NA>
# NA6 #random text  more random text   <NA>   <NA>
# NA7 #random text  more random text   <NA>   <NA>
# NA8 #random text  more random text   <NA>   <NA>
# NA9 #random text  more random text   <NA>   <NA>
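This output illustrates another problem: Esperança matches, even though the city actually mentioned is Nova Esperança do Sudoeste (which is in a different state). I don't see a simple way around this.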
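One partial mitigation, offered as a sketch rather than a real fix, is to keep the longest matching name instead of the first one, on the assumption that the longer name is the more specific mention. The helper `match_longest` and the toy table `muni_toy` are hypothetical names introduced here; `muni_toy` only mimics the `NAME_2`/`NAME_1` columns of the real table.

```r
# Hypothetical toy lookup table: NAME_2 = city, NAME_1 = state
muni_toy <- data.frame(
  NAME_2 = c("Esperança", "Nova Esperança do Sudoeste", "Piau"),
  NAME_1 = c("Paraíba", "Paraná", "Minas Gerais"),
  stringsAsFactors = FALSE
)

# Return the city/state row whose name is the longest one found in the tweet,
# or a row of NAs when no city name is found
match_longest <- function(tweet, lookup) {
  hit <- vapply(lookup$NAME_2,
                function(city) grepl(city, tweet, fixed = TRUE),
                logical(1), USE.NAMES = FALSE)
  if (!any(hit)) {
    return(data.frame(NAME_2 = NA_character_, NAME_1 = NA_character_,
                      stringsAsFactors = FALSE))
  }
  candidates <- lookup[hit, c("NAME_2", "NAME_1")]
  candidates[which.max(nchar(candidates$NAME_2)), ]
}

tweets_toy <- c("#random text Nova Esperança do Sudoeste more random text",
                "#random text Piau more random text",
                "#random text  more random text")
result_toy <- do.call(rbind, lapply(tweets_toy, match_longest, lookup = muni_toy))
print(data.frame(tweets = tweets_toy, result_toy))
```

This only helps when one candidate name contains another. It does nothing for identically named cities in different states, and a short name can still match inside an unrelated word.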