【Question title】: Matching and extracting country names from character strings in R
【Posted】: 2018-06-08 12:44:48
【Question】:

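I have been scraping review data from a website, and in the process I ended up with a character vector whose strings contain the user name, number of reviews, review date and country information. They look roughly like this: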

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
"James (10) - - MEXICO - NOV 22, 2017", 
"Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
"Alex (20000) - SOUTH KOREA- MAR 11, 2015")

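So far I can extract the name, the review number and the date, because they sit at a fixed position or have a consistent format. The problem is that the country name is positioned inconsistently, and the data points in each string are separated inconsistently too, sometimes by a comma and sometimes by a dash. Extracting only the upper-case strings either misses countries or breaks on countries whose names have two parts.

The maps package contains a list of countries. Is there a way to use str_extract_all from stringr to look for matches against that vector of country names and extract only the match?

【Comments】: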

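You can build a regex search string by combining all the country names: pat = paste(iso3166$ISOname, collapse="|"). Then str_extract_all(raw, regex(pat, ignore.case=TRUE)).
That sounds like a good fit for your data set, though it may need more testing.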
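The alternation idea from the comment can be sketched in base R as follows. This is only an illustration under assumptions: `country_names` is a short hypothetical stand-in for a real list such as maps::iso3166$ISOname, and `regmatches()`/`gregexpr()` are swapped in for `str_extract_all()` so the snippet needs no extra packages. Word boundaries are added because, with this stand-in list, an unanchored case-insensitive pattern also finds "usa" inside the user name "Susane".

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Hypothetical stand-in for a real country list (e.g. maps::iso3166$ISOname)
country_names <- c("USA", "Mexico", "Finland", "South Korea",
                   "Guinea", "Guinea-Bissau")

# Put longer names first so that, with ordered alternation (perl = TRUE),
# "Guinea-Bissau" is tried before its prefix "Guinea"
ordered_names <- country_names[order(-nchar(country_names))]

# \\b keeps the pattern from matching inside longer words
pat <- paste0("\\b(", paste(ordered_names, collapse = "|"), ")\\b")

matches <- regmatches(raw, gregexpr(pat, raw, ignore.case = TRUE, perl = TRUE))

# The pitfall without word boundaries: "usa" is found inside "Susane"
naive <- regmatches(raw[3], gregexpr("usa", raw[3], ignore.case = TRUE))[[1]]
```

Real country names can contain regex metacharacters (parentheses, dots), so a full list would probably need escaping before being pasted into a pattern.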

Tags: r web-scraping dplyr stringr data-processing

【Solution 1】:

You can do this with the maps library, as follows:

library(maps)

## Loading country data from package maps
data(world.cities)

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
     "James (10) - - MEXICO - NOV 22, 2017", 
     "Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
     "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

###Removing punctuation
raw <- gsub("[[:punct:]\n]","",raw)

# Split data at word boundaries
raw2 <- strsplit(raw, " ")

# Match each word against the country names in world.cities$country.etc
CountryList_raw <- (lapply(raw2, function(x)x[which(toupper(x) %in% toupper(world.cities$country.etc))]))

do.call(rbind, lapply(CountryList_raw, as.data.frame))

#      X[[i]]
#1        USA
#2     MEXICO
#3    FINLAND

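This works. However, you will need to fix country names that consist of more than one word afterwards, SOUTH KOREA in this case. strsplit breaks the string into single words, so neither "SOUTH" nor "KOREA" on its own matches a country name, and the fourth string produces no row in the output.

【Comments】:

【Solution 2】: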
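One possible way around the single-word limitation is to test adjacent word pairs as well as single words. The sketch below is not part of the original answer: `known` is a small hypothetical stand-in for the country column of maps::world.cities, and `find_country()` is a helper name made up for illustration.

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Hypothetical stand-in for a real country list, upper-cased for comparison
known <- toupper(c("USA", "Mexico", "Finland", "South Korea"))

# Illustrative helper: match single words and adjacent word pairs
find_country <- function(s, known) {
  # Replace punctuation with spaces, then split on runs of whitespace
  words <- strsplit(gsub("[[:punct:]]", " ", s), "[[:space:]]+")[[1]]
  words <- toupper(words[nzchar(words)])
  n <- length(words)
  # Adjacent pairs, e.g. "SOUTH KOREA"
  bigrams <- if (n > 1) paste(words[-n], words[-1]) else character(0)
  candidates <- c(bigrams, words)
  candidates[candidates %in% known]
}

countries_found <- lapply(raw, find_country, known = known)
```

Names of three or more words would need longer word windows, so this only covers the two-part case raised in the question.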

TL;DR

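I took the raw data and converted it into a data frame. Then I extracted the required information column by column, using a combination of regular expressions and iteration over the rows.

Import the necessary packages and raw data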

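To follow along, you need the following packages installed:

BBmisc: miscellaneous helper functions by B. Bischl and others, mostly for package development.
maps: draws geographical maps.
magrittr: a set of operators that make your code more readable.
purrr: a complete and consistent functional programming toolkit for R.
stringr: string helpers; only stringr::str_to_title() is used further down.

If you already have all of them, you can skip the install.packages() call.

install.packages( pkgs = c(  "BBmisc", "maps", "magrittr", "purrr", "stringr" ) )
library( BBmisc )
library( maps )
library( magrittr )
library( purrr )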

Import the raw data

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
         "James (10) - - MEXICO - NOV 22, 2017", 
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

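Declare four columns

Given the data stored in raw, four columns seem appropriate:

user_name: the name of the user
user_review_number: the identification number associated with the user's review
user_country: the user's country
user_review_date: the date, in Month Day, Year format, on which the user review was created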

raw <- data.frame( user_name = raw
       , user_review_number = raw
       , user_country = raw
       , user_review_date = raw
       , stringsAsFactors = FALSE
       )

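Regular expressions

Regular expressions allow complex and flexible search/replace operations using a specific syntax. They are used here to extract the relevant data from the raw data set.

Identifying raw$user_name

This column contains the user name, which appears before the parentheses.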

raw$user_name <- strsplit( x = raw$user_name
                           , split = "\\(|\\)"
                           , fixed = FALSE 
                           )
# keep only the first element from each list, then unlist to obtain a character vector
raw$user_name <- 
  purrr::map( .x = raw$user_name, .f = 1 ) %>%
  unlist()

# remove trailing whitespace
raw$user_name <- trimws( x = raw$user_name
                         , which = "right"
                         )

Identifying raw$user_review_number

This column contains the user's review number, an integer of anywhere between 1 and 10 digits enclosed in parentheses.

raw$user_review_number <- strsplit( x = raw$user_review_number
                                    , split = "\\(|\\)"
                                    , fixed = FALSE 
                                    )
# keep only the second element from each list, then unlist to obtain a character vector
# and cast as integer
raw$user_review_number <- 
  purrr::map( .x = raw$user_review_number, .f = 2 ) %>%
  unlist() %>%
  as.integer()
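As an alternative sketch, the name and the number can be captured in one pass with a single regular expression and capture groups. This uses base R's `regexec()` and `regmatches()` instead of the strsplit/purrr combination above; the variable names `m`, `user_name` and `user_review_number` are only illustrative.

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Group 1: everything before the first "("; group 2: the digits inside "(...)"
m <- regmatches(raw, regexec("^([^(]*)\\(([0-9]+)\\)", raw))

# Each element of m is c(full match, group 1, group 2);
# indexing a non-match (character(0)) yields NA
user_name <- trimws(vapply(m, function(p) p[2], character(1)))
user_review_number <- as.integer(vapply(m, function(p) p[3], character(1)))
```

One caveat with either approach: as.integer() returns NA (with a warning) for values above .Machine$integer.max, so ten-digit review numbers larger than 2147483647 would need as.numeric() instead.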

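Identifying raw$user_country

This column is a bit tricky. Some countries are set off by commas, others have two-part names (e.g. SOUTH KOREA), some are abbreviations (e.g. USA), and some include state information (e.g. North Carolina, USA).

There are a hundred ways to do this, but the logic I used consists of the following:

Several string splits, to account for the variety of ways in which the country information is stored.
Cross-referencing the resulting country names against the built-in data set maps::iso3166 (see page 9 in the PDF), which holds the country names and codes defined by ISO 3166, the country-code standard of the International Organization for Standardization (ISO).

See the code below.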

# first, split by the parentheses
raw$user_country <- strsplit( x = raw$user_country
                                    , split = "\\(|\\) "
                                    , fixed = FALSE 
)
# second, keep only the third elements from each list, then unlist to obtain character vector
raw$user_country <- 
  purrr::map( .x = raw$user_country, .f = 3 ) %>%
  unlist()
# third, split by the dash marks, either one or two
raw$user_country <- strsplit( raw$user_country
                          , split = "\\-|\\- \\-"
                          , fixed = FALSE
                          )
# fourth, keep only the second elements from each list, then unlist to obtain character vector
raw$user_country <-
  purrr::map( .x = raw$user_country, .f = 2 ) %>%
  unlist()
# fifth, clear leading and trailing white space
raw$user_country <- trimws( x = raw$user_country
                        , which = "both"
                        )
# sixth, separate states marked by the appearance of a comma
raw$user_country <- strsplit( x = raw$user_country
                         , split = ","
                         , fixed = TRUE
                         ) 
# seventh, make two vectors: 
# one for the first element (which may or may not be the state within a country)
maybe.country <- 
  purrr::map( .x = raw$user_country, .f = 1 ) %>%
  unlist()
# one for the second element (which will always be the country)
# note: need to convert NULL elements into NA
#       (`.default` is the current name of the argument formerly called `.null`)
definitely.country <-
  purrr::map( .x = raw$user_country, .f = 2, .default = NA ) %>%
  unlist()

# eighth, replace the indices within maybe.country 
#         whose indices in definitely.country are non-NA values
#         with those non-NA values from definitely.country.
# note: this is possible due to the indices within both 
#       maybe.country and definitely.country to be exact equivalents. 
#       (i.e. the 8th element in maybe.country will always align
#        with the 8th element in definitely.country )
maybe.country[
  which( !is.na( definitely.country ) )
  ] <- definitely.country[
    which( !is.na( definitely.country )  )
  ]

# ninth, assign the character vector maybe.country to raw$user_country
raw$user_country <- maybe.country

# tenth, remove all leading and trailing white space
raw$user_country <- trimws( x = raw$user_country
                        , which = "both"
                        )
# eleventh, leave elements of 2 or 3 characters unchanged (they are treated as
# country codes) and convert everything else to Title Case.
# note: This logic comes from the maps::iso3166 data frame, which stores
#       2- and 3-letter ISO 3166 country codes alongside country names.
raw$user_country <- ifelse( test = nchar( raw$user_country ) == 2 |
                          nchar( raw$user_country ) == 3
                        , yes = raw$user_country
                        , no = stringr::str_to_title( string = raw$user_country ) 
                        )
# twelfth, check to make sure that all characters are either
# 2 character, 3 character, ISO country codes/names,
# shorter name used in the `maps` package, or the sovereign country
# by ensuring the length of the elements who meet this criteria
# is equal to the length of raw$user_country
length(
  which( raw$user_country %in%  maps::iso3166$a2 |
         raw$user_country %in% maps::iso3166$a3 |
         raw$user_country %in% maps::iso3166$ISOname |
         raw$user_country %in% maps::iso3166$mapname |
         raw$user_country %in% maps::iso3166$sovereignty
       )
) == length( raw$user_country ) # [1] TRUE
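The final check only reports whether every element matched. A small lookup that maps each cleaned value to one canonical code makes it easier to see which rows failed, since unmatched values come back as NA. The sketch below rests on assumptions: `lookup` is a hand-written stand-in with only three of the columns found in maps::iso3166 (a2, a3 and a name column), and `to_a3()` is a hypothetical helper, not a function from any package.

```r
# Hand-written stand-in for a table such as maps::iso3166 (illustrative rows only)
lookup <- data.frame(
  a2   = c("US", "MX", "FI", "KR"),
  a3   = c("USA", "MEX", "FIN", "KOR"),
  name = c("United States", "Mexico", "Finland", "South Korea"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: try the 2-letter code, then the 3-letter code, then the name
to_a3 <- function(x, lookup) {
  key <- toupper(trimws(x))
  idx <- match(key, toupper(lookup$a2))
  missing <- is.na(idx)
  idx[missing] <- match(key[missing], toupper(lookup$a3))
  missing <- is.na(idx)
  idx[missing] <- match(key[missing], toupper(lookup$name))
  # NA index gives NA, which flags values that matched nothing
  lookup$a3[idx]
}

codes <- to_a3(c("USA", "Mexico", "FINLAND", "South Korea", "Atlantis"), lookup)
```

Comparing in upper case also removes the need for the Title Case step, at the cost of treating "Us" and "US" as the same input.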

Identifying raw$user_review_date

Assuming the review date is always the last piece of text in each string, here is how to pull out the data for this column.

raw$user_review_date <- strsplit( x = raw$user_review_date
                                  , split = "\\-\\s"
                                  , fixed = FALSE
                                  )

# keep only the last element from each list, 
# unlist to obtain a character vector,
# standardize the dates 
# note: assumes no NAs will appear for date
raw$user_review_date <- 
  purrr::map( .x = raw$user_review_date, .f = BBmisc::getLast ) %>%
  unlist() %>%
  as.Date( format = "%b %d, %Y" )
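The `%b` conversion depends on the month names of the current locale. If that is a concern, a locale-independent variant can be sketched with the built-in constant `month.abb`. This is an alternative under assumptions, not the method used above: `parse_review_date()` is a made-up helper, and it assumes English month names whose first three letters are unambiguous.

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015")

# Hypothetical helper: parse a trailing "MONTH D, YYYY" without relying on the locale
parse_review_date <- function(x) {
  pattern <- "([[:alpha:]]+)[[:space:]]+([0-9]{1,2}),[[:space:]]*([0-9]{4})[[:space:]]*$"
  parts <- regmatches(x, regexec(pattern, x))
  iso <- vapply(parts, function(p) {
    # p is c(full match, month, day, year), or character(0) when nothing matched
    if (length(p) != 4) return(NA_character_)
    month <- match(toupper(substr(p[2], 1, 3)), toupper(month.abb))
    if (is.na(month)) return(NA_character_)
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[3]))
  }, character(1))
  # An explicit format turns impossible dates into NA instead of an error
  as.Date(iso, format = "%Y-%m-%d")
}

review_dates <- parse_review_date(raw)
```

Strings without a recognisable trailing date come back as NA rather than stopping the whole conversion.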

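【Comments】:

【Solution 3】:

If

the country name is always in upper case, and
it is the first word that appears in upper case (i.e. the user name is never all upper case and the month field comes after the country field),

then the following regular expression extracts the country name: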

"[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*"

This also works for country names with several parts, and for names that use a dot to mark an abbreviation:

raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017", 
         "James (10) - - MEXICO - NOV 22, 2017", 
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016", 
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015", 
         "Peter (4711) - KINGDOM OF SOUTH NEVERLAND - DEC 24, 2016", 
         "Paul (0815) - REP. OF NORTH NEVERLAND - DEC 31, 2016")
stringr::str_extract(raw, "[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*")

[1] "USA"                        "MEXICO"                     "FINLAND"                   
[4] "SOUTH KOREA"                "KINGDOM OF SOUTH NEVERLAND" "REP. OF NORTH NEVERLAND"

Explanation

"[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*"

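looks for a sequence of two or more upper-case letters, optionally followed by a dot. This captures country names that consist of a single word.

To capture country names made up of several words, the parenthesised expression looks for any number of further sequences, each consisting of a whitespace character and another upper-case word with an optional dot.

Note that stringr::str_extract() is used because it returns only the first match, which avoids capturing the month name as well.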
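To see why only the first match is kept, and what happens when the assumptions do not hold, here is a small base R sketch. `regmatches()` with `gregexpr()` stands in for the stringr functions so the snippet runs without extra packages, and the fifth string is a hypothetical row, added only to show a user name written in upper case.

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015",
         "BOB (3) - PERU - JAN 5, 2018")  # hypothetical row, not from the question

pattern <- "[[:upper:]]{2,}[.]?(\\s[[:upper:]]{2,}[.]?)*"

# All matches per string: the country and the month both qualify
all_hits <- regmatches(raw, gregexpr(pattern, raw))

# Keep only the first match, as str_extract() does
first_hit <- vapply(all_hits,
                    function(h) if (length(h)) h[1] else NA_character_,
                    character(1))
```

For the first four strings the first match is the country, but for the hypothetical fifth string it is the user name "BOB", which is exactly the case the stated conditions rule out.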

【Comments】:

【Solution 4】:

My solution builds on Santosh's solution above, but handles multi-word countries by searching for each country separately.

  #assumes raw and world.cities are already loaded, as in the solution above
  #remove punctuation
  raw2 <- gsub("[[:punct:]\n]","",raw)
  #get the list of countries we're searching for
  countries = sort(unique(tolower(world.cities$country.etc)))
  #this will be the discovery matrix
  raw3 <- matrix(0,nrow=length(raw),ncol=length(countries))
  colnames(raw3) = countries
  #search for each country by itself
  for(i in countries){
    ind = grep(i,tolower(raw2))
    raw3[ind,i] = 1
  }
  #result is an nxk matrix, where n is the number of obs in raw
  #and k is the number of countries (239 in my test)
  raw3
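Because grep() treats each country name as an unanchored pattern, a short name can also match inside a longer word. The sketch below shows one way to guard against that with word boundaries and then read the country per row off the matrix. It is an illustration under assumptions: `countries` is a small hypothetical stand-in for the names in world.cities$country.etc, and the fifth string is a made-up row added to show the substring problem.

```r
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
         "James (10) - - MEXICO - NOV 22, 2017",
         "Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
         "Alex (20000) - SOUTH KOREA- MAR 11, 2015",
         "Ion (7) - Cluj, ROMANIA - JAN 2, 2018")  # hypothetical row

# Hypothetical stand-in for the real country list, lower-cased
countries <- c("usa", "mexico", "finland", "south korea", "romania", "oman")

# Replace punctuation with spaces so neighbouring words do not fuse together
clean <- tolower(gsub("[[:punct:]]", " ", raw))

# Logical discovery matrix: one row per string, one column per country
hits <- vapply(countries,
               function(cn) grepl(paste0("\\b", cn, "\\b"), clean, perl = TRUE),
               logical(length(clean)))
colnames(hits) <- countries

# Countries found in each row
found <- lapply(seq_len(nrow(hits)), function(i) countries[hits[i, ]])

# Without the boundaries, "oman" is found inside "romania"
unanchored <- grepl("oman", clean[5])
```

A full country list may contain regex metacharacters, so the names would probably need escaping before being wrapped in \\b.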

【Comments】: