TL;DR
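I took the raw data and converted it into a data frame. Then I extracted the required information column by column, using a combination of regular expressions and row-wise iteration.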
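Import the necessary packages and the raw data
To follow along with this tutorial, you need the following packages installed.
If you already have all of them, you can skip the install.packages() call.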
# stringr is included because stringr::str_to_title() is called later on
install.packages( pkgs = c( "BBmisc", "maps", "magrittr", "purrr", "stringr" ) )
library( BBmisc )
library( maps )
library( magrittr )
library( purrr )
Import the raw data
raw <- c("Anna (1025) - North Carolina, USA - DEC 20, 2017",
"James (10) - - MEXICO - NOV 22, 2017",
"Susane (222) - Oulu, FINLAND - JUNE 1, 2016",
"Alex (20000) - SOUTH KOREA- MAR 11, 2015")
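Declare four columns
Given the data stored in raw, four columns seem appropriate:
user_name: the name of the user
user_review_number: the identification number associated with the user's review
user_country: the country the user is from
user_review_date: the date, in "month day, year" format, on which the user review was created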
raw <- data.frame( user_name = raw
, user_review_number = raw
, user_country = raw
, user_review_date = raw
, stringsAsFactors = FALSE
)
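Regular expressions
Regular expressions allow complex, flexible search and replace operations using a dedicated pattern syntax. Here they are used to extract the relevant data from the raw data set.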
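As a quick illustration of the idea, here is a minimal base R sketch run on a single sample string. The variable names (example_string, digits_in_parens, before_parens) are made up for this example and are not part of the tutorial's code.

```r
# Hypothetical single input string, copied from the sample data
example_string <- "Anna (1025) - North Carolina, USA - DEC 20, 2017"

# regexpr() locates the first match of the pattern;
# regmatches() returns the text that was matched
digits_in_parens <- regmatches( x = example_string
                              , m = regexpr( pattern = "\\([0-9]+\\)"
                                           , text = example_string
                                           )
                              )
digits_in_parens  # "(1025)"

# sub() replaces the first match; here it removes everything from the
# whitespace before the opening parenthesis to the end of the string
before_parens <- sub( pattern = "\\s*\\(.*$"
                    , replacement = ""
                    , x = example_string
                    )
before_parens  # "Anna"
```

The double backslashes are needed because the parentheses are regex metacharacters and the backslash itself must be escaped inside an R string.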
Identifying raw$user_name
This column contains the user name, which is the text that comes before the parentheses.
raw$user_name <- strsplit( x = raw$user_name
, split = "\\(|\\)"
, fixed = FALSE
)
# keep only the first element from each list, then unlist to obtain a character vector
raw$user_name <-
purrr::map( .x = raw$user_name, .f = 1 ) %>%
unlist()
# remove trailing whitespace
raw$user_name <- trimws( x = raw$user_name
, which = "right"
)
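For readers who prefer to avoid the purrr dependency, the same "split, keep the first piece, trim" idea can be sketched in base R. This is only an illustrative alternative, not the approach used above; example_strings, pieces, and user_names are hypothetical names, and the vector is a copy of the sample data so that raw is left untouched.

```r
# Hypothetical copy of the sample data
example_strings <- c( "Anna (1025) - North Carolina, USA - DEC 20, 2017"
                    , "James (10) - - MEXICO - NOV 22, 2017"
                    , "Susane (222) - Oulu, FINLAND - JUNE 1, 2016"
                    , "Alex (20000) - SOUTH KOREA- MAR 11, 2015"
                    )

# strsplit() returns a list with one character vector per input string
pieces <- strsplit( x = example_strings, split = "\\(|\\)" )

# vapply() plays the role of purrr::map( .f = 1 ) followed by unlist():
# it takes the first piece of every list element and returns a character vector
first_piece <- vapply( X = pieces
                     , FUN = function( p ) p[ 1 ]
                     , FUN.VALUE = character( 1 )
                     )

# remove the trailing space left over before the opening parenthesis
user_names <- trimws( x = first_piece, which = "right" )
user_names  # "Anna" "James" "Susane" "Alex"
```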
Identifying raw$user_review_number
This column contains the user's review number: an integer of 1 to 10 digits enclosed in parentheses.
raw$user_review_number <- strsplit( x = raw$user_review_number
, split = "\\(|\\)"
, fixed = FALSE
)
# keep only the second element from each list, then unlist to obtain a character vector
# and cast as integer
raw$user_review_number <-
purrr::map( .x = raw$user_review_number, .f = 2 ) %>%
unlist() %>%
as.integer()
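The sketch below shows a capture-group variant of the same extraction in base R, and also illustrates a caveat with 10-digit values: R integers top out at 2147483647, so a larger number cannot be stored as an integer. The names used here (example_strings, review_numbers, too_big) are hypothetical, and "9999999999" is an invented value that does not appear in the sample data.

```r
# Hypothetical copy of the sample data
example_strings <- c( "Anna (1025) - North Carolina, USA - DEC 20, 2017"
                    , "James (10) - - MEXICO - NOV 22, 2017"
                    , "Susane (222) - Oulu, FINLAND - JUNE 1, 2016"
                    , "Alex (20000) - SOUTH KOREA- MAR 11, 2015"
                    )

# match the whole string, capture the digits inside the first pair of
# parentheses, and replace the whole string with that captured group
review_numbers <- sub( pattern = "^[^(]*\\(([0-9]+)\\).*$"
                     , replacement = "\\1"
                     , x = example_strings
                     )
review_numbers <- as.integer( review_numbers )
review_numbers  # 1025 10 222 20000

# largest value an R integer can hold
.Machine$integer.max  # 2147483647

# an invented 10-digit value above that limit becomes NA as an integer
# (with a warning, suppressed here), but survives as a double
too_big <- "9999999999"
too_big_as_integer <- suppressWarnings( as.integer( too_big ) )
too_big_as_numeric <- as.numeric( too_big )
```

If review numbers that large are expected, as.numeric() (or keeping the column as character) is a safer choice than as.integer().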
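Identifying raw$user_country
This column is a bit tricky. Some entries separate the location parts with a comma, some countries have two-part names (e.g. SOUTH KOREA), some are abbreviations (e.g. USA), and some include state information (e.g. North Carolina, USA).
There are a hundred ways to do this; the logic I used is laid out step by step in the comments of the code below.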
# first, split by the parentheses
raw$user_country <- strsplit( x = raw$user_country
, split = "\\(|\\) "
, fixed = FALSE
)
# second, keep only the third elements from each list, then unlist to obtain character vector
raw$user_country <-
purrr::map( .x = raw$user_country, .f = 3 ) %>%
unlist()
# third, split by the dash marks, either one or two
raw$user_country <- strsplit( raw$user_country
, split = "\\-|\\- \\-"
, fixed = FALSE
)
# fourth, keep only the second elements from each list, then unlist to obtain character vector
raw$user_country <-
purrr::map( .x = raw$user_country, .f = 2 ) %>%
unlist()
# fifth, clear leading and trailing white space
raw$user_country <- trimws( x = raw$user_country
, which = "both"
)
# sixth, separate states marked by the appearance of a comma
raw$user_country <- strsplit( x = raw$user_country
, split = ","
, fixed = TRUE
)
# seventh, make two vectors:
# one for the first element (which may or may not be the state within a country)
maybe.country <-
purrr::map( .x = raw$user_country, .f = 1 ) %>%
unlist()
# one for the second element (which will always be the country)
# note: `.default = NA` converts missing (NULL) second elements into NA
# (older purrr versions called this argument `.null`)
definitely.country <-
purrr::map( .x = raw$user_country, .f = 2, .default = NA ) %>%
unlist()
# eighth, overwrite the elements of maybe.country
# wherever definitely.country holds a non-NA value,
# using those non-NA values from definitely.country.
# note: this works because both vectors have the same length and order
# (i.e. the 8th element in maybe.country always aligns
# with the 8th element in definitely.country )
maybe.country[
which( !is.na( definitely.country ) )
] <- definitely.country[
which( !is.na( definitely.country ) )
]
# ninth, assign the character vector maybe.country to raw$user_country
raw$user_country <- maybe.country
# tenth, remove all leading and trailing white space
raw$user_country <- trimws( x = raw$user_country
, which = "both"
)
# eleventh, if an element is not 2 or 3 characters long,
# change its spelling to Capital Case; 2- and 3-character elements
# are treated as country codes and left untouched.
# note: This logic comes from the maps::iso3166 data frame, which holds
# the ISO 3166 country codes and names published by the
# International Organization for Standardization (ISO).
raw$user_country <- ifelse( test = nchar( raw$user_country ) == 2 |
nchar( raw$user_country ) == 3
, yes = raw$user_country
, no = stringr::str_to_title( string = raw$user_country )
)
# twelfth, check that every element is one of:
# a 2-character ISO code, a 3-character ISO code, an ISO country name,
# the shorter name used in the `maps` package, or the sovereign country.
# The check passes when the number of elements meeting this criterion
# equals the length of raw$user_country
length(
which( raw$user_country %in% maps::iso3166$a2 |
raw$user_country %in% maps::iso3166$a3 |
raw$user_country %in% maps::iso3166$ISOname |
raw$user_country %in% maps::iso3166$mapname |
raw$user_country %in% maps::iso3166$sovereignty
)
) == length( raw$user_country ) # [1] TRUE
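The twelve steps above can also be condensed into a handful of base R substitutions. The sketch below is one possible shortcut under the same assumptions about the input layout (name, parenthesised number, location, date), not a replacement for the walkthrough; example_strings, location, countries, and the helper to_capital_case() are hypothetical names introduced for this example.

```r
# Hypothetical copy of the sample data
example_strings <- c( "Anna (1025) - North Carolina, USA - DEC 20, 2017"
                    , "James (10) - - MEXICO - NOV 22, 2017"
                    , "Susane (222) - Oulu, FINLAND - JUNE 1, 2016"
                    , "Alex (20000) - SOUTH KOREA- MAR 11, 2015"
                    )

# 1. drop the user name and the parenthesised review number
location <- sub( pattern = "^[^)]*\\)", replacement = "", x = example_strings )
# 2. drop the trailing date (the last dash and everything after it)
location <- sub( pattern = "-[^-]*$", replacement = "", x = location )
# 3. drop leading dashes and spaces (covers both "- " and "- - ")
location <- sub( pattern = "^[ -]+", replacement = "", x = location )
# 4. if a comma is present, keep only the text after the last comma
location <- sub( pattern = "^.*,", replacement = "", x = location )
# 5. remove leading and trailing white space
location <- trimws( x = location, which = "both" )

# hypothetical helper: lower-case everything, then upper-case
# the first letter of every word
to_capital_case <- function( s ) {
  gsub( pattern = "\\b([a-z])"
      , replacement = "\\U\\1"
      , x = tolower( s )
      , perl = TRUE
      )
}

# 6. leave 2- and 3-character codes alone, Capital Case everything else
countries <- ifelse( test = nchar( location ) %in% 2:3
                   , yes = location
                   , no = to_capital_case( location )
                   )
countries  # "USA" "Mexico" "Finland" "South Korea"
```

The shortcut trades the explicit intermediate vectors for brevity, so it is harder to debug when a new, unexpected layout shows up in the data.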
Identifying raw$user_review_date
Assuming the review date is always the last piece of text in each string, here is how to extract the data for this column.
raw$user_review_date <- strsplit( x = raw$user_review_date
, split = "\\-\\s"
, fixed = FALSE
)
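# keep only the last element from each list,
# unlist to obtain a character vector,
# standardize the dates
# note: assumes no NAs will appear for date
# note: %b is matched against month names in the current locale
# (and also matches full month names on input, e.g. JUNE),
# so an English locale is assumed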
raw$user_review_date <-
purrr::map( .x = raw$user_review_date, .f = BBmisc::getLast ) %>%
unlist() %>%
as.Date( format = "%b %d, %Y" )
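Because as.Date() with %b depends on the session locale, a locale-independent variant may be useful on non-English systems. The sketch below is one way to do it, assuming every date follows the "MONTH day, year" layout seen in the sample data; it looks the month up in base R's month.abb constant instead of relying on the locale. All variable names are hypothetical.

```r
# Hypothetical date strings, as they appear at the end of the sample data
example_dates <- c( "DEC 20, 2017", "NOV 22, 2017", "JUNE 1, 2016", "MAR 11, 2015" )

# regexec() + regmatches() return, for each string, the full match
# followed by the three captured groups: month, day, year
parts <- regmatches( x = example_dates
                   , m = regexec( pattern = "^([A-Za-z]+) ([0-9]{1,2}), ([0-9]{4})$"
                                , text = example_dates
                                )
                   )
month_name <- vapply( parts, function( p ) p[ 2 ], character( 1 ) )
day        <- vapply( parts, function( p ) p[ 3 ], character( 1 ) )
year       <- vapply( parts, function( p ) p[ 4 ], character( 1 ) )

# compare the first three letters against month.abb ("Jan", "Feb", ...),
# which is always English regardless of locale
month_number <- match( toupper( substr( month_name, 1, 3 ) )
                     , toupper( month.abb )
                     )

# build ISO 8601 strings (yyyy-mm-dd), which as.Date() parses by default
parsed_dates <- as.Date( sprintf( "%s-%02d-%02d"
                                , year
                                , month_number
                                , as.integer( day )
                                )
                       )
format( parsed_dates )  # "2017-12-20" "2017-11-22" "2016-06-01" "2015-03-11"
```

A string that does not follow the assumed layout yields NA for its month, day, and year, and therefore an NA date, so unexpected formats are easy to spot afterwards.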