【Question title】: Extract country names (or other entities) from a column
【Posted】: 2019-11-05 10:08:48
【Question】:

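I have a data.frame whose location column contains countries and cities, and I would like to extract the former by matching against world.cities$country.etc from library(maps) (or any other collection of country names).

Consider this example: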

df <- data.frame(location = c("Aarup, Denmark",
                              "Switzerland",
                              "Estonia: Aaspere"),
                 other_col = c(2,3,4))

I tried this code:

df %>% extract(location,
               into = c("country", "rest_location"),
               remove = FALSE,
               function(x) x[which x %in% world.cities$country.etc])

But it did not work. I am expecting something like this:

          location other_col     country rest_location
1   Aarup, Denmark         2     Denmark       Aarup, 
2      Switzerland         3 Switzerland              
3 Estonia: Aaspere         4     Estonia     : Aaspere

【Comments】:

    Tags: r dataframe


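    【Solution 1】:

    We can build a pattern by pasting all the country names together, use str_extract_all to get every country name in location that matches the pattern, and remove the words that match a country name to get rest_location.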

    library(maps)
    library(stringr)
    
    all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
    df$country <- sapply(str_extract_all(df$location, all_countries), toString)
    df$rest_location <- str_remove_all(df$location, all_countries)
    #OR can also do
    #df$rest_location <- str_remove_all(df$location, df$country)
    
    df
    #          location other_col     country rest_location
    #1   Aarup, Denmark         2     Denmark       Aarup, 
    #2      Switzerland         3 Switzerland              
    #3 Estonia: Aaspere         4     Estonia     : Aaspere
    

    sapply with toString is used for country because, if location contains more than one country name, they are all concatenated into a single string.
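
As a rough illustration of that behaviour, here is a minimal base R sketch. It assumes a short hypothetical countries vector in place of unique(world.cities$country.etc), and it wraps the pattern in \\b word boundaries, which the answer above does not do, so that a name only matches as a whole word.

```r
# Hypothetical stand-in for unique(world.cities$country.etc)
countries <- c("Denmark", "Switzerland", "Estonia", "Niger", "Nigeria")

# One alternation pattern; \\b makes each name match only as a whole word
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

location <- c("Aarup, Denmark", "Switzerland", "Estonia: Aaspere",
              "Lagos, Nigeria", "Denmark / Estonia border", "Nowhere")

# gregexpr() locates every match; regmatches() extracts the matched text
matches <- regmatches(location, gregexpr(pattern, location, perl = TRUE))

# toString() collapses zero, one or several matches into a single string
country <- sapply(matches, toString)

# Whatever is left after deleting the country names
rest_location <- gsub(pattern, "", location, perl = TRUE)

data.frame(location, country, rest_location)
```

A row with two country names ends up with both in country, separated by a comma, and a row with no match gets an empty string rather than NA.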

    【Comments】:

      【Solution 2】:

      You can try this as a starting point:

      library(maps)       # provides world.cities; loaded first so it does not mask purrr::map
      library(tidyverse)
      df %>% 
        rownames_to_column() %>% 
        separate_rows(location) %>% 
        mutate(gr = location %in% world.cities$country.etc) %>% 
        mutate(gr = ifelse(gr, "country", "rest_location")) %>% 
        spread(gr, location) %>% 
        right_join(df %>% 
                    rownames_to_column(), 
                    by = c("rowname", "other_col")) %>% 
        select(location, other_col, country, rest_location)
      #          location other_col     country rest_location
      #1   Aarup, Denmark         2     Denmark         Aarup
      #2      Switzerland         3 Switzerland          <NA>
      #3 Estonia: Aaspere         4     Estonia       Aaspere
      

      Note that this only works when the location column contains just two "words". If necessary, you have to specify a suitable separator, e.g. sep = ",|:".
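
To show why the separator matters, here is a small base R sketch using strsplit rather than separate_rows. The countries vector and the "Auckland, New Zealand" row are hypothetical additions for illustration. Splitting only on a comma or colon keeps a multi-word name such as "New Zealand" in one piece.

```r
location <- c("Aarup, Denmark", "Switzerland", "Estonia: Aaspere",
              "Auckland, New Zealand")

# Hypothetical stand-in for unique(world.cities$country.etc)
countries <- c("Denmark", "Switzerland", "Estonia", "New Zealand")

# Split on a comma or colon (plus surrounding spaces), never on a bare space
parts <- strsplit(location, "\\s*[,:]\\s*")

# Per row: the pieces that are known countries, and the pieces that are not
country <- sapply(parts, function(p) toString(p[p %in% countries]))
rest_location <- sapply(parts, function(p) toString(p[!p %in% countries]))

data.frame(location, country, rest_location)
```

Splitting on whitespace instead would break "New Zealand" into "New" and "Zealand", neither of which is in the country list.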

      【Comments】:

        【Solution 3】:

        Base R (apart from the maps package):

        # Load maps for the world.cities data set:
        library(maps)

        # Split each location on whitespace (as.character guards against factor columns):
        country_city_vec <- strsplit(as.character(df$location), "\\s+")

        # Repeat other_col once per token and strip punctuation from the tokens:
        rolled_out_df <- data.frame(other_col = rep(df$other_col, lengths(country_city_vec)),
                                    location = gsub("[[:punct:]]", "", unlist(country_city_vec)),
                                    stringsAsFactors = FALSE)

        # Keep the tokens that are country names and join them back onto df:
        matched_with_world_df <- merge(df,
                                       setNames(rolled_out_df[rolled_out_df$location %in% world.cities$country.etc, ],
                                                c("other_col", "country")),
                                       by = "other_col", all.x = TRUE)

        # Remove the country names and punctuation to get the rest of the location:
        matched_with_world_df$rest_location <- trimws(gsub("[[:punct:]]", "",
                                                           gsub(paste0(matched_with_world_df$country, collapse = "|"),
                                                                "", matched_with_world_df$location)),
                                                      "both")
        

        【Comments】:
