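【Question title】: Remove strings from text and place into separate columns
【Posted】: 2020-04-15 18:51:48
【Question】:

I am trying to pull the latitude/longitude coordinates out of the strings below and put them into two separate columns, "lat" and "long", in R. I am not having much luck using separate with dplyr. Any help would be greatly appreciated.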

data_clean <- c("7205 MILLS CIVIC PKWY\nWEST DES MOINES 50266\n(41.561342, -93.806489)",
                "305 AIRPORT RD\nAMES 50010\n(42.001123, -93.61365)",                   
                "210 EAST TOWER PARK DR\nWATERLOO 50702\n(42.456362, -92.352552)")      

data_clean_df <- as.data.frame(data_clean)

【Comments】:

Tags: r data-cleaning


【Solution 1】:

We can use tidyr::extract to split each string in data_clean into three groups: the address, the latitude, and the longitude.

library(dplyr)
library(tidyr)

data_clean_df %>%
   # Replace newlines with spaces so that `.` can match across the whole address
   mutate(data_clean = gsub('\n', ' ', data_clean)) %>%
   # Groups: address, latitude, longitude; convert = TRUE makes lat/lon numeric
   extract(data_clean, into = c('address', 'lat', 'lon'), 
      regex = '(.*?)\\s*\\((.*),\\s+(.*)\\)', convert = TRUE)

#                                      address      lat       lon
#1 7205 MILLS CIVIC PKWY WEST DES MOINES 50266 41.56134 -93.80649
#2                   305 AIRPORT RD AMES 50010 42.00112 -93.61365
#3       210 EAST TOWER PARK DR WATERLOO 50702 42.45636 -92.35255
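
To see what each capture group in a three-group pattern of this shape picks up, here is a minimal base R sketch on a single record. The names `x`, `flat`, and `m` are illustrative and not part of the answer, and `trimws()` is used here only to drop the space left in front of the parenthesis.

```r
# One sample record from the question
x <- "305 AIRPORT RD\nAMES 50010\n(42.001123, -93.61365)"

# Flatten the newlines to spaces so the address reads as one line
flat <- gsub("\n", " ", x, fixed = TRUE)

# regexec() + regmatches() return the full match followed by one
# element per capture group: address, latitude, longitude
m <- regmatches(flat, regexec("(.*)\\((.*),\\s+(.*)\\)", flat))[[1]]

address <- trimws(m[2])   # text before the opening parenthesis
lat <- as.numeric(m[3])   # text between "(" and ","
lon <- as.numeric(m[4])   # text between ", " and ")"
```

Inspecting `m` directly is a quick way to debug a pattern before handing it to `extract()`.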

【Comments】:

    【Solution 2】:
    library(dplyr)
    library(tidyr)
    library(stringr)
    
    data_clean_df %>% 
      separate(data_clean, into = c("a", "b", "c"), sep = "\n") %>% 
      mutate(c = str_remove_all(c, "\\(|\\)")) %>%
      separate(c, c("lat", "lon"), sep = ", ", convert = TRUE)
    
                           a                     b      lat       lon
    1  7205 MILLS CIVIC PKWY WEST DES MOINES 50266 41.56134 -93.80649
    2         305 AIRPORT RD            AMES 50010 42.00112 -93.61365
    3 210 EAST TOWER PARK DR        WATERLOO 50702 42.45636 -92.35255
    

    【Comments】:

      【Solution 3】:

      Another option, if you just want to pull out the latitude and longitude:

      library(tidyverse)
      
      data_clean <- c("7205 MILLS CIVIC PKWY\nWEST DES MOINES 50266\n(41.561342, -93.806489)",
                      "305 AIRPORT RD\nAMES 50010\n(42.001123, -93.61365)",                   
                      "210 EAST TOWER PARK DR\nWATERLOO 50702\n(42.456362, -92.352552)")      
      
      data_clean_df <- as.data.frame(data_clean, stringsAsFactors = FALSE)
      
      data_clean_df %>%
        mutate(lat = str_extract(data_clean, "(?<=\\().*?(?=,)"),
               long = str_extract(data_clean, paste0("(?<=", lat, ",\\s).*?(?=\\))")))
      #>                                                              data_clean
      #> 1 7205 MILLS CIVIC PKWY\nWEST DES MOINES 50266\n(41.561342, -93.806489)
      #> 2                    305 AIRPORT RD\nAMES 50010\n(42.001123, -93.61365)
      #> 3       210 EAST TOWER PARK DR\nWATERLOO 50702\n(42.456362, -92.352552)
      #>         lat       long
      #> 1 41.561342 -93.806489
      #> 2 42.001123  -93.61365
      #> 3 42.456362 -92.352552
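
If every record holds exactly two decimal numbers (an assumption that fits the sample data, where street numbers and ZIP codes have no decimal point), a pattern that matches the numbers themselves avoids building the second pattern from `lat`. The following is a rough base R sketch of that idea, not the answerer's method; `nums` and `coords` are illustrative names.

```r
data_clean <- c("7205 MILLS CIVIC PKWY\nWEST DES MOINES 50266\n(41.561342, -93.806489)",
                "305 AIRPORT RD\nAMES 50010\n(42.001123, -93.61365)",
                "210 EAST TOWER PARK DR\nWATERLOO 50702\n(42.456362, -92.352552)")

# Find every signed decimal number in each string
# (assumption: only the two coordinates contain a decimal point)
nums <- regmatches(data_clean, gregexpr("-?[0-9]+\\.[0-9]+", data_clean))

# Convert each pair to numeric and stack the pairs into a 3 x 2 matrix
coords <- do.call(rbind, lapply(nums, as.numeric))
colnames(coords) <- c("lat", "long")
```

Unlike the `str_extract()` version, this yields numeric values rather than character strings, but it would pick up the wrong values if an address ever contained a decimal number.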
      

      【Comments】:

        【Solution 4】:

        Here is a base R solution using gsub():

        df <- data.frame(data_clean = gsub("(.*)\n.*","\\1",data_clean),
                         lat = gsub(".*?\\((.*),.*","\\1",data_clean),
                         lon = gsub(".*,\\s*(.*)\\)","\\1",data_clean))
        

        which gives

                                            data_clean       lat        lon
        1 7205 MILLS CIVIC PKWY\nWEST DES MOINES 50266 41.561342 -93.806489
        2                   305 AIRPORT RD\nAMES 50010 42.001123  -93.61365
        3       210 EAST TOWER PARK DR\nWATERLOO 50702 42.456362 -92.352552
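
Because gsub() returns character strings, the coordinates usually need converting to numeric before they are used. One possible base R alternative, sketched here under the assumption that each record ends with a "(lat, lon)" pair, is `utils::strcapture()`, which fills typed columns from capture groups. The names `proto`, `coords`, and the `address` column are illustrative.

```r
data_clean <- c("7205 MILLS CIVIC PKWY\nWEST DES MOINES 50266\n(41.561342, -93.806489)",
                "305 AIRPORT RD\nAMES 50010\n(42.001123, -93.61365)",
                "210 EAST TOWER PARK DR\nWATERLOO 50702\n(42.456362, -92.352552)")

# Prototype data frame: gives the output column names and types
proto <- data.frame(lat = numeric(), lon = numeric())

# One capture group per prototype column; values are converted to numeric
coords <- strcapture("\\((-?[0-9.]+),\\s*(-?[0-9.]+)\\)", data_clean, proto)

# Drop the trailing "\n(lat, lon)" part to keep only the address lines
df <- cbind(data.frame(address = sub("\n\\(.*$", "", data_clean)), coords)
```

The result has the same address text as the gsub() version, with `lat` and `lon` stored as numbers.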
        

        【Comments】:
