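【Question title】: dplyr transformation based on string filtering and conditions
【Posted】: 2020-06-19 14:13:28
【Question description】:

I want to transform a messy dataset in R,

but I am having trouble working out how to do it. Below are a sample dataset and the result I need to achieve: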

library(tibble)  # provides tribble()
dataset <- tribble(
  ~ID, ~DESC,
  1, "3+1Â 81Â mÂ", 
  2, "2+1Â 90Â mÂ",
  3, "3+KK 28Â mÂ",
  4, "3+1 120 m (Mezone)")
dataset

dataset_tranformed <- tribble(
  ~ID, ~Rooms, ~Meters, ~Mezone, ~KK,
  1, 4, 81,0, 0,
  2, 3, 90,0,0,
  3, 3, 28,0,1,
  4, 4, 120,1, 0)
dataset_tranformed

The column needs to be split first, but dataset %>% separate(DESC, c("size", "meters_squared", "Mezone"), sep = " ") does not work, because (Mezone) gets dropped.

【Question discussion】:

    Tags: r dplyr tidyverse data-manipulation


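    【Solution 1】:

    We can do this by extracting the components separately and evaluating the room expression (for example "3+1") to get the room count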

    library(dplyr)
    library(stringr)
    library(tidyr)
    library(purrr)   # map_dbl() comes from purrr
    dataset %>% 
       mutate(Rooms = map_dbl(DESC,  ~
           str_extract(.x, "^\\d+\\+\\d*") %>% 
             str_replace("\\+$", "+0") %>% 
             rlang::parse_expr(.) %>% 
             eval ), 
       Meters = str_extract(DESC, "(?<=\\s)\\d+(?=Â)"),
       Mezone = +(str_detect(DESC, "Mezone")),
       KK = +(str_detect(DESC, "KK"))) %>%
      select(-DESC)
    # A tibble: 4 x 5
    #     ID Rooms Meters Mezone    KK
    #  <dbl> <dbl> <chr>   <int> <int>
    #1     1     4 81          0     0
    #2     2     3 90          0     0
    #3     3     3 28          0     1
    #4     4     4 120         1     0
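
The parse_expr/eval step above is only there to add the numbers on either side of the "+". As a minimal sketch of an alternative that avoids evaluating strings as code, the digits can be pulled out and summed directly. Note that count_rooms is a hypothetical helper name, not part of the answer, and the sample strings are cleaned-up stand-ins for the question's data.

```r
library(stringr)

# Hypothetical helper (an assumption for illustration, not from the answer):
# sum the numeric parts of a leading room spec such as "3+1" or "3+".
count_rooms <- function(desc) {
  # Same leading pattern as in the answer: digits, "+", optional digits
  spec <- str_extract(desc, "^\\d+\\+\\d*")
  # One character vector of digit runs per input string
  parts <- str_extract_all(spec, "\\d+")
  # Add up the digit runs; "3+" (the KK case) contributes only the 3
  vapply(parts, function(p) sum(as.numeric(p)), numeric(1))
}

rooms <- count_rooms(c("3+1 81 m", "2+1 90 m", "3+KK 28 m", "3+1 120 m (Mezone)"))
rooms
```

Inside the pipeline this could replace the map_dbl(...) call, e.g. mutate(Rooms = count_rooms(DESC)). Wrapping the Meters extraction in as.numeric() would likewise give a numeric column instead of the <chr> shown in the output.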
    

    Another option is to use extract and then str_detect

    dataset %>% 
       extract(DESC, into = c("Rooms1", "Rooms2", "Meters"), 
         "^(\\d+)\\+(\\d*)[^0-9]+(\\d+)", convert = TRUE, remove = FALSE) %>%
       transmute(ID, Mezone = +(str_detect(DESC, "Mezone")),
            KK = +(is.na(Rooms2)), Rooms =  Rooms1 + replace_na(Rooms2, 0), Meters )
    # A tibble: 4 x 5
    #     ID Mezone    KK Rooms Meters
    #  <dbl>  <int> <int> <dbl>  <int>
    #1     1      0     0     4     81
    #2     2      0     0     3     90
    #3     3      0     1     3     28
    #4     4      1     0     4    120
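
To see why KK = +(is.na(Rooms2)) works in the extract version, it can help to look at what the three capture groups pick up. The sketch below applies the same pattern with stringr::str_match to two cleaned-up sample strings (the sample strings are an assumption; the pattern is the one from the answer). For "3+KK" the second group has no digits to match, and converting that column the way convert = TRUE does (via type.convert) gives NA.

```r
library(stringr)

# Same pattern as in the extract() call above:
# group 1 = digits before "+", group 2 = optional digits after "+",
# group 3 = the first run of digits after some non-digit characters
pattern <- "^(\\d+)\\+(\\d*)[^0-9]+(\\d+)"

samples <- c("3+1 120 m (Mezone)", "3+KK 28 m")
m <- str_match(samples, pattern)

# Column 1 is the full match; columns 2 to 4 are the capture groups
groups <- m[, 2:4]
groups

# Mimic convert = TRUE on the second group: the KK row becomes NA
rooms2 <- type.convert(m[, 3], as.is = TRUE)
rooms2
```

One caveat with this approach: any row with no digits after the "+" is flagged as KK, so it relies on KK being the only non-numeric suffix in the data.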
    
