【Question title】: How to separate a column into two columns based on a condition
【Posted】: 2019-11-22 17:27:42
【Question description】:

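I have a dataset of coral measurements. Along with each measurement, additional metadata were collected, including the position, or "Location", of the colony on the experimental module. I'm trying to split the Location column in my data frame into horizontal and vertical components. Each location code is an alphanumeric entry in which the letter stands for the column (A-D) and the numeric part stands for the row (1-4).

In many cases a coral sits on the edge of the next row (e.g. A1_2) or the next column (e.g. A_B1), so the entry format changes from one letter and one number to one letter and two numbers, or two letters and one number.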

d <- structure(list(`Module #` = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("111", "112", "113", "114", "115", 
"116", "211", "212", "213", "214", "215", "216"), class = "factor"), 
    Side = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
    ), .Label = c("N", "S", "T"), class = "factor"), TimeStep = c(4L, 
    4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), Location = c("A1", "A1_2", 
    "A2", "A3", "A3_4", "A4", "B_C3", "B1", "B1_2", "B2"), Date = structure(c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_), class = "Date"), Year = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("17", "18"
    ), class = "factor"), Site = structure(c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_), .Label = c("HAN", 
    "WAI"), class = "factor"), Treatment = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), recruits = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Site_long = structure(c(2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Hanauma Bay", 
    "Waikiki"), class = "factor"), Shelter = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("High", "Low"
    ), class = "factor")), row.names = c(NA, 10L), class = "data.frame")

head(d)

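I'd like to end up with a data frame containing 2 new columns: one named "Column" and one named "Row". "Column" refers to the letter part of the location code and "Row" to the number part. Note that the values in each new column should be either 1 or 3 characters long (e.g. Column = A for A1_2, or Column = A_B for A_B1).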

【Comments】:

    Tags: r filter dplyr


    【Solution 1】:

    We can use str_extract to pull out each part separately

    library(tidyverse)
    d %>%
      mutate(Column = str_extract(Location, "[A-Z]_?[A-Z]?"), 
             Row = str_extract(Location, "[0-9]_?[0-9]?")) %>%
      select(Location, Column, Row)
    
    #   Location Column Row
    #1        A1      A   1
    #2      A1_2      A 1_2
    #3        A2      A   2
    #4        A3      A   3
    #5      A3_4      A 3_4
    #6        A4      A   4
    #7      B_C3    B_C   3
    #8        B1      B   1
    #9      B1_2      B 1_2
    #10       B2      B   2
    

    Or use tidyr::extract to split the column with a single regex

    d %>%
       extract(Location, into = c("Column", "Row"), 
               regex = "([A-Z]_?[A-Z]?)([0-9]_?[0-9]?)")
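
Note that tidyr::extract drops the source column by default (`remove = TRUE`). If Location should stay next to the new columns, `remove = FALSE` can be passed. A minimal sketch, assuming a small stand-in data frame `d_small` (a hypothetical name) that holds only a few of the question's location codes:

```r
library(tidyr)

# Small stand-in for the question's data frame (only the Location column)
d_small <- data.frame(Location = c("A1", "A1_2", "B_C3"),
                      stringsAsFactors = FALSE)

# remove = FALSE keeps the original Location column alongside the new ones
res <- extract(d_small, Location, into = c("Column", "Row"),
               regex = "([A-Z]_?[A-Z]?)([0-9]_?[0-9]?)",
               remove = FALSE)
```

Here `res` should contain Location, Column and Row, with both new columns stored as character.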
    

    We can also use base R sub to extract the values with a similar regex

    d$Column <- sub("([A-Z]_?[A-Z]?).*", "\\1", d$Location)
    d$Row <- sub("[A-Z]_?[A-Z]?([0-9]_?[0-9]?)", "\\1", d$Location)
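
One caveat with the sub approach: when the pattern does not match, sub returns the input string unchanged, so a malformed code would be copied as-is into the new columns. A minimal sketch of a guarded version, assuming columns are limited to A-D and rows to 1-4 as the question states; the names `codes`, `pattern` and `ok` are illustrative, and "E5" is a made-up malformed entry:

```r
# Codes from the question plus one made-up malformed entry ("E5")
codes <- c("A1", "A1_2", "B_C3", "E5")

# Letter A-D (optionally "_" and a second letter), then digit 1-4
# (optionally "_" and a second digit), anchored to the whole string
pattern <- "^([A-D](_[A-D])?)([1-4](_[1-4])?)$"
ok <- grepl(pattern, codes)

# Split only the well-formed codes; use NA for the rest.
# Group 1 is the letter part, group 3 is the number part.
Column <- ifelse(ok, sub(pattern, "\\1", codes), NA_character_)
Row    <- ifelse(ok, sub(pattern, "\\3", codes), NA_character_)
```

Rows where `ok` is FALSE can then be inspected before any analysis instead of slipping through unnoticed.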
    

    【Comments】:

      【Solution 2】:

      Using data.table and stringi:

      library('data.table')
      library('stringi')
      setDT(d)
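      # stri_extract_first_regex returns a character vector;
      # stri_extract_all_regex would produce list columns instead
      d[, .(Location, 
            Column = stri_extract_first_regex(Location, '[A-Z]_?[A-Z]?'), 
            Row = stri_extract_first_regex(Location, '[0-9]_?[0-9]?'))]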
      
      #    Location Column Row
      # 1:       A1      A   1
      # 2:     A1_2      A 1_2
      # 3:       A2      A   2
      # 4:       A3      A   3
      # 5:     A3_4      A 3_4
      # 6:       A4      A   4
      # 7:     B_C3    B_C   3
      # 8:       B1      B   1
      # 9:     B1_2      B 1_2
      # 10:      B2      B   2
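
The call above returns a new table and leaves d itself unchanged. To add the two columns to the table in place, data.table's `:=` operator can be used. A minimal sketch, assuming a small stand-in table `dt` (a hypothetical name) and stringi's stri_extract_first_regex, which returns a plain character vector:

```r
library(data.table)
library(stringi)

# Small stand-in for the question's data (only the Location column)
dt <- data.table(Location = c("A1", "A1_2", "B_C3"))

# `:=` adds both columns to dt by reference, without copying the table
dt[, `:=`(Column = stri_extract_first_regex(Location, "[A-Z]_?[A-Z]?"),
          Row    = stri_extract_first_regex(Location, "[0-9]_?[0-9]?"))]
```

After this, `dt` holds Location, Column and Row, so no reassignment is needed.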
      

      【Comments】:
