【Title】: Reshape parts of columns into long format with regular expression
【Posted】: 2019-05-19 12:25:06
【Description】:

I have a data frame in wide format.

df <- data.frame(
time = as.Date('2009-01-01') + 0:5,
D.13.JA = rnorm(6, 0, 1),
D.40.JA = rnorm(6, 0, 1),
D.90.JA = rnorm(6, 0, 1),
A.13.JA = rnorm(6, 0, 1),
R.13.JA = rnorm(6, 0, 1)
)
        time    D.13.JA    D.40.JA    D.90.JA      A.13.JA     R.13.JA
1 2009-01-01 -2.2529442  0.1341954  0.3024757 -0.465533145 -0.49755117
2 2009-01-02  1.0698570 -1.3597724  0.6607091  0.001913148  0.92522135
3 2009-01-03  1.7558374 -1.0280084 -0.1446586 -0.355776775  0.12556738
4 2009-01-04 -0.2571767 -0.9065826  0.9340532 -0.150408270 -0.57386938
5 2009-01-05  0.2389923 -1.2818616  0.5643812 -1.272623868 -0.05700965
6 2009-01-06  1.6444592 -1.5610767 -1.4377561 -0.701273356  0.29777858

I would like to convert the data frame into this format:

        time DirDegree Type         Wh
1 2009-01-01      D.13   JA -2.2529442
2 2009-01-02      D.13   JA  1.0698570
3 2009-01-03      D.13   JA  1.7558374
4 2009-01-04      D.13   JA -0.2571767
5 2009-01-05      D.13   JA  0.2389923
6 2009-01-06      D.13   JA  1.6444592

So far I have managed to convert it into a tidy format:

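library(tidyr)  # gather(), separate(), extract(), and the %>% pipe

# Stack every column except time, then split the old column name on each dot
df.tidy = df %>%
    gather(key, Wh, -time) %>%
    separate(key, c("Dir", "Degree", "Type"), "\\.")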
        time Dir Degree Type          Wh
1 2009-01-01   D     13   JA -1.18105757
2 2009-01-02   D     13   JA  1.34437449
3 2009-01-03   D     13   JA -0.08451173
4 2009-01-04   D     13   JA -1.88959285
5 2009-01-05   D     13   JA  1.25388470
6 2009-01-06   D     13   JA -1.24286611

I then tried to keep the direction and degree together, based on this answer:

test1 = df %>%
    gather(key, value, -time) %>%
    extract(key, c("DirDeg", "Type"), "(..\\..)\\.(.)")

test2 = df %>%
    gather(key, value, -time) %>%
    extract(key, c("DirDeg", "Type"), "(\\.)\\.()")

Both of these give me:

         time DirDeg Type       value
1  2009-01-01   <NA> <NA> -1.18105757
2  2009-01-02   <NA> <NA>  1.34437449
3  2009-01-03   <NA> <NA> -0.08451173
4  2009-01-04   <NA> <NA> -1.88959285
5  2009-01-05   <NA> <NA>  1.25388470
6  2009-01-06   <NA> <NA> -1.24286611
7  2009-01-01   <NA> <NA> -0.55782526

【Comments】:

    Tags: r dplyr reshape tidyr


    【Solution 1】:

    Do:

    df.tidy = df %>%
      gather(key, Wh, -time) %>%
      extract(key, c("DirDeg", "Type"), "(.*)\\.(\\w+)$")
    

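    The greedy group (.*) captures everything up to the last . in the name, and (\\w+)$ captures the run of word characters (letters, digits, underscore) at the end of the string.

    Result: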

             time    DirDeg Type           Wh
    1  2009-01-01      D.13   JA   -2.2529442
    2  2009-01-02      D.13   JA    1.0698570
    3  2009-01-03      D.13   JA    1.7558374
    4  2009-01-04      D.13   JA   -0.2571767
    5  2009-01-05      D.13   JA    0.2389923
    6  2009-01-06      D.13   JA    1.6444592
    7  2009-01-01      D.40   JA    0.1341954
    8  2009-01-02      D.40   JA   -1.3597724
    9  2009-01-03      D.40   JA   -1.0280084
    10 2009-01-04      D.40   JA   -0.9065826
    11 2009-01-05      D.40   JA   -1.2818616
    12 2009-01-06      D.40   JA   -1.5610767
    13 2009-01-01      D.90   JA    0.3024757
    14 2009-01-02      D.90   JA    0.6607091
    15 2009-01-03      D.90   JA   -0.1446586
    16 2009-01-04      D.90   JA    0.9340532
    17 2009-01-05      D.90   JA    0.5643812
    18 2009-01-06      D.90   JA   -1.4377561
    19 2009-01-01      A.13   JA -0.465533145
    20 2009-01-02      A.13   JA  0.001913148
    21 2009-01-03      A.13   JA -0.355776775
    22 2009-01-04      A.13   JA -0.150408270
    23 2009-01-05      A.13   JA -1.272623868
    24 2009-01-06      A.13   JA -0.701273356
    25 2009-01-01      R.13   JA  -0.49755117
    26 2009-01-02      R.13   JA   0.92522135
    27 2009-01-03      R.13   JA   0.12556738
    28 2009-01-04      R.13   JA  -0.57386938
    29 2009-01-05      R.13   JA  -0.05700965
    30 2009-01-06      R.13   JA   0.29777858
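
To see what the pattern captures without going through the whole pipeline, here is a minimal base R sketch. It assumes only the column names from the question; the variable names (`keys`, `pattern`, `parts`, `failed`) are made up for illustration. It also checks the two patterns tried in the question, which appear never to match these names, and that would explain the NA columns.

```r
# Column names as they appear in the question's wide data frame
keys <- c("D.13.JA", "D.40.JA", "D.90.JA", "A.13.JA", "R.13.JA")

# Pattern from the answer: greedy prefix, a literal dot, then trailing word characters
pattern <- "(.*)\\.(\\w+)$"

# regexec() returns the full match followed by each capture group
matches <- regmatches(keys, regexec(pattern, keys))

# Keep only the two capture groups and bind them into a character matrix
parts <- do.call(rbind, lapply(matches, function(m) m[2:3]))
colnames(parts) <- c("DirDeg", "Type")
print(parts)

# The patterns tried in the question: the first needs a single character
# between the two dots, the second needs two consecutive dots.
# Neither occurs in names such as "D.13.JA".
failed <- c("(..\\..)\\.(.)", "(\\.)\\.()")
print(sapply(failed, function(p) any(grepl(p, keys, perl = TRUE))))
```

When a pattern does not match a value, extract() fills the new columns with NA for that row, so testing the pattern on a few names with grepl() first is a cheap sanity check.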
    

    【Comments】:

      【Solution 2】:

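      We can also use separate. The . occurs twice in each column name: 1) a . followed by a digit, and 2) a . followed by an uppercase letter. If we supply a regex lookahead so that sep matches only the . that comes before an uppercase letter, i.e. the second occurrence, the name is split at that point and D.13 stays in one piece.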

      library(tidyverse)
      df %>% 
        gather(key, Wh, -time) %>% 
        separate(key, into = c("DirDeg", "Type"), sep = "\\.(?=[A-Z])") %>%
        as_tibble
      # A tibble: 30 x 4
      #   time       DirDeg Type        Wh
      #   <date>     <chr>  <chr>    <dbl>
      # 1 2009-01-01 D.13   JA    -0.546  
      # 2 2009-01-02 D.13   JA     0.537  
      # 3 2009-01-03 D.13   JA     0.420  
      # 4 2009-01-04 D.13   JA    -0.584  
      # 5 2009-01-05 D.13   JA     0.847  
      # 6 2009-01-06 D.13   JA     0.266  
      # 7 2009-01-01 D.40   JA     0.445  
      # 8 2009-01-02 D.40   JA    -0.466  
      # 9 2009-01-03 D.40   JA    -0.848  
      #10 2009-01-04 D.40   JA     0.00231
      # … with 20 more rows
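
The effect of the lookahead can be checked in isolation with base R's strsplit(). This is only a sketch under the assumption that the type suffix always starts with an uppercase letter, as it does in the question's data; `keys`, `plain`, `pieces`, and `parts` are illustrative names.

```r
# A few column names from the question
keys <- c("D.13.JA", "D.40.JA", "A.13.JA")

# Splitting on every dot yields three pieces per name
plain <- strsplit(keys, "\\.")
print(plain[[1]])

# Splitting only on a dot that is followed by an uppercase letter yields two.
# perl = TRUE is required because lookahead (?=...) is a PCRE feature.
pieces <- strsplit(keys, "\\.(?=[A-Z])", perl = TRUE)
parts <- do.call(rbind, pieces)
colnames(parts) <- c("DirDeg", "Type")
print(parts)
```

Because the lookahead consumes no characters, the uppercase letter stays at the start of the second piece and only the dot itself is removed.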
      

      【Comments】:
