【Question Title】: Stacking/melting multiple columns into multiple columns in R
【Posted】: 2020-02-06 00:26:54
【Question】:

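I am trying to melt/stack/gather several specific columns of a data frame into 2 columns while keeping all other columns. I have tried many, many answers on Stack Overflow without success (some are shown below). My situation is basically the one in this post: Reshaping multiple sets of measurement columns (wide format) into single columns (long format), only with more columns to keep and combine. It is important to mention that my year columns are factors and that I have many more columns than the example below lists, so I want to refer to columns by name rather than by position.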

> df
ID Code Country   year.x value.x  year.y value.y  year.x.x value.x.x
1  A    USA       2000   34.33422 2001   35.35241 2002     42.30042
1  A    Spain     2000   34.71842 2001   39.82727 2002     43.22209
3  B    USA       2000   35.98180 2001   37.70768 2002     44.40232
3  B    Peru      2000   33.00000 2001   37.66468 2002     41.30232
4  C    Argentina 2000   37.78005 2001   39.25627 2002     45.72927
4  C    Peru      2000   40.52575 2001   40.55918 2002     46.62914

Based on the very similar post above, I tried pivot_longer from tidyr, which produced various errors depending on what I did:

pivot_longer(df, 
             cols = -c(ID, Code, Country), 
             names_to = c(".value", "group"),
             names_sep = ".")

I also used melt from reshape2 in various ways, which melted either only the value columns or only the year columns. For example:

new.df <- reshape2:::melt(df, id.var = c("ID", "Code", "Country"), measure.vars=c("value.x", "value.y", "value.x.x", "value.y.y", "value.x.x.x", "value.y.y.y"), value.name = "value", variable.vars=c('year.x','year.y', "year.x.x", "year.y.y", "year.x.x.x", "year.y.y.y", "value.x", variable.name = "year")

I also tried gather, as suggested in other posts, but I find the help pages and the posts hard to understand. To be clear about what I want to achieve:

ID Code Country   year value
1  A    USA       2000 34.33422
1  A    Spain     2000 34.71842
3  B    USA       2000 35.98180
3  B    Peru      2000 33.00000
4  C    Argentina 2000 37.78005
4  C    Peru      2000 40.52575
1  A    USA       2001 35.35241
1  A    Spain     2001 39.82727
3  B    USA       2001 37.70768
3  B    Peru      2001 37.66468
4  C    Argentina 2001 39.25627
4  C    Peru      2001 40.55918
1  A    USA       2002 42.30042
etc.

Any help here is much appreciated.

【Comments】:

Tags: r dplyr tidyr reshape2 melt


【Solution 1】:

We can specify names_pattern:

library(tidyr)
library(dplyr)
df %>%  
   pivot_longer(cols = -c(ID, Code, Country),
       names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")

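Or, following ?pivot_longer, use names_sep with the . escaped.

names_sep - names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).

This means that names_sep is interpreted as a regular expression by default, and in a regular expression . matches any character rather than a literal dot. To match a literal dot, either escape it (\\.) or place it inside square brackets ([.]).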

pivot_longer(df, 
         cols = -c(ID, Code, Country), 
          names_to = c(".value", "group"),
          names_sep = "\\.")
# A tibble: 18 x 6
#      ID Code  Country   group  year value
#   <int> <chr> <chr>     <chr> <int> <dbl>
# 1     1 A     USA       x      2000  34.3
# 2     1 A     USA       y      2001  35.4
# 3     1 A     USA       z      2002  42.3
# 4     1 A     Spain     x      2000  34.7
# 5     1 A     Spain     y      2001  39.8
# 6     1 A     Spain     z      2002  43.2
# 7     3 B     USA       x      2000  36.0
# 8     3 B     USA       y      2001  37.7
# 9     3 B     USA       z      2002  44.4
#10     3 B     Peru      x      2000  33  
#11     3 B     Peru      y      2001  37.7
#12     3 B     Peru      z      2002  41.3
#13     4 C     Argentina x      2000  37.8
#14     4 C     Argentina y      2001  39.3
#15     4 C     Argentina z      2002  45.7
#16     4 C     Peru      x      2000  40.5
#17     4 C     Peru      y      2001  40.6
#18     4 C     Peru      z      2002  46.6
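
The difference between an unescaped and an escaped dot can be checked in base R with strsplit(), which also treats its split argument as a regular expression by default. This is only a minimal sketch using two column names from df; it does not call tidyr, and it is meant to illustrate why names_sep = "." in the question could not split the names as intended.

```r
# Two column names taken from df above.
nms <- c("year.x", "value.x")

# Unescaped "." is a regex wildcard: every character is treated as a
# separator, so only empty pieces are left.
unescaped <- strsplit(nms, ".")
print(unescaped)

# Three equivalent ways to split on a literal dot in base R.
escaped   <- strsplit(nms, "\\.")           # backslash escape
bracketed <- strsplit(nms, "[.]")           # character class
literal   <- strsplit(nms, ".", fixed = TRUE)  # no regex at all

print(escaped)
```

Note that fixed = TRUE is an argument of base R's strsplit(); the answer above relies on the escaped form for names_sep.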

Update

For the updated dataset:

library(stringr)
df2 %>% 
   rename_at(vars(matches("year|value")), ~ 
     str_replace(., "^([^.]+\\.[^.]+)\\.([^.]+)$", "\\1\\2")) %>% 
     pivot_longer(cols = -c(ID, Code, Country),
        names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")

Or, without rename, using a regex lookaround:

df2 %>%
   pivot_longer(cols = -c(ID, Code, Country), 
       names_to = c(".value", "group"),
           names_sep = "(?<=year|value)\\.")
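
To see what the lookbehind does to the names, the same pattern can be tried in base R. This is a sketch under one assumption: base R's strsplit() with perl = TRUE stands in for whatever regex engine tidyr uses internally, which may differ. The column names are those of df2.

```r
# Names from the updated data (df2).
nms <- c("year.x", "value.y", "year.x.x", "value.x.x")

# (?<=year|value) is a lookbehind: a dot only counts as a separator when the
# text immediately before it is "year" or "value".
# perl = TRUE is needed for lookbehind support in base R.
parts <- strsplit(nms, "(?<=year|value)\\.", perl = TRUE)
names(parts) <- nms
print(parts)

# For comparison, a plain escaped dot splits at every dot, which gives
# three pieces for a name such as "year.x.x".
every_dot <- strsplit("year.x.x", "\\.")[[1]]
print(every_dot)
```

With the lookbehind, each name splits into exactly two pieces, which matches the two entries in names_to; splitting at every dot would not.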

Data

df <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A", 
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA", 
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L, 
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818, 
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L, 
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468, 
39.25627, 40.55918), year.z = c(2002L, 2002L, 2002L, 2002L, 2002L, 
2002L), value.z = c(42.30042, 43.22209, 44.40232, 41.30232, 45.72927, 
46.62914)), class = "data.frame", row.names = c(NA, -6L))



df2 <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A", 
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA", 
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L, 
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818, 
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L, 
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468, 
39.25627, 40.55918), year.x.x = c(2002L, 2002L, 2002L, 2002L, 
2002L, 2002L), value.x.x = c(42.30042, 43.22209, 44.40232, 41.30232, 
45.72927, 46.62914)), class = "data.frame", row.names = c(NA, 
-6L))

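【Comments】:

  • I should add that my columns continue like this: value.x.x.x and year.x.x.x, value.y.y.y, year.y.y.y, etc. This comes from previous left_joins.
  • I know. I tried to simplify the many columns in the example, so that was my mistake.
  • @KNN As you can see, the names_pattern I provided is based on the pattern you showed.
  • I updated the question to include more column names.
  • OK, I see what you did. Is there a way to make the str_replace() regex match only everything before the first period, so that the pattern does not have to be extended for every additional .? That is, so that it works with value.x, value.x.x, value.x.x.x, value.x.x.x.x, etc.?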
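
Regarding the last comment, one possible approach (a sketch, not part of the original answer) is a names_pattern whose first group cannot cross a dot, so it always stops at the first period regardless of how many follow. The pattern `^([^.]+)\\.(.*)$` and the object names `pat`, `split_names`, `wide` and `long` are assumptions introduced here for illustration; `wide` reuses the first two rows of df2 with a subset of its columns, and the names year.x.x.x and value.y.y.y come from the comments rather than from df2.

```r
# Column names from df2, plus two deeper suffixes mentioned in the comments.
nms <- c("year.x", "value.x", "year.x.x", "value.x.x",
         "year.x.x.x", "value.y.y.y")

# [^.]+ cannot match a dot, so group 1 ends at the first dot;
# group 2 takes the rest of the name, further dots included.
pat <- "^([^.]+)\\.(.*)$"

split_names <- data.frame(
  name   = nms,
  prefix = sub(pat, "\\1", nms),  # becomes the output column (.value)
  group  = sub(pat, "\\2", nms)   # becomes the "group" column
)
print(split_names)

# The same pattern passed to pivot_longer(); skipped if tidyr is missing.
if (requireNamespace("tidyr", quietly = TRUE)) {
  wide <- data.frame(
    ID = c(1L, 1L), Code = c("A", "A"), Country = c("USA", "Spain"),
    year.x = c(2000L, 2000L), value.x = c(34.33422, 34.71842),
    year.x.x = c(2002L, 2002L), value.x.x = c(42.30042, 43.22209)
  )
  long <- tidyr::pivot_longer(
    wide,
    cols = -c(ID, Code, Country),
    names_to = c(".value", "group"),
    names_pattern = pat
  )
  print(long)
}
```

Because the suffix is kept whole in group, no renaming step is needed first; the group column can be dropped afterwards if only year and value are wanted.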