Tidying data with several repeating variables in R: answers

【Question title】: Tidying data with several repeating variables in R
【Posted】: 2021-02-18 13:46:16
【Question】:

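I have a data frame that looks like the one below. Several variables (such as "c" and "z") are each measured for health, animals, environment, and money. In the actual data frame there are many other columns that do not follow this pattern, and they are interspersed among these columns.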

id  c_health  c_animals  c_enviro  c_money  z_health  z_animals  z_enviro  z_money
1   3         2          4         5        7         9          6         8
2   2         3          5         4        8         7          6         9
3   4         1          2         3        9         6          8         7

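I am trying to rearrange the data so that it is "tidy". I am not sure how to do this when my dataset has more than one such variable. This is the result I would like to end up with: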

id  c  z  message
1   3  7  health
1   2  9  animals
1   4  6  enviro
1   5  8  money
2   2  8  health
2   3  7  animals
2   5  6  enviro
2   4  9  money
3   4  9  health
3   1  6  animals
3   2  8  enviro
3   3  7  money

If the data frame contained only the following columns, I could tidy it like this:

id  c_health  c_animals  c_enviro  c_money
1   3         2          4         5
2   2         3          5         4
3   4         1          2         3

df <- df %>%
   gather(., key = "question", value = "response", 2:5)

【Comments】:

Tags: r dplyr tidyverse tidyr

【Solution 1】:

You can do this with pivot_longer from the tidyr package:

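library(tidyr)
library(dplyr)

df %>%
    # pivot every column after the first (id); this assumes all of them
    # are named prefix_suffix, e.g. c_health or z_money
    pivot_longer(cols = 2:ncol(df),
        # ".value" turns each prefix (c, z) into its own output column;
        # the suffix goes into a new "message" column
        names_to = c(".value", "message"),
        names_sep = "_")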
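
The question mentions that the real data frame has other columns mixed in among the c_ and z_ columns, so a positional range such as 2:ncol(df) may pick up columns that do not follow the pattern. A minimal sketch of one way to handle that is to select the columns by name instead. The `site` column and the regular expression `^[cz]_` are assumptions made for illustration and would need to be adapted to the real column names.

```r
library(tidyr)
library(dplyr)

# Example data from the question, plus a hypothetical non-pattern
# column ("site") placed between the c_ and z_ columns
df <- data.frame(
  id        = 1:3,
  c_health  = c(3, 2, 4),
  c_animals = c(2, 3, 1),
  c_enviro  = c(4, 5, 2),
  c_money   = c(5, 4, 3),
  site      = c("a", "b", "c"),
  z_health  = c(7, 8, 9),
  z_animals = c(9, 7, 6),
  z_enviro  = c(6, 6, 8),
  z_money   = c(8, 9, 7)
)

tidy <- df %>%
  pivot_longer(
    # only pivot columns whose names start with "c_" or "z_"
    cols = matches("^[cz]_"),
    names_to = c(".value", "message"),
    names_sep = "_"
  ) %>%
  # reorder to match the layout asked for in the question
  select(id, c, z, message)

print(tidy)
```

Columns that are not selected in `cols` (here `id` and `site`) are kept as identifier columns, so the final `select()` is only needed to drop `site` and fix the column order.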

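【Comments】:

【Solution 2】:

You were on the right track with gather, but a few extra steps are needed to separate the prefix from the rest of the column name. Try the following: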

library(dplyr)
library(tidyr)

df = data.frame(
  id = c(1,2,3),
  c_health = c(3,2,4),
  c_animals = c(2,3,1),
  z_health = c(7,8,9),
  z_animals = c(9,7,6),
  stringsAsFactors = FALSE
)

output = df %>%
  # gather on all columns other than id
  gather(key = "question", value = "response", -all_of("id")) %>%
  # split off prefix and rest of column name
  mutate(prefix = substr(question,1,1),
         desc = substr(question,3,nchar(question))) %>%
  # keep just the columns of interest
  select(id, prefix, desc, response) %>%
  # reshape wider
  spread(prefix, response)

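Update: my comment about prefixes of different lengths did not return the correct answer, because [] indexing does not work that way inside mutate: strsplit returns a list with one element per column name, so [1] picks the first list element rather than the first piece of each name. The same idea with the correct syntax is as follows: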

output = df %>%
  # gather on all columns other than id
  gather(key = "question", value = "response", -all_of("id")) %>%
  # split off prefix and rest of column name
  mutate(split = strsplit(question, "_")) %>%
  mutate(prefix = sapply(split, function(x){x[1]}),
         desc = sapply(split, function(x){x[2]})) %>%
  # keep just the columns of interest
  select(id, prefix, desc, response) %>%
  # reshape wider
  spread(prefix, response)
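
As a possible shortcut for the split step above, tidyr's `separate()` can split the `question` column into two columns in one call. This is a sketch rather than part of the original answer. It uses the variable-length prefixes from the comments (`convince_`, `you_`) as hypothetical column names and assumes each name contains exactly one underscore.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with prefixes of different lengths
df <- data.frame(
  id               = c(1, 2, 3),
  convince_health  = c(3, 2, 4),
  convince_animals = c(2, 3, 1),
  you_health       = c(7, 8, 9),
  you_animals      = c(9, 7, 6)
)

output <- df %>%
  # gather on all columns other than id
  gather(key = "question", value = "response", -id) %>%
  # split "convince_health" into prefix = "convince", desc = "health"
  separate(question, into = c("prefix", "desc"), sep = "_") %>%
  # reshape wider: one column per prefix
  spread(prefix, response)

print(output)
```

Because the split is done on the underscore rather than by character position, the length of the prefix does not matter.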

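【Comments】:

How could this be modified to handle prefixes of different lengths? For example, convince_health versus you_health.
Assuming each column name contains exactly one underscore _, you can use something like prefix = strsplit(question, "_")[1] to split the string and get the text before the underscore, and desc = strsplit(question, "_")[2] to get the text after the underscore.
Also, @Kelsey's answer may do all of this in one go. pivot_longer (and its partner pivot_wider) are maturing functions that will replace gather and spread (which are being retired).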
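
To illustrate why indexing the result of strsplit with [1] does not give one prefix per column name, here is a small base R sketch. The two column names are the hypothetical ones from the comment above.

```r
questions <- c("convince_health", "you_health")

# strsplit returns a list with one character vector per input string
parts <- strsplit(questions, "_")

# [1] selects the first list element (still a list of length 1),
# not the first piece of every string
first_try <- parts[1]

# Extract the first and second piece of each element instead
prefix <- vapply(parts, function(x) x[1], character(1))
desc   <- vapply(parts, function(x) x[2], character(1))

print(prefix)
print(desc)
```

Here `prefix` holds one value per column name, which is what a mutate call needs.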