Reshaping the HRS data from wide to long format and creating a time variable: answers

【Question title】: Reshaping the HRS data from wide to long format and creating a time variable
【Posted】: 2019-08-13 09:59:59
【Problem description】:

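I have the following dataset (with about 25 variables besides weight = W and height = H), each measured over a span of 10 years.

It currently has the following form, with no time index.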

df <- structure(list(data = structure(1:4, .Label = c("Ind_1", "Ind_2", 
"Ind_3", "Ind_4"), class = "factor"), r1weight = c(56, 76, 87, 64
), r2weight = c(57, 75, 88, 66), r3weight = c(56, 76, 87, 65), r4weight = c(56L, 
73L, 85L, 63L), r5weight = c(55L, 77L, 84L, 65L), r1height = c(151L, 163L, 
173L, 153L), r2height = c(154L, 164L, NA, 154L), r3height = c(NA, 165L, NA, 
152L), r4height = c(153L, 162L, 172L, 154L), r5height = c(152,161,171,154)), class = 
"data.frame", row.names = c(NA, 
 -4L)) 

  data  r1w r2w r3w r4w r5w r1h r2h r3h r4h r5h
1 Ind_1  56  57  56  56  55 151 154  NA 153 152
2 Ind_2  76  75  76  73  77 163 164 165 162 161
3 Ind_3  87  88  87  85  84 173  NA  NA 172 171
4 Ind_4  64  66  65  63  65 153 154 152 154 154

I need to add a time variable and reshape to long format, and I hope to end up with something like this.

# Desired result: one row per individual and wave.
# (A factor cannot have duplicated levels, so time is built with rep().)
dflong <- data.frame(
  time = factor(rep(1:5, times = 4)),
  Ind  = rep(1:4, each = 5),
  W    = c(56,57,56,56,55,76,75,76,73,77,87,88,87,85,84,64,66,65,63,65),
  H    = c(151,154,NA,153,152,163,164,165,162,161,173,NA,NA,172,171,153,154,152,154,154))

which looks like this:

   time Ind  W   H
1     1   1 56 151
2     2   1 57 154
3     3   1 56  NA
4     4   1 56 153
5     5   1 55 152
6     1   2 76 163
7     2   2 75 164
8     3   2 76 165
9     4   2 73 162
10    5   2 77 161
11    1   3 87 173
12    2   3 88  NA
13    3   3 87  NA
14    4   3 85 172
15    5   3 84 171
16    1   4 64 153
17    2   4 66 154
18    3   4 65 152
19    4   4 63 154
20    5   4 65 154

I tried using melt from reshape2, and so far I have this (not working yet):

library(reshape2)
dflong <- melt(df,id.vars = c("idhhpn",r1w-r10w, r1h-r10h (help writing compactly),
     time(needs help constructing) )

I don't want to write "r1w, r2w, r3w" but rather something like r1weight-r10weight, so that I don't have to write out all 10 time instances for all 25 variables.

So far I have gotten to this point, using the following code:

melt <- melt(setDT(HRSdata), measure = patterns("idhhpn", "srhlt", "highbp", "diabetes", "cancer", "lungev", "heartp", "strokev", "psychev", "arth", "obese", "agey", "marpart", "male", "black", "hispan", "logass", "logdebt", "atotal", "debt", "lths", "hsorged", "somehs", "scorAA", "bachelor", "graduate", "works62", "works65", "momagey", "dadagey", "dadalive", "momalive", "vigact3", "smokesn"), 
     value.name = c("idhhpn", "srhlt", "highbp", "diabetes", "cancer", "lungev", "heartp", "strokev", "psychev", "arth", "obese", "agey", "marpart", "male", "black", "hispan", "logass", "logdebt", "atotal", "debt", "lths", "hsorged", "somehs", "scorAA", "bachelor", "graduate", "works62", "works65", "momagey", "dadagey", "dadalive", "momalive", "vigact3", "smokesn"), 
     variable.name = "time")[, 
      idhhpn := as.integer(sub("\\D+", "", HRSdata))][order(idhhpn)][, .(time, idhhpn, srhlt, highbp, diabetes, cancer, lungev, heartp, strokev, psychev, arth, obese, agey, marpart, male, black, hispan, logass, logdebt, atotal, debt, lths, hsorged, somehs, scorAA, bachelor, graduate, works62, works65, momagey, dadagey, dadalive, momalive, vigact3, smokesn        )]

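【Question discussion】:

Does it have to be reshape2? I can easily come up with a solution, but it would use tidyr's gather.
You could try keeping only idhhpn and the variables to be melted in the measure columns.

Tags: r reshape2 panel-data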

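【Solution 1】:

A tidyverse approach using gather and spread would be the following. As written, separate expects column names of the form W_1, H_1 and splits them at the underscore: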

library(tidyverse)

df %>%
  gather(time, ind, -data) %>%
  separate(time, into = c("indName", "time")) %>%
  spread(indName, ind)


#    data time  H  W
#1  Ind_1   1 151 56
#2  Ind_1   2 154 57
#3  Ind_1   3  NA 56
#4  Ind_1   4 153 56
#5  Ind_1   5 152 55
#6  Ind_2   1 163 76
#7  Ind_2   2 164 75
#8  Ind_2   3 165 76
#9  Ind_2   4 162 73
#10 Ind_2   5 161 77
#11 Ind_3   1 173 87
#12 Ind_3   2  NA 88
#13 Ind_3   3  NA 87
#14 Ind_3   4 172 85
#15 Ind_3   5 171 84
#16 Ind_4   1 153 64
#17 Ind_4   2 154 66
#18 Ind_4   3 152 65
#19 Ind_4   4 154 63
#20 Ind_4   5 154 65

The same solution, modified for variable names of the form "r[num][varname]" (@iod):

df %>%
  gather(time, ind, -data) %>%
  mutate(time = gsub("^r([0-9]+)", "\\1_", time)) %>%  # "r10weight" -> "10_weight"; [0-9]+ keeps two-digit waves intact
  separate(time, into = c("time","indName")) %>%
  spread(indName, ind)

    data time height weight
1  Ind_1    1    151     56
2  Ind_1    2    154     57
3  Ind_1    3     NA     56
4  Ind_1    4    153     56
5  Ind_1    5    152     55
6  Ind_2    1    163     76
7  Ind_2    2    164     75
8  Ind_2    3    165     76
9  Ind_2    4    162     73
10 Ind_2    5    161     77
11 Ind_3    1    173     87
12 Ind_3    2     NA     88
13 Ind_3    3     NA     87
14 Ind_3    4    172     85
15 Ind_3    5    171     84
16 Ind_4    1    153     64
17 Ind_4    2    154     66
18 Ind_4    3    152     65
19 Ind_4    4    154     63
20 Ind_4    5    154     65
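
As a possible alternative to the gather/separate/spread chain above, newer versions of tidyr offer pivot_longer, which can parse the wave number and the variable stem out of names like r1weight in one call. This is a minimal sketch and not part of the original answer. It assumes tidyr 1.1.0 or later (for names_transform) and uses a cut-down toy data frame in which the r10 values are hypothetical, included only to show that two-digit waves are handled.

```r
library(tidyr)

# Toy data: two individuals, waves 1, 2 and 10 (the r10 values are made up).
df <- data.frame(
  data      = c("Ind_1", "Ind_2"),
  r1weight  = c(56, 76),   r2weight  = c(57, 75),  r10weight = c(55, 77),
  r1height  = c(151, 163), r2height  = c(154, NA), r10height = c(152, 161)
)

long <- pivot_longer(
  df,
  cols            = -data,
  # First capture group -> "time"; second -> ".value", meaning that each
  # distinct stem (weight, height) becomes its own output column.
  names_to        = c("time", ".value"),
  names_pattern   = "^r([0-9]+)(.+)$",
  names_transform = list(time = as.integer)
)

print(long)
```

Because the pattern is applied to every column except data, the 25 variable stems do not have to be listed by hand, provided they all follow the r[num][varname] convention.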

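【Discussion】:

Suppose I have the variables "W", "H", "bmi", "pulse", "age". Would that then be: library(tidyverse) df %>% gather(time, ind, -data) %>% separate(time, into = c("W", "H", "bmi", "pulse", "age")) %>% spread(W, ind)? And how does R know that my variables r1age, r2age, ..., r10age belong together?
This solution assumes that all variables have the format WORD_[num], e.g. "W_2" or "bmi_4", and it works for all of those. It will not work for something like "r1age". This is what separate does: it splits the variable names (which have been gathered into the key column) into two parts at the underscore (a period would work too). There is no need to list all the variables, as long as they all have this format.
I edited the solution to replace the confusing "W" helper column with "indName".
Try running each segment on its own to see how this solution works step by step: gather creates a long, melted format with all the H and W values in a single column. separate creates two columns, one holding H or W and the other holding the time indicator. spread then takes the H/W column and casts it back out using the "ind" column, creating a separate column for each variable.
Thanks for the edit, @iod. If your column names have no underscore or space, you can supply a custom separator through the sep argument of separate. It depends on your column names.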
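
The step-by-step description in the comments above can be tried out as follows. This is an illustrative sketch rather than the answerer's exact code: it stores each intermediate result under a hypothetical name (step1, step2, step3) and swaps separate for tidyr::extract, which splits names such as r1weight with a regular expression, so no underscore is required. The toy data frame holds the first two individuals and three waves of the example.

```r
library(tidyr)

df <- data.frame(
  data     = c("Ind_1", "Ind_2"),
  r1weight = c(56, 76),   r2weight = c(57, 75),   r3weight = c(56, 76),
  r1height = c(151, 163), r2height = c(154, 164), r3height = c(NA, 165)
)

# Step 1: gather stacks every measurement column into name/value pairs.
step1 <- gather(df, key = "name", value = "value", -data)

# Step 2: split "r1weight" into time = 1 and indName = "weight".
step2 <- tidyr::extract(step1, name, into = c("time", "indName"),
                        regex = "^r([0-9]+)(.+)$", convert = TRUE)

# Step 3: spread gives each variable (height, weight) its own column again.
step3 <- spread(step2, key = indName, value = value)

print(step1)
print(step2)
print(step3)
```

Printing the three objects in turn shows the table going from 2 rows (wide) to 12 rows (fully melted) and back to 6 rows, one per individual and wave.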

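【Solution 2】:

An option with data.table is to use melt together with measure/patterns. In the example, the column names share the common patterns 'weight' and 'height'. We specify these in the measure argument to convert the data to 'long' format, then extract the numeric part of data with sub to create 'Ind'.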

library(data.table)
melt(setDT(df), measure = patterns("weight", "height"), value.name = c("W", "H"), 
  variable.name = "time")[, 
    Ind := as.integer(sub("\\D+", "", data))][order(Ind)][, .(time, Ind, W, H)]
#   time Ind  W   H
# 1:    1   1 56 151
# 2:    2   1 57 154
# 3:    3   1 56  NA
# 4:    4   1 56 153
# 5:    5   1 55 152
# 6:    1   2 76 163
# 7:    2   2 75 164
# 8:    3   2 76 165
# 9:    4   2 73 162
#10:    5   2 77 161
#11:    1   3 87 173
#12:    2   3 88  NA
#13:    3   3 87  NA
#14:    4   3 85 172
#15:    5   3 84 171
#16:    1   4 64 153
#17:    2   4 66 154
#18:    3   4 65 152
#19:    4   4 63 154
#20:    5   4 65 154

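【Discussion】:

What does "setDT" mean? What do I write here? Where you write "ind", do I have to write my id variable, i.e. "idhhpn"? Or do I just keep your code as it is? I'm sorry, I don't understand what I need to change in your code and what to keep, but I really need some help.
@EmilKrabbe setDT is a function that converts a "data.frame" to a data.table so that it can use the data.table methods. Everything else should be the same for objects converted to data.table. It is not needed if the object is already a data.table. Regarding Ind: yes, you have to change it to the column name in your actual dataset. Let me know if you run into any problems.
I wish I could send you a picture of my code, but I don't think I can here. I wrote all of my variables into your code, so where you wrote "W" and "H" I wrote all 24 of mine, but unfortunately it does not seem to work... Hmm, is there anything I can tell you?
@EmilKrabbe Suppose you have 12 weight and 12 height columns whose names match the pattern "W" or "H". The result is still 2 columns, because they are converted from wide to long format. Can you check whether that is the case?
3: In [.data.table(melt(setDT(HRSdata), measure = patterns("idhhpn", : Supplied 331 items to be assigned to 250890 items of column 'idhhpn' (recycled leaving remainder of 323 items). 4: In [.data.table(melt(setDT(HRSdata), measure = patterns("idhhpn", : Coerced integer RHS to double to match the type of the target column (column 2 named 'idhhpn'). If the target column's type double is correct, it's best to avoid the coercion and create the RHS as type double.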
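
The following sketch shows one way the melt call might be adapted when the identifier is an existing column such as idhhpn. It is an assumption-laden illustration, not the answerer's code: HRS_toy and its values are hypothetical, and only two stems and two waves are shown. The identifier goes into id.vars instead of patterns(), so there is nothing to extract with sub afterwards. The first warning quoted above is consistent with sub having been applied to the whole HRSdata table (one item per column) instead of to a single column, though that cannot be confirmed from the thread.

```r
library(data.table)

# Hypothetical stand-in for the real HRS extract.
HRS_toy <- data.table(
  idhhpn   = c(101, 102),
  r1weight = c(56, 76),   r2weight = c(57, 75),
  r1height = c(151, 163), r2height = c(154, 164)
)

long <- melt(
  HRS_toy,
  id.vars       = "idhhpn",                         # identifier stays as-is
  measure.vars  = patterns("^r[0-9]+weight$",       # one pattern per variable
                           "^r[0-9]+height$"),
  value.name    = c("weight", "height"),            # same order as the patterns
  variable.name = "time"
)

# With several patterns, time is the position of each column within its
# matched group (1, 2, ...), so the wide columns should be in wave order.
setorder(long, idhhpn, time)
print(long)
```

For the full dataset, each of the stems would get its own pattern and a matching entry in value.name, in the same order.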

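【Solution 3】:

You can use the melt function from data.table and then cbind. The patterns below select the r[num]height and r[num]weight columns of the example data:

setDT(df)
long <- melt(df, id.vars = "data")  # one row per individual and wide column
# Put heights and weights side by side; both subsets share the same row order.
df <- cbind(long[grep("height$", variable), .(data, H = value)],
            long[grep("weight$", variable), .(W = value)])
df <- df[, Ind := gsub(".*_", "", data)]  # strip the "Ind_" prefix
df <- df[, time := 1:.N, by = .(Ind)]     # wave index, relying on row order
df <- df[, .(time, W, H, Ind)]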

Output:

> df
    time  W   H Ind
 1:    1 56 151   1
 2:    1 76 163   2
 3:    1 87 173   3
 4:    1 64 153   4
 5:    2 57 154   1
 6:    2 75 164   2
 7:    2 88  NA   3
 8:    2 66 154   4
 9:    3 56  NA   1
10:    3 76 165   2
11:    3 87  NA   3
12:    3 65 152   4
13:    4 56 153   1
14:    4 73 162   2
15:    4 85 172   3
16:    4 63 154   4
17:    5 55 152   1
18:    5 77 161   2
19:    5 84 171   3
20:    5 65 154   4
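
The code above numbers the waves by row position within each individual. If the column order cannot be relied on, a variant that reads the wave number from the column name itself may be safer. The sketch below is one way to do that with melt and dcast from data.table. It is not the answerer's method, the helper column names (name, stem) are hypothetical, and the toy data covers only two individuals and two waves.

```r
library(data.table)

df <- data.table(
  data     = c("Ind_1", "Ind_2"),
  r1weight = c(56, 76),   r2weight = c(57, 75),
  r1height = c(151, 163), r2height = c(154, 164)
)

# Fully melt: one row per individual and wide column, names kept as character.
long <- melt(df, id.vars = "data", variable.name = "name",
             variable.factor = FALSE)

# Parse "r2weight" into time = 2 and stem = "weight".
long[, time := as.integer(sub("^r([0-9]+).*$", "\\1", name))]
long[, stem := sub("^r[0-9]+", "", name)]

# Cast back so that each stem becomes a column, one row per individual and wave.
wide <- dcast(long, data + time ~ stem, value.var = "value")
print(wide)
```

Since time comes from the digits in the name, r10weight ends up in wave 10 regardless of where the column sits in the table.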

【Discussion】: