Need to Convert from Wide to Long: Answers

【Question title】: Need to Convert from Wide to Long
【Posted】: 2019-07-17 00:36:25
【Question】:

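Hi, I have a dataset with a unique Id variable in column A, followed by the subsequent renal scans for each patient. It is a csv file, and I would like to reshape it to long format, using R if possible. Each participant can have between 1 and 17 renal scans.

Some IDs are also listed as "No" because no scans were received. I would like the data reshaped into something like this.

I am aware of earlier questions about this kind of reshaping where the data were organised by year. In my case, participants have scans that occur multiple times, with dates in yyyy-mm-dd format.

Please see the data below.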

structure(list(id = c(1010001, 1010002, 1010004, 1010005, 1010006, 
1010007), `GFR Scans?` = c("Yes", "Yes", "Yes", "Yes", "Yes", 
"No"), `1. Date of renal scan:` = structure(c(1133913600, 1196812800, 
1237334400, 1124150400, 1192060800, NA), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), `1. Type of renal scan:` = c("DTPA", 
"DTPA", "DTPA", "DTPA", "DTPA", NA), `1. GFR mL/1.73 sq.m` = c(18, 
13, 68, 117, 46, NA), `1. Pre/Post tx?` = c("Pre", "Pre", "Post", 
"Post", "Pre", NA), `2. Date of renal scan:` = structure(c(1146528000, 
1214524800, NA, 1151366400, 1245974400, NA), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), `2. Type of renal scan:` = c("DTPA", 
"DTPA", NA, "DTPA", "DTPA", NA), `2. GFR mL/1.73 sq.m` = c(86, 
110, NA, 148, 123, NA), `2. Pre/Post tx?` = c("Post", "Post", 
NA, "Post", "Post", NA), `3. Date of renal scan:` = structure(c(NA, 
1219104000, NA, 1184025600, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `3. Type of renal scan:` = c(NA, "DTPA", NA, 
"DTPA", NA, NA), `3. GFR mL/1.73 sq.m` = c(NA, 92, NA, 166, NA, 
NA), `3. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `4. Date of    renal scan:` = structure(c(NA, 
1242691200, NA, 1213660800, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `4. Type of renal scan:` = c(NA, "DTPA", NA, 
"DTPA", NA, NA), `4. GFR mL/1.73 sq.m` = c(NA, 36, NA, 171, NA, 
NA), `4. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `5. Date of    renal scan:` = structure(c(NA, 
NA, NA, 1288656000, NA, NA), class = c("POSIXct", "POSIXt"), tzone =  "UTC"), 
    `5. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA), 
    `5. GFR mL/1.73 sq.m` = c(NA, NA, NA, 105, NA, NA), `5. Pre/Post  tx?` = c(NA, 
    NA, NA, "Post", NA, NA), `6. Date of renal scan:` = structure(c(NA, 
    NA, NA, 1323129600, NA, NA), class = c("POSIXct", "POSIXt"
    ), tzone = "UTC"), `6. Type of renal scan:` = c(NA, NA, NA, 
    "DTPA", NA, NA), `6. GFR mL/1.73 sq.m` = c(NA, NA, NA, 103, 
    NA, NA), `6. Pre/Post tx?` = c(NA, NA, NA, "Post", NA, NA
    ), `7. Date of renal scan:` = structure(c(NA, NA, NA, 1355184000, 
    NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    `7. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA), 
    `7. GFR mL/1.73 sq.m` = c(NA, NA, NA, 98, NA, NA), `7. Pre/Post tx?` = c(NA, 
    NA, NA, "Post", NA, NA), `8. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `8. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `8. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `8. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `9. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `9. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `9. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `9. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `10. Date   of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `10. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `10. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `10. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), 
    `11. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `11. Type of  renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `11. GFR mL/1.73 sq.m` = c(NA, NA, NA, 
    NA, NA, NA), `11. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
    ), `12. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), 
    `12. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `12. GFR mL/1.73 sq.m` = c(NA, 
    NA, NA, NA, NA, NA), `12. Pre/Post tx?` = c(NA, NA, NA, NA, 
    NA, NA), `13. Date of renal scan:` = c(NA, NA, NA, NA, NA, 
    NA), `13. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA
    ), `13. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA, NA), `13. Pre/Post tx?` = c(NA, 
    NA, NA, NA, NA, NA), `14. Date of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `14. Type of renal scan:` = c(NA, NA, NA, 
    NA, NA, NA), `14. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA, 
    NA), `14. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `15. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `15. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `15. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `15. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), 
    `16. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `16. Type of  renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `16. GFR mL/1.73 sq.m` = c(NA, NA, NA, 
    NA, NA, NA), `16. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
    ), `17. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), 
    `17. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `17. GFR mL/1.73 sq.m` = c(NA, 
    NA, NA, NA, NA, NA), `17. Pre/Post tx?` = c(NA, NA, NA, NA, 
    NA, NA)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

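The first image is the original wide format; the second image is what I would like to get. None of the other wide-to-long answers have helped me, because several columns are involved for each scan.

For example, id 1010001 has had two scans, and I need them listed one below the other rather than side by side (see image two).

Thank you very much for your help.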

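【Comments】:

So the idea is to sort the table by ID and move the second, third, etc. groups of columns underneath the first group?
Yes, group by ID and then list the subsequent scans underneath rather than side by side. Some IDs have up to 17 scans (the columns off to the side).
There are also some IDs that did not receive any scans, listed as "No". These need to be listed too; they have just one row because there are no follow-up columns linked to them.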

Tags: python r excel dataframe reshape

【Solution 1】:

Here is a working solution. It is not the best one, but it works. The strategy is to go from wide to long and then to a tidy format.

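When converting from the original wide format to long, all columns are coerced to the lowest common type, which is character in this case, so the columns need to be converted back at the end.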
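To illustrate why that final conversion is needed: the answer's own conversion step calls `as.numeric()` on the date column, which suggests the dates end up as text holding their epoch seconds. A minimal base R sketch of that round trip, using one timestamp from the question's data (the variable names here are illustrative, not part of the answer):

```r
# One scan date from the question's data, as POSIXct in UTC
scan_date <- as.POSIXct(1133913600, origin = "1970-01-01", tz = "UTC")

# Simulate a mixed-type value column after coercion:
# the class attribute is lost and the seconds are stored as text
as_text <- as.character(as.numeric(scan_date))

# Rebuild the date-time: text -> numeric seconds -> POSIXct
restored <- as.POSIXct(as.numeric(as_text), origin = "1970-01-01", tz = "UTC")

format(restored, "%Y-%m-%d")
```

The same pattern applies to the GFR column, which only needs `as.numeric()`.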

To remove the rows containing NA I use complete.cases, so your last id, 1010007, is not in the final output. If that is a problem, you should adjust where the NA cleanup step happens.

library(tidyr)
library(dplyr)

#convert from wide to long
new<-gather(df,key = "key", value = "value", -id, -`GFR Scans?`)
#clean up the key column
new$key<-sub("[0-9]+\\. ", "", new$key)
new$key<-gsub("[ ]+", " ", new$key)

# verify column headings (should only be 4)
unique(new$key)
#remove the rows with NA
new<-new[complete.cases(new),]

#now go from long to slightly wide
answer<-new %>% group_by( id, `GFR Scans?`, key) %>% mutate(testnum=row_number()) %>% spread(key, value)  

#convert the columns back to the proper type
answer$`Date of renal scan:`<-as.POSIXct(as.numeric(answer$`Date of renal scan:`), origin="1970-01-01", tz="UTC")
answer$`GFR mL/1.73 sq.m`<-as.numeric(answer$`GFR mL/1.73 sq.m`)
answer

# id `GFR Scans?` testnum `Date of renal scan:` `GFR mL/1.73 sq.m` `Pre/Post tx?` `Type of renal scan:`
#     <dbl> <chr>          <int> <dttm>                             <dbl> <chr>          <chr>                
# 1 1010001 Yes                1 2005-12-07 00:00:00                   18 Pre            DTPA                 
# 2 1010001 Yes                2 2006-05-02 00:00:00                   86 Post           DTPA                 
# 3 1010002 Yes                1 2007-12-05 00:00:00                   13 Pre            DTPA                 
# 4 1010002 Yes                2 2008-06-27 00:00:00                  110 Post           DTPA                 
# 5 1010002 Yes                3 2008-08-19 00:00:00                   92 Post           DTPA                 
# 6 1010002 Yes                4 2009-05-19 00:00:00                   36 Post           DTPA                 
# 7 1010004 Yes                1 2009-03-18 00:00:00                   68 Post           DTPA                 
# 8 1010005 Yes                1 2005-08-16 00:00:00                  117 Post           DTPA
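
As a possible alternative that was not part of this answer, newer versions of tidyr (1.0.0 and later) provide `pivot_longer()`, whose `.value` sentinel reshapes several sets of columns in one step and keeps each column's type, so no conversion back from character should be needed. The sketch below is an assumption-laden illustration: `df_small` is a hypothetical two-slot stand-in for the question's data, and `testnum` mirrors the column name used above.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the question's data: two scan slots, not 17
df_small <- tibble(
  id = c(1010001, 1010004, 1010007),
  `GFR Scans?` = c("Yes", "Yes", "No"),
  `1. Date of renal scan:` = as.POSIXct(c(1133913600, 1237334400, NA),
                                        origin = "1970-01-01", tz = "UTC"),
  `1. GFR mL/1.73 sq.m` = c(18, 68, NA),
  `2. Date of    renal scan:` = as.POSIXct(c(1146528000, NA, NA),
                                           origin = "1970-01-01", tz = "UTC"),
  `2. GFR mL/1.73 sq.m` = c(86, NA, NA)
)

# Collapse runs of whitespace so every slot uses the same column stem
names(df_small) <- gsub("\\s+", " ", names(df_small))

long_small <- df_small %>%
  pivot_longer(
    cols = -c(id, `GFR Scans?`),
    # leading number -> testnum; remainder of the name -> output column
    names_to = c("testnum", ".value"),
    names_pattern = "^([0-9]+)\\. (.*)$",
    # drop scan slots whose values are all missing
    values_drop_na = TRUE
  ) %>%
  mutate(testnum = as.integer(testnum))

long_small
```

As with `complete.cases` above, dropping the empty slots also drops the id that has no scans at all, so the same caveat applies.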

【Discussion】:

【Solution 2】:

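This question has been asked several times, e.g., "Reshaping multiple sets of measurement columns (wide format) into single columns (long format)". One possible approach is to use data.table's melt() function, which can reshape multiple value columns simultaneously.

However, there is an additional difficulty here which, IMHO, justifies an answer of its own: the column names occasionally contain surplus whitespace, which has to be removed beforehand so that the columns follow a consistent naming pattern.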

names(df1)

 [1] "id"                        "GFR Scans?"                "1. Date of renal scan:"    "1. Type of renal scan:"   
 [5] "1. GFR mL/1.73 sq.m"       "1. Pre/Post tx?"           "2. Date of renal scan:"    "2. Type of renal scan:"   
 [9] "2. GFR mL/1.73 sq.m"       "2. Pre/Post tx?"           "3. Date of renal scan:"    "3. Type of renal scan:"   
[13] "3. GFR mL/1.73 sq.m"       "3. Pre/Post tx?"           "4. Date of    renal scan:" "4. Type of renal scan:"   
[17] "4. GFR mL/1.73 sq.m"       "4. Pre/Post tx?"           "5. Date of    renal scan:" "5. Type of renal scan:"   
[21] "5. GFR mL/1.73 sq.m"       "5. Pre/Post  tx?"          "6. Date of renal scan:"    "6. Type of renal scan:"   
[25] "6. GFR mL/1.73 sq.m"       "6. Pre/Post tx?"           "7. Date of renal scan:"    "7. Type of renal scan:"   
[29] "7. GFR mL/1.73 sq.m"       "7. Pre/Post tx?"           "8. Date of renal scan:"    "8. Type of renal scan:"   
[33] "8. GFR mL/1.73 sq.m"       "8. Pre/Post tx?"           "9. Date of renal scan:"    "9. Type of renal scan:"   
[37] "9. GFR mL/1.73 sq.m"       "9. Pre/Post tx?"           "10. Date   of renal scan:" "10. Type of renal scan:"  
[41] "10. GFR mL/1.73 sq.m"      "10. Pre/Post tx?"          "11. Date of renal scan:"   "11. Type of  renal scan:" 
[45] "11. GFR mL/1.73 sq.m"      "11. Pre/Post tx?"          "12. Date of renal scan:"   "12. Type of renal scan:"  
[49] "12. GFR mL/1.73 sq.m"      "12. Pre/Post tx?"          "13. Date of renal scan:"   "13. Type of renal scan:"  
[53] "13. GFR mL/1.73 sq.m"      "13. Pre/Post tx?"          "14. Date of renal scan:"   "14. Type of renal scan:"  
[57] "14. GFR mL/1.73 sq.m"      "14. Pre/Post tx?"          "15. Date of renal scan:"   "15. Type of renal scan:"  
[61] "15. GFR mL/1.73 sq.m"      "15. Pre/Post tx?"          "16. Date of renal scan:"   "16. Type of  renal scan:" 
[65] "16. GFR mL/1.73 sq.m"      "16. Pre/Post tx?"          "17. Date of renal scan:"   "17. Type of renal scan:"
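
To see the effect of the cleanup in isolation, here is a small base R sketch using four of the irregular names from the listing above (the variable names are illustrative). Once runs of whitespace are collapsed and the leading scan number is stripped, the names reduce to a handful of shared stems:

```r
# Four of the irregular column names from the listing above
raw_names <- c("4. Date of    renal scan:", "5. Pre/Post  tx?",
               "10. Date   of renal scan:", "11. Type of  renal scan:")

# Collapse every run of whitespace into a single space
clean_names <- gsub("\\s+", " ", raw_names)

# Strip the leading scan number to get the shared column stems
stems <- unique(sub("^[0-9]+\\. ", "", clean_names))
stems
```

Without the first step, "Date of    renal scan:" and "Date   of renal scan:" would count as different stems and would not be matched by the same pattern.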

library(data.table)
library(magrittr)
# clean up column names: remove surplus whitespace
setDT(df1) %>% setnames(names(.) %>% stringr::str_replace_all("\\s+", " "))
# get name pattern for subsequent melt
cols <- names(df1)[3:6] %>% stringr::str_replace("^1\\. ", "")
# reshape multiple columns from wide to long
long <- melt(df1, measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)[
  # recreate lost POSIXct attribute
  , `Date of renal scan:` := lubridate::as_datetime(`Date of renal scan:`)][]

long

         id GFR Scans? variable Date of renal scan: Type of renal scan: GFR mL/1.73 sq.m Pre/Post tx?
 1: 1010001        Yes        1          2005-12-07                DTPA               18          Pre
 2: 1010002        Yes        1          2007-12-05                DTPA               13          Pre
 3: 1010004        Yes        1          2009-03-18                DTPA               68         Post
 4: 1010005        Yes        1          2005-08-16                DTPA              117         Post
 5: 1010006        Yes        1          2007-10-11                DTPA               46          Pre
 6: 1010001        Yes        2          2006-05-02                DTPA               86         Post
 7: 1010002        Yes        2          2008-06-27                DTPA              110         Post
 8: 1010005        Yes        2          2006-06-27                DTPA              148         Post
 9: 1010006        Yes        2          2009-06-26                DTPA              123         Post
10: 1010002        Yes        3          2008-08-19                DTPA               92         Post
11: 1010005        Yes        3          2007-07-10                DTPA              166         Post
12: 1010002        Yes        4          2009-05-19                DTPA               36         Post
13: 1010005        Yes        4          2008-06-17                DTPA              171         Post
14: 1010005        Yes        5          2010-11-02                DTPA              105         Post
15: 1010005        Yes        6          2011-12-06                DTPA              103         Post
16: 1010005        Yes        7          2012-12-11                DTPA               98         Post

In the call to melt(), we can set the parameter na.rm = FALSE to keep all data:

          id GFR Scans? variable Date of renal scan: Type of renal scan: GFR mL/1.73 sq.m Pre/Post tx?
  1: 1010001        Yes        1          2005-12-07                DTPA               18          Pre
  2: 1010002        Yes        1          2007-12-05                DTPA               13          Pre
  3: 1010004        Yes        1          2009-03-18                DTPA               68         Post
  4: 1010005        Yes        1          2005-08-16                DTPA              117         Post
  5: 1010006        Yes        1          2007-10-11                DTPA               46          Pre
 ---                                                                                                  
 98: 1010002        Yes       17                <NA>                <NA>               NA         <NA>
 99: 1010004        Yes       17                <NA>                <NA>               NA         <NA>
100: 1010005        Yes       17                <NA>                <NA>               NA         <NA>
101: 1010006        Yes       17                <NA>                <NA>               NA         <NA>
102: 1010007         No       17                <NA>                <NA>               NA         <NA>
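
If the goal is a middle ground, keeping one row for ids without any scan while still dropping the empty scan slots, one option is to melt with na.rm = FALSE and filter afterwards. This is only a sketch under stated assumptions: `dt_small` is a hypothetical two-slot stand-in for the data, and the filter rule (keep slot 1 for "No" ids) is one choice among several.

```r
library(data.table)

# Hypothetical stand-in for the question's data: two scan slots, names already cleaned
dt_small <- data.table(
  id = c(1010001, 1010004, 1010007),
  `GFR Scans?` = c("Yes", "Yes", "No"),
  `1. GFR mL/1.73 sq.m` = c(18, 68, NA),
  `1. Pre/Post tx?` = c("Pre", "Post", NA),
  `2. GFR mL/1.73 sq.m` = c(86, NA, NA),
  `2. Pre/Post tx?` = c("Post", NA, NA)
)

# Keep every id/slot combination, including the all-NA ones
long_all <- melt(dt_small,
                 measure.vars = patterns("GFR mL", "Pre/Post"),
                 value.name = c("GFR mL/1.73 sq.m", "Pre/Post tx?"),
                 na.rm = FALSE)

# Keep real scans, plus the first (empty) slot for ids without scans
long_kept <- long_all[!is.na(`GFR mL/1.73 sq.m`) |
                        (`GFR Scans?` == "No" & variable == "1")]
long_kept
```

With this rule, id 1010007 appears exactly once, with NA in the scan columns.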

【Discussion】: