【Question title】: Need to Convert from Wide to Long
【Posted】: 2019-07-17 00:36:25
【Question description】:

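Hi, I have a dataset with a unique id variable in column A, followed by each patient's successive renal scans. It is a csv file and, if possible, I would like to use R to reshape it into long format. Each participant can have between 1 and 17 renal scans.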

Some IDs are also listed as "No" because no scans were received. I would like the data reshaped into something like this

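I am aware of earlier questions about reshaping data that is organised by year. My data is different: each participant has several scans, each with a date in yyyy-mm-dd format.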

Please see the data below

structure(list(id = c(1010001, 1010002, 1010004, 1010005, 1010006, 
1010007), `GFR Scans?` = c("Yes", "Yes", "Yes", "Yes", "Yes", 
"No"), `1. Date of renal scan:` = structure(c(1133913600, 1196812800, 
1237334400, 1124150400, 1192060800, NA), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), `1. Type of renal scan:` = c("DTPA", 
"DTPA", "DTPA", "DTPA", "DTPA", NA), `1. GFR mL/1.73 sq.m` = c(18, 
13, 68, 117, 46, NA), `1. Pre/Post tx?` = c("Pre", "Pre", "Post", 
"Post", "Pre", NA), `2. Date of renal scan:` = structure(c(1146528000, 
1214524800, NA, 1151366400, 1245974400, NA), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), `2. Type of renal scan:` = c("DTPA", 
"DTPA", NA, "DTPA", "DTPA", NA), `2. GFR mL/1.73 sq.m` = c(86, 
110, NA, 148, 123, NA), `2. Pre/Post tx?` = c("Post", "Post", 
NA, "Post", "Post", NA), `3. Date of renal scan:` = structure(c(NA, 
1219104000, NA, 1184025600, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `3. Type of renal scan:` = c(NA, "DTPA", NA, 
"DTPA", NA, NA), `3. GFR mL/1.73 sq.m` = c(NA, 92, NA, 166, NA, 
NA), `3. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `4. Date of    renal scan:` = structure(c(NA, 
1242691200, NA, 1213660800, NA, NA), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `4. Type of renal scan:` = c(NA, "DTPA", NA, 
"DTPA", NA, NA), `4. GFR mL/1.73 sq.m` = c(NA, 36, NA, 171, NA, 
NA), `4. Pre/Post tx?` = c(NA, "Post", NA, "Post", NA, NA), `5. Date of    renal scan:` = structure(c(NA, 
NA, NA, 1288656000, NA, NA), class = c("POSIXct", "POSIXt"), tzone =  "UTC"), 
    `5. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA), 
    `5. GFR mL/1.73 sq.m` = c(NA, NA, NA, 105, NA, NA), `5. Pre/Post  tx?` = c(NA, 
    NA, NA, "Post", NA, NA), `6. Date of renal scan:` = structure(c(NA, 
    NA, NA, 1323129600, NA, NA), class = c("POSIXct", "POSIXt"
    ), tzone = "UTC"), `6. Type of renal scan:` = c(NA, NA, NA, 
    "DTPA", NA, NA), `6. GFR mL/1.73 sq.m` = c(NA, NA, NA, 103, 
    NA, NA), `6. Pre/Post tx?` = c(NA, NA, NA, "Post", NA, NA
    ), `7. Date of renal scan:` = structure(c(NA, NA, NA, 1355184000, 
    NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    `7. Type of renal scan:` = c(NA, NA, NA, "DTPA", NA, NA), 
    `7. GFR mL/1.73 sq.m` = c(NA, NA, NA, 98, NA, NA), `7. Pre/Post tx?` = c(NA, 
    NA, NA, "Post", NA, NA), `8. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `8. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `8. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `8. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `9. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `9. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `9. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `9. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `10. Date   of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `10. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `10. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `10. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), 
    `11. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `11. Type of  renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `11. GFR mL/1.73 sq.m` = c(NA, NA, NA, 
    NA, NA, NA), `11. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
    ), `12. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), 
    `12. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `12. GFR mL/1.73 sq.m` = c(NA, 
    NA, NA, NA, NA, NA), `12. Pre/Post tx?` = c(NA, NA, NA, NA, 
    NA, NA), `13. Date of renal scan:` = c(NA, NA, NA, NA, NA, 
    NA), `13. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA
    ), `13. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA, NA), `13. Pre/Post tx?` = c(NA, 
    NA, NA, NA, NA, NA), `14. Date of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `14. Type of renal scan:` = c(NA, NA, NA, 
    NA, NA, NA), `14. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, NA, 
    NA), `14. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), `15. Date of renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `15. Type of renal scan:` = c(NA, NA, 
    NA, NA, NA, NA), `15. GFR mL/1.73 sq.m` = c(NA, NA, NA, NA, 
    NA, NA), `15. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA), 
    `16. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), `16. Type of  renal scan:` = c(NA, 
    NA, NA, NA, NA, NA), `16. GFR mL/1.73 sq.m` = c(NA, NA, NA, 
    NA, NA, NA), `16. Pre/Post tx?` = c(NA, NA, NA, NA, NA, NA
    ), `17. Date of renal scan:` = c(NA, NA, NA, NA, NA, NA), 
    `17. Type of renal scan:` = c(NA, NA, NA, NA, NA, NA), `17. GFR mL/1.73 sq.m` = c(NA, 
    NA, NA, NA, NA, NA), `17. Pre/Post tx?` = c(NA, NA, NA, NA, 
    NA, NA)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

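The first picture is the original wide format, and the second picture is what I want to get. Because several columns are involved per scan, none of the other wide-to-long answers have helped me.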

For example, id 1010001 has had two scans, and I need them listed one below the other rather than side by side (see picture two).

Thank you very much for your help.

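【Comments】:

  • So the idea is to sort the table by ID and move the second, third, and later groups of columns underneath the first group?
  • Yes, group by ID and then list the subsequent scans underneath instead of side by side. Some IDs have up to 17 scans (the columns off to the side).
  • There are also some IDs that did not receive any scans, listed as No. These need to be listed as well; they have only one row because there are no follow-up columns linked to them.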

Tags: python r excel dataframe reshape


【Solution 1】:

Here is a working solution. It is not the best, but it works. The strategy is to go from wide to long, and then to a tidy format.

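When converting from the original wide format to long, all value columns are coerced to the lowest common type, which in this case is character, so the columns need to be converted back at the end.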
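The coercion described above can be reproduced on a toy table. This is only a minimal sketch: the data frame `toy` and its columns `gfr` and `type` are made up for illustration and are not the poster's data.

```r
library(tidyr)

# Hypothetical two-row table with one numeric and one character column
toy <- data.frame(
  id   = c(1, 2),
  gfr  = c(18, 13),
  type = c("DTPA", "DTPA"),
  stringsAsFactors = FALSE
)

# Stacking both columns into a single value column forces a common type
long <- gather(toy, key = "key", value = "value", -id)
class(long$value)

# The numeric measurements have to be converted back explicitly
gfr <- as.numeric(long$value[long$key == "gfr"])
```

The value column should come out as character, which is why the solution ends with explicit `as.numeric()` and `as.POSIXct()` conversions.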

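To remove the rows containing NA I use complete.cases, so your last id, 1010007, is not in the final output. If that is a problem, you should move the NA clean-up step to a different place.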

library(tidyr)
library(dplyr)

#convert from wide to long
new<-gather(df,key = "key", value = "value", -id, -`GFR Scans?`)
#clean up the key column
new$key<-sub("[0-9]+\\. ", "", new$key)
new$key<-gsub("[ ]+", " ", new$key)

# verify column headings (should only be 4)
unique(new$key)
#remove the rows with NA
new<-new[complete.cases(new),]

#now go from long to slightly wide
answer<-new %>% group_by( id, `GFR Scans?`, key) %>% mutate(testnum=row_number()) %>% spread(key, value)  

#convert the columns back to the proper type
answer$`Date of renal scan:`<-as.POSIXct(as.numeric(answer$`Date of renal scan:`), origin="1970-01-01", tz="UTC")
answer$`GFR mL/1.73 sq.m`<-as.numeric(answer$`GFR mL/1.73 sq.m`)
answer

# id `GFR Scans?` testnum `Date of renal scan:` `GFR mL/1.73 sq.m` `Pre/Post tx?` `Type of renal scan:`
#     <dbl> <chr>          <int> <dttm>                             <dbl> <chr>          <chr>                
# 1 1010001 Yes                1 2005-12-07 00:00:00                   18 Pre            DTPA                 
# 2 1010001 Yes                2 2006-05-02 00:00:00                   86 Post           DTPA                 
# 3 1010002 Yes                1 2007-12-05 00:00:00                   13 Pre            DTPA                 
# 4 1010002 Yes                2 2008-06-27 00:00:00                  110 Post           DTPA                 
# 5 1010002 Yes                3 2008-08-19 00:00:00                   92 Post           DTPA                 
# 6 1010002 Yes                4 2009-05-19 00:00:00                   36 Post           DTPA                 
# 7 1010004 Yes                1 2009-03-18 00:00:00                   68 Post           DTPA                 
# 8 1010005 Yes                1 2005-08-16 00:00:00                  117 Post           DTPA  
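
As an alternative to the gather/spread round trip, newer versions of tidyr (1.0.0 or later) can do the reshape in one step with `pivot_longer()` and the `.value` sentinel, which keeps each column's type. The sketch below is an assumption-laden illustration rather than part of the original answer: the table `wide` is a cut-down, made-up version of the poster's data with only two scans and two measurements.

```r
library(tidyr)

# Hypothetical cut-down version of the wide table (note the stray spaces)
wide <- data.frame(
  id = c(1, 2, 3),
  `GFR Scans?` = c("Yes", "Yes", "No"),
  `1. Date of renal scan:` = as.Date(c("2005-12-07", "2007-12-05", NA)),
  `1. GFR mL/1.73 sq.m` = c(18, 13, NA),
  `2. Date of    renal scan:` = as.Date(c("2006-05-02", NA, NA)),
  `2. GFR mL/1.73 sq.m` = c(86, NA, NA),
  check.names = FALSE
)

# Collapse repeated whitespace so every scan uses the same column names
names(wide) <- gsub("\\s+", " ", names(wide))

# The leading number becomes the scan index; the rest names the output column
long <- pivot_longer(
  wide,
  cols = -c(id, `GFR Scans?`),
  names_to = c("scan", ".value"),
  names_pattern = "^(\\d+)\\. (.*)$",
  values_drop_na = TRUE
)
long
```

With `values_drop_na = TRUE`, ids without any scan drop out, just as with complete.cases above; setting it to `FALSE` keeps them at the cost of one row per id and scan number.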

【Comments】:

    【Solution 2】:

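    This question has been asked several times, e.g. Reshaping multiple sets of measurement columns (wide format) into single columns (long format). One possible approach is to use the melt() function from data.table, which can reshape multiple value columns simultaneously.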

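    However, IMHO there is an additional difficulty here which justifies an answer of its own. The column names occasionally contain surplus whitespace, which needs to be removed beforehand so that the columns follow a consistent naming pattern.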

    names(df1)
    
     [1] "id"                        "GFR Scans?"                "1. Date of renal scan:"    "1. Type of renal scan:"   
     [5] "1. GFR mL/1.73 sq.m"       "1. Pre/Post tx?"           "2. Date of renal scan:"    "2. Type of renal scan:"   
     [9] "2. GFR mL/1.73 sq.m"       "2. Pre/Post tx?"           "3. Date of renal scan:"    "3. Type of renal scan:"   
    [13] "3. GFR mL/1.73 sq.m"       "3. Pre/Post tx?"           "4. Date of    renal scan:" "4. Type of renal scan:"   
    [17] "4. GFR mL/1.73 sq.m"       "4. Pre/Post tx?"           "5. Date of    renal scan:" "5. Type of renal scan:"   
    [21] "5. GFR mL/1.73 sq.m"       "5. Pre/Post  tx?"          "6. Date of renal scan:"    "6. Type of renal scan:"   
    [25] "6. GFR mL/1.73 sq.m"       "6. Pre/Post tx?"           "7. Date of renal scan:"    "7. Type of renal scan:"   
    [29] "7. GFR mL/1.73 sq.m"       "7. Pre/Post tx?"           "8. Date of renal scan:"    "8. Type of renal scan:"   
    [33] "8. GFR mL/1.73 sq.m"       "8. Pre/Post tx?"           "9. Date of renal scan:"    "9. Type of renal scan:"   
    [37] "9. GFR mL/1.73 sq.m"       "9. Pre/Post tx?"           "10. Date   of renal scan:" "10. Type of renal scan:"  
    [41] "10. GFR mL/1.73 sq.m"      "10. Pre/Post tx?"          "11. Date of renal scan:"   "11. Type of  renal scan:" 
    [45] "11. GFR mL/1.73 sq.m"      "11. Pre/Post tx?"          "12. Date of renal scan:"   "12. Type of renal scan:"  
    [49] "12. GFR mL/1.73 sq.m"      "12. Pre/Post tx?"          "13. Date of renal scan:"   "13. Type of renal scan:"  
    [53] "13. GFR mL/1.73 sq.m"      "13. Pre/Post tx?"          "14. Date of renal scan:"   "14. Type of renal scan:"  
    [57] "14. GFR mL/1.73 sq.m"      "14. Pre/Post tx?"          "15. Date of renal scan:"   "15. Type of renal scan:"  
    [61] "15. GFR mL/1.73 sq.m"      "15. Pre/Post tx?"          "16. Date of renal scan:"   "16. Type of  renal scan:" 
    [65] "16. GFR mL/1.73 sq.m"      "16. Pre/Post tx?"          "17. Date of renal scan:"   "17. Type of renal scan:"
    
    library(data.table)
    library(magrittr)
    # clean up column names: remove surplus whitespace
    setDT(df1) %>% setnames(names(.) %>% stringr::str_replace_all("\\s+", " "))
    # get name pattern for subsequent melt
    cols <- names(df1)[3:6] %>% stringr::str_replace("^1\\. ", "")
    # reshape multiple columns from wide to long
    long <- melt(df1, measure.vars = patterns(cols), value.name = cols, na.rm = TRUE)[
      # recreate lost POSIXct attribute
      , `Date of renal scan:` := lubridate::as_datetime(`Date of renal scan:`)][]
    
    long
    
             id GFR Scans? variable Date of renal scan: Type of renal scan: GFR mL/1.73 sq.m Pre/Post tx?
     1: 1010001        Yes        1          2005-12-07                DTPA               18          Pre
     2: 1010002        Yes        1          2007-12-05                DTPA               13          Pre
     3: 1010004        Yes        1          2009-03-18                DTPA               68         Post
     4: 1010005        Yes        1          2005-08-16                DTPA              117         Post
     5: 1010006        Yes        1          2007-10-11                DTPA               46          Pre
     6: 1010001        Yes        2          2006-05-02                DTPA               86         Post
     7: 1010002        Yes        2          2008-06-27                DTPA              110         Post
     8: 1010005        Yes        2          2006-06-27                DTPA              148         Post
     9: 1010006        Yes        2          2009-06-26                DTPA              123         Post
    10: 1010002        Yes        3          2008-08-19                DTPA               92         Post
    11: 1010005        Yes        3          2007-07-10                DTPA              166         Post
    12: 1010002        Yes        4          2009-05-19                DTPA               36         Post
    13: 1010005        Yes        4          2008-06-17                DTPA              171         Post
    14: 1010005        Yes        5          2010-11-02                DTPA              105         Post
    15: 1010005        Yes        6          2011-12-06                DTPA              103         Post
    16: 1010005        Yes        7          2012-12-11                DTPA               98         Post
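
The result above is ordered by scan number, whereas the question asks for each id's scans to be listed one after another. A minimal sketch of one way to do this with data.table's `setorder()` follows; the table `dt` and its shortened column names (`1. Date`, `1. GFR`, ...) are made up for illustration.

```r
library(data.table)

# Hypothetical cut-down table with two scans per id
dt <- data.table(
  id        = c(1, 2),
  `1. Date` = c("2005-12-07", "2007-12-05"),
  `1. GFR`  = c(18, 13),
  `2. Date` = c("2006-05-02", NA),
  `2. GFR`  = c(86, NA)
)

# Reshape both measurement groups at once, dropping all-NA scans
long <- melt(dt, measure.vars = patterns("Date", "GFR"),
             value.name = c("date", "gfr"), na.rm = TRUE)

# Reorder in place so that the scans of each id are listed together
setorder(long, id, variable)
long
```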
    

    In the call to melt(), we can set the parameter na.rm = FALSE to keep all data:

              id GFR Scans? variable Date of renal scan: Type of renal scan: GFR mL/1.73 sq.m Pre/Post tx?
      1: 1010001        Yes        1          2005-12-07                DTPA               18          Pre
      2: 1010002        Yes        1          2007-12-05                DTPA               13          Pre
      3: 1010004        Yes        1          2009-03-18                DTPA               68         Post
      4: 1010005        Yes        1          2005-08-16                DTPA              117         Post
      5: 1010006        Yes        1          2007-10-11                DTPA               46          Pre
     ---                                                                                                  
     98: 1010002        Yes       17                <NA>                <NA>               NA         <NA>
     99: 1010004        Yes       17                <NA>                <NA>               NA         <NA>
    100: 1010005        Yes       17                <NA>                <NA>               NA         <NA>
    101: 1010006        Yes       17                <NA>                <NA>               NA         <NA>
    102: 1010007         No       17                <NA>                <NA>               NA         <NA>
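
Keeping every row produces many all-NA rows, while the question only asks for a single row per id without scans. One possible middle ground, sketched here under the assumption that a missing GFR value marks a missing scan, is to melt with `na.rm = FALSE` and then filter. The table `dt` and its shortened column names are hypothetical.

```r
library(data.table)

# Hypothetical cut-down table: id 3 has no scans at all
dt <- data.table(
  id           = c(1, 2, 3),
  `GFR Scans?` = c("Yes", "Yes", "No"),
  `1. Date`    = c("2005-12-07", "2007-12-05", NA),
  `1. GFR`     = c(18, 13, NA),
  `2. Date`    = c("2006-05-02", NA, NA),
  `2. GFR`     = c(86, NA, NA)
)

# Keep all rows first, including the all-NA ones
long <- melt(dt, measure.vars = patterns("Date", "GFR"),
             value.name = c("date", "gfr"), na.rm = FALSE)

# Keep real scans, plus exactly one placeholder row for ids without scans
kept <- long[!is.na(gfr) | (`GFR Scans?` == "No" & variable == "1")]
setorder(kept, id, variable)
kept
```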
    

    【Comments】:
