Grouping columns with the same name in R

【Question title】: Grouping columns with the same name in R
【Posted】: 2018-03-30 20:29:53
【Question】:

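I am trying to format my data in a "readable" way, but I have several columns with the same name. I tried the melt() function and could not get it to work; the problem seems to be related to the variables holding different values.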

A small example of the data:

obs     m   ti      td        date        class  code   dis       group  status     grade   freq    date              dis     group   status    grade   freq    date             dis    group   status  grade   freq    date
obs_1   A   grad        05/01/2016 00:00         55060  DDE0300  2016101    A        5.7     97   05/01/2016 15:20  MS0230  2016101      A      8.19    100 05/01/2016 15:20    A0301   2016101  A        5.8   100  27/01/2016 13:12
obs_2   A   grad        05/01/2016 00:00         55070  SSE332         0    D                     03/06/2016 14:08   A0804    0          D                  03/06/2016 14:18    SE089   0        D                   26/08/2016 19:31

Now I want to split this data frame by observation:

    melt(df[1,],id.vars=c("obs","m","ti","td","date","class","code"), 
            measure.vars=c("dis","group","status","grade","freq","date"))

I get:

    obs  m   ti td             date class  code variable            value
1 obs_1 A  grad NA 05/01/2016 15:20    NA 55060      dis          DDE0300
2 obs_1 A  grad NA 05/01/2016 15:20    NA 55060    group          2016101
3 obs_1 A  grad NA 05/01/2016 15:20    NA 55060   status               A 
4 obs_1 A  grad NA 05/01/2016 15:20    NA 55060    grade              5.7
5 obs_1 A  grad NA 05/01/2016 15:20    NA 55060     freq               97
6 obs_1 A  grad NA 05/01/2016 15:20    NA 55060     date 05/01/2016 15:20
Warning message:
attributes are not identical across measure variables; they will be dropped

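Now I am "missing" two columns, MS0230 and A0301, together with their group, status and so on. How can I solve this?

Keep in mind that the solution does not have to use the melt() function.

Code to reproduce the data: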

df<-structure(list(obs = structure(1:2, .Label = c("obs_1", "obs_2"
), class = "factor"), m = structure(c(1L, 1L), .Label = "A ", class = "factor"), 
    ti = structure(c(1L, 1L), .Label = "grad", class = "factor"), 
    td = c(NA, NA), datei = structure(c(1L, 1L), .Label = "05/01/2016 00:00", class = "factor"), 
    class = c(NA, NA), code = c(55060L, 55070L), dis = structure(1:2, .Label = c("DDE0300", 
    "SSE332"), class = "factor"), group = c(2016101L, 0L), status = structure(1:2, .Label = c("A ", 
    "D "), class = "factor"), grade = c(5.7, NA), freq = c(97L, 
    NA), date = structure(c(2L, 1L), .Label = c("03/06/2016 14:08", 
    "05/01/2016 15:20"), class = "factor"), dis = structure(c(2L, 
    1L), .Label = c("A0804", "MS0230"), class = "factor"), group = c(2016101L, 
    0L), status = structure(1:2, .Label = c("A ", "D "), class = "factor"), 
    grade = c(8.19, NA), freq = c(100L, NA), date = structure(c(2L, 
    1L), .Label = c("03/06/2016 14:18", "05/01/2016 15:20"), class = "factor"), 
    dis = structure(1:2, .Label = c("A0301", "SE089"), class = "factor"), 
    group = c(2016101L, 0L), status = structure(1:2, .Label = c("A ", 
    "D "), class = "factor"), grade = c(5.8, NA), freq = c(100L, 
    NA), date = structure(c(2L, 1L), .Label = c("26/08/2016 19:31", 
    "27/01/2016 13:12"), class = "factor")), .Names = c("obs", 
"m", "ti", "td", "datei", "class", "code", "dis", "group", "status", 
"grade", "freq", "date", "dis", "group", "status", "grade", "freq", 
"date", "dis", "group", "status", "grade", "freq", "date"), class = "data.frame", row.names = c(NA, 
-2L))

【Comments】:

This looks like a duplicate of "Reshaping multiple sets of measurement columns (wide format) into single columns (long format)". Try e.g. reshape(df, idvar = "obs", direction = "long", varying = list(dis = c(8, 14, 20), group = c(9, 15, 21), status = c(10, 16, 22), grade = c(11, 17, 23), freq = c(12, 18, 24), date = c(13, 19, 25)))
Please show the desired result. You apparently expect MS0230 and A0301 to be columns after the melt.
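A minimal sketch of the reshape() route suggested in the comment above, run on a made-up table with two measurement sets rather than on the asker's data. The names `toy`, `long` and `set` are assumptions for illustration only. The duplicated names are made unique with make.unique() first, so that every column can be addressed unambiguously by name:

```r
# Hypothetical toy data: two repeated (dis, grade) sets per observation.
# check.names = FALSE keeps the duplicated column names, as in the question.
toy <- data.frame(
  obs = c("obs_1", "obs_2"),
  dis = c("DDE0300", "SSE332"), grade = c(5.7, NA),
  dis = c("MS0230", "A0804"),   grade = c(8.19, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Turn dis, grade, dis, grade into dis, grade, dis.1, grade.1
names(toy) <- make.unique(names(toy))

# One list element per target column, each listing its source columns in order
long <- reshape(
  toy,
  idvar     = "obs",
  direction = "long",
  varying   = list(c("dis", "dis.1"), c("grade", "grade.1")),
  v.names   = c("dis", "grade"),
  timevar   = "set"
)

# Sort by observation, then by measurement set, and drop the generated row names
long <- long[order(long$obs, long$set), ]
rownames(long) <- NULL
print(long)
```

Each observation ends up with one row per measurement set, and `set` records which set the row came from.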

Tags: r dataframe data-manipulation

【Solution 1】:

Thanks to the link from Henrik, I managed to figure it out. I do not know whether this is the best solution.

But this is what I did:

melt(setDT(df[1,]), id=1L, id.vars=c("obs","m","ti","td","date","class","code"),
      measure=patterns("dis","group","status","grade","freq","date"),
      value.name=c("Dis","Group","Status","Grade","Freq","Date"))

This gives me:

     obs  m   ti td             date class  code variable     Dis   Group Status Grade Freq             Date
1: obs_1 A  grad NA 05/01/2016 15:20    NA 55060        1 DDE0300 2016101     A   5.70   97 05/01/2016 00:00
2: obs_1 A  grad NA 05/01/2016 15:20    NA 55060        2  MS0230 2016101     A   8.19  100 05/01/2016 15:20
3: obs_1 A  grad NA 05/01/2016 15:20    NA 55060        3   A0301 2016101     A   5.80  100 05/01/2016 15:20
4: obs_1 A  grad NA 05/01/2016 15:20    NA 55060        4      NA      NA     NA    NA   NA 27/01/2016 13:12
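For reference, the same kind of melt() call can be applied to every row at once instead of to df[1,] only. This is a sketch on a hypothetical toy table (the names `toy` and `long` are made up), and it assumes the data.table package is installed:

```r
library(data.table)

# Hypothetical toy data with duplicated column names, as in the question
toy <- as.data.table(data.frame(
  obs = c("obs_1", "obs_2"),
  dis = c("DDE0300", "SSE332"), grade = c(5.7, NA),
  dis = c("MS0230", "A0804"),   grade = c(8.19, NA),
  check.names = FALSE, stringsAsFactors = FALSE
))

# Each pattern collects one group of same-named columns;
# value.name supplies one output column name per pattern.
long <- melt(
  toy,
  id.vars      = "obs",
  measure.vars = patterns("dis", "grade"),
  value.name   = c("Dis", "Grade")
)
print(long)
```

The `variable` column of the result numbers the measurement sets, so rows from different observations stay distinguishable through `obs`.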

【Discussion】:

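Not quite, because in this particular case I only used melt() on the first row, so I only got DDE0300, MS0230 and A0301. I do not know why there is a fourth row.
Yes, it seems the first 'Date' is picking up 'date', while the last 'Date' (27/01/2016 13:12) is the actual last 'Date'. Any ideas on how to fix it?
I think I fixed it. In patterns(), if you type the string as "date", the function picks up every column whose name contains the string "date" and melts them. Using "date$" instead seems to pick up only the columns whose whole name matches. That said, in this case there are 2 columns named "date", but it seems to have fixed it. melt(setDT(df[1,]), id=1L, id.vars=c("obs","m","ti","td","date","class","code"), measure=patterns("dis","group","status","grade","freq","date$"), value.name=c("Dis","Group","Status","Grade","Freq","Date"))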
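The difference between the two patterns can be checked with plain grep() on the column names, without melting anything. The sketch below assumes the column names from the dput() output in the question (where the id column is called `datei`); the variable names `cols`, `loose`, `end_anchored` and `exact` are made up for illustration:

```r
# Column names as they appear in the dput() output of the question
cols <- c("obs", "m", "ti", "td", "datei", "class", "code",
          rep(c("dis", "group", "status", "grade", "freq", "date"), times = 3))

# "date" matches anywhere in the name, so "datei" is matched as well
loose <- grep("date", cols)

# "date$" requires the name to end in "date", which excludes "datei"
end_anchored <- grep("date$", cols)

# "^date$" requires the whole name to be exactly "date"
exact <- grep("^date$", cols)

print(cols[loose])
print(cols[end_anchored])
print(cols[exact])
```

The unanchored pattern finds four columns while every other pattern finds three, which is consistent with the extra fourth row filled with NA in the answer's output. Note that `$` only anchors the end of the name; `^date$` is the stricter form if other column names could also end in "date".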