【Question title】: Grouping columns with the same name in R
【Posted】: 2018-03-30 20:29:53
【Question description】:

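I am trying to format my data in a "readable" way, but it contains multiple columns with the same name. I tried the melt() function, but it did not solve the problem, which seems to be related to the variables holding different kinds of values.

A small example of the data: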

obs     m   ti      td        date        class  code   dis       group  status     grade   freq    date              dis     group   status    grade   freq    date             dis    group   status  grade   freq    date
obs_1   A   grad        05/01/2016 00:00         55060  DDE0300  2016101    A        5.7     97   05/01/2016 15:20  MS0230  2016101      A      8.19    100 05/01/2016 15:20    A0301   2016101  A        5.8   100  27/01/2016 13:12
obs_2   A   grad        05/01/2016 00:00         55070  SSE332         0    D                     03/06/2016 14:08   A0804    0          D                  03/06/2016 14:18    SE089   0        D                   26/08/2016 19:31

Now I want to split this data frame by observation:

    melt(df[1,],id.vars=c("obs","m","ti","td","date","class","code"), 
            measure.vars=c("dis","group","status","grade","freq","date"))

I get:

    obs  m   ti td             date class  code variable            value
1 obs_1 A  grad NA 05/01/2016 15:20    NA 55060      dis          DDE0300
2 obs_1 A  grad NA 05/01/2016 15:20    NA 55060    group          2016101
3 obs_1 A  grad NA 05/01/2016 15:20    NA 55060   status               A 
4 obs_1 A  grad NA 05/01/2016 15:20    NA 55060    grade              5.7
5 obs_1 A  grad NA 05/01/2016 15:20    NA 55060     freq               97
6 obs_1 A  grad NA 05/01/2016 15:20    NA 55060     date 05/01/2016 15:20
Warning message:
attributes are not identical across measure variables; they will be dropped 

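Now I am "missing" two sets of columns, the ones for MS0230 and A0301, together with their group, status, etc. How can I fix this?

Keep in mind that the solution does not have to use the melt() function.

Code to reproduce the data: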

df<-structure(list(obs = structure(1:2, .Label = c("obs_1", "obs_2"
), class = "factor"), m = structure(c(1L, 1L), .Label = "A ", class = "factor"), 
    ti = structure(c(1L, 1L), .Label = "grad", class = "factor"), 
    td = c(NA, NA), datei = structure(c(1L, 1L), .Label = "05/01/2016 00:00", class = "factor"), 
    class = c(NA, NA), code = c(55060L, 55070L), dis = structure(1:2, .Label = c("DDE0300", 
    "SSE332"), class = "factor"), group = c(2016101L, 0L), status = structure(1:2, .Label = c("A ", 
    "D "), class = "factor"), grade = c(5.7, NA), freq = c(97L, 
    NA), date = structure(c(2L, 1L), .Label = c("03/06/2016 14:08", 
    "05/01/2016 15:20"), class = "factor"), dis = structure(c(2L, 
    1L), .Label = c("A0804", "MS0230"), class = "factor"), group = c(2016101L, 
    0L), status = structure(1:2, .Label = c("A ", "D "), class = "factor"), 
    grade = c(8.19, NA), freq = c(100L, NA), date = structure(c(2L, 
    1L), .Label = c("03/06/2016 14:18", "05/01/2016 15:20"), class = "factor"), 
    dis = structure(1:2, .Label = c("A0301", "SE089"), class = "factor"), 
    group = c(2016101L, 0L), status = structure(1:2, .Label = c("A ", 
    "D "), class = "factor"), grade = c(5.8, NA), freq = c(100L, 
    NA), date = structure(c(2L, 1L), .Label = c("26/08/2016 19:31", 
    "27/01/2016 13:12"), class = "factor")), .Names = c("obs", 
"m", "ti", "td", "datei", "class", "code", "dis", "group", "status", 
"grade", "freq", "date", "dis", "group", "status", "grade", "freq", 
"date", "dis", "group", "status", "grade", "freq", "date"), class = "data.frame", row.names = c(NA, 
-2L))

【Comments】:

Tags: r dataframe data-manipulation


【Solution 1】:

Thanks to Henrik's link, I managed to figure it out. I don't know whether this is the best solution.

But this is what I did:

melt(setDT(df[1,]), id=1L, id.vars=c("obs","m","ti","td","date","class","code"),
      measure=patterns("dis","group","status","grade","freq","date"),
      value.name=c("Dis","Group","Status","Grade","Freq","Date"))

This gives me:

     obs  m   ti td             date class  code variable     Dis   Group Status Grade Freq             Date
1: obs_1 A  grad NA 05/01/2016 15:20    NA 55060        1 DDE0300 2016101     A   5.70   97 05/01/2016 00:00
2: obs_1 A  grad NA 05/01/2016 15:20    NA 55060        2  MS0230 2016101     A   8.19  100 05/01/2016 15:20
3: obs_1 A  grad NA 05/01/2016 15:20    NA 55060        3   A0301 2016101     A   5.80  100 05/01/2016 15:20
4: obs_1 A  grad NA 05/01/2016 15:20    NA 55060        4      NA      NA     NA    NA   NA 27/01/2016 13:12
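Since the question says melt() is not required, here is a minimal base R sketch that avoids matching on column names and slices the repeated columns by position instead. The toy data frame and the names `toy`, `id_cols` and `block_size` are illustrative and not from the original post. The sketch also assumes that every repeated block has the same width and column order.

```r
# Toy data with duplicated column names (check.names = FALSE keeps them as-is).
toy <- data.frame(
  obs   = c("obs_1", "obs_2"),
  code  = c(55060L, 55070L),
  dis   = c("DDE0300", "SSE332"),
  grade = c(5.7, NA),
  dis   = c("MS0230", "A0804"),
  grade = c(8.19, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

id_cols    <- 1:2   # assumption: identifier columns come first
block_size <- 2     # assumption: each repeated block is (dis, grade)

# First column index of every repeated block.
starts <- seq(max(id_cols) + 1, ncol(toy), by = block_size)

# Take the id columns plus one block at a time, then stack the pieces.
long <- do.call(rbind, lapply(seq_along(starts), function(k) {
  idx   <- starts[k]:(starts[k] + block_size - 1)
  block <- toy[, c(id_cols, idx), drop = FALSE]
  block$variable <- k   # which repetition this row came from
  block
}))
rownames(long) <- NULL
```

Because each slice contains only one copy of `dis` and `grade`, the duplicated names never collide, and `rbind()` can line the pieces up by name.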

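【Comments】:

  • Not exactly, because in this particular case I only used melt() on the first row, so I only got DDE0300, MS0230 and A0301. I don't know why there is a fourth row.
  • Yes, it seems the first 'Date' is picking up the 'datei' column, while the last 'Date' (27/01/2016 13:12) is the actual last 'date'. Any idea how to fix it?
  • I think I fixed it. In patterns(), if you type the string as "date", the function picks up every column whose name contains "date" and melts them. Using "date$" instead seems to pick up only the columns whose name matches in full. That being said, in this case there are 2 columns with 'date' in the name, but it seems to have fixed it: melt(setDT(df[1,]), id=1L, id.vars=c("obs","m","ti","td","date","class","code"), measure=patterns("dis","group","status","grade","freq","date$"), value.name=c("Dis","Group","Status","Grade","Freq","Date"))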
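The effect described in the last comment can be checked directly on the column names, without running melt() at all. The sketch below rebuilds the name vector from the dput() output in the question and compares the two regular expressions with base R's grep(). The variable names `col_names`, `loose` and `anchored` are illustrative.

```r
# Column names as they appear in the dput() output of the question.
col_names <- c("obs", "m", "ti", "td", "datei", "class", "code",
               rep(c("dis", "group", "status", "grade", "freq", "date"), 3))

# Unanchored pattern: matches any name containing "date", including "datei".
loose <- grep("date", col_names)

# Anchored pattern: matches only names that end in "date".
anchored <- grep("date$", col_names)

print(col_names[loose])     # "datei" "date" "date" "date"
print(col_names[anchored])  # "date" "date" "date"
```

The unanchored pattern matches four columns while every other pattern matches three. This is consistent with the extra fourth row and with the first 'Date' value coming from `datei`.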