【Question title】:Keep observations based on other columns
【Posted】:2017-08-03 02:49:01
【Question】:

This question is an extension of the one here.
Suppose my data has an extra column named Remark:

ID    Name    Type    Date          Amount   Remark
1     AAAA    First   2009/7/20     100      Not want
1     AAAA    First   2010/2/3      200      want ya
2     BBBB    First   2015/3/10     250      
2     CCC     Second  2009/2/23     300      good
2     CCC     Second  2010/1/25     400      OK Right123
2     CCC     Third   2015/4/9      500      
2     CCC     Third   2016/6/25     700      Stackoverflow is awesome

Within each group, I want to keep the row where Date is at its maximum.
First, if I ignore the Remark column, I can use max() to get this:

dt[,.(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]
   ID Name   Type       Date  Amount
1:  1 AAAA  First 2010-02-03     300
2:  2 BBBB  First 2015-03-10     250
3:  2  CCC Second 2010-01-25     700
4:  2  CCC  Third 2016-06-25    1200

But how can I keep Remark as well, so that the result looks like this?

   ID Name   Type       Date  Amount      Remark
1:  1 AAAA  First 2010-02-03     300      want ya
2:  2 BBBB  First 2015-03-10     250      
3:  2  CCC Second 2010-01-25     700      OK Right123
4:  2  CCC  Third 2016-06-25    1200      Stackoverflow is awesome

Here is my data:

dt <- fread("
        ID    Name    Type    Date          Amount   Remark
        1     AAAA    First   2009/7/20     100      Not.want
        1     AAAA    First   2010/2/3      200      want.ya
        2     BBBB    First   2015/3/10     250      
        2     CCC     Second  2009/2/23     300      good
        2     CCC     Second  2010/1/25     400      OK.Right123
        2     CCC     Third   2015/4/9      500      
        2     CCC     Third   2016/6/25     700      Stackoverflow.is.awesome
        ")
dt$Date <- as.Date(dt$Date)
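
Because the comments report that the fread() call fails, here is a minimal sketch of one alternative: build the same seven rows directly with data.table(). Two things in it are assumptions rather than part of the original post: the remarks use spaces (as in the printed table) instead of the dotted form, and empty remarks are stored as "".

```r
library(data.table)

# Same seven rows as the table above, built column by column.
# Assumption: empty remarks are stored as "" and remarks keep their spaces.
dt <- data.table(
  ID     = c(1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Name   = c("AAAA", "AAAA", "BBBB", "CCC", "CCC", "CCC", "CCC"),
  Type   = c("First", "First", "First", "Second", "Second", "Third", "Third"),
  Date   = as.Date(c("2009/7/20", "2010/2/3", "2015/3/10", "2009/2/23",
                     "2010/1/25", "2015/4/9", "2016/6/25"),
                   format = "%Y/%m/%d"),
  Amount = c(100, 200, 250, 300, 400, 500, 700),
  Remark = c("Not want", "want ya", "", "good", "OK Right123", "",
             "Stackoverflow is awesome")
)

# Date should now be a real Date column, not character.
str(dt)
```

Passing the format explicitly avoids relying on how as.Date() guesses the layout of the input strings.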

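【Comments】:

  • Please provide the data in a reproducible format.
  • @Frank I have edited my question.
  • See stackoverflow.com/questions/5963269/… We should be able to copy and paste your code into a fresh R session and see the same example data. I still see non-dates in there... Also, running fread gives an error.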

Tags: r duplicates data.table data-manipulation


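【Solution 1】:

We can use a join: aggregate per group first, then join the result back to the original rows on the group columns and Date.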

setcolorder(dt[, setdiff(names(dt), "Amount"), with = FALSE][dt[,  .(Date = max(Date), 
                 Amount = sum(Amount)),
       by = .(ID, Name, Type)], on = .(ID, Name, Type, Date)], names(dt))[]
#   ID Name   Type       Date Amount                   Remark
#1:  1 AAAA  First 2010-02-03    300                  want ya
#2:  2 BBBB  First 2015-03-10    250                         
#3:  2  CCC Second 2010-01-25    700              OK Right123
#4:  2  CCC  Third 2016-06-25   1200 Stackoverflow is awesome
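
The one-liner above is dense, so here is a sketch of the same join split into named steps. The intermediate names agg, rest, and res are introduced only for illustration, and the data is rebuilt with data.table() so the block runs on its own.

```r
library(data.table)

dt <- data.table(
  ID     = c(1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Name   = c("AAAA", "AAAA", "BBBB", "CCC", "CCC", "CCC", "CCC"),
  Type   = c("First", "First", "First", "Second", "Second", "Third", "Third"),
  Date   = as.Date(c("2009-07-20", "2010-02-03", "2015-03-10", "2009-02-23",
                     "2010-01-25", "2015-04-09", "2016-06-25")),
  Amount = c(100, 200, 250, 300, 400, 500, 700),
  Remark = c("Not want", "want ya", "", "good", "OK Right123", "",
             "Stackoverflow is awesome")
)

# Step 1: one row per group with the latest Date and the total Amount.
agg <- dt[, .(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]

# Step 2: drop the per-row Amount so only the summed Amount survives the join.
rest <- dt[, setdiff(names(dt), "Amount"), with = FALSE]

# Step 3: for each row of agg, look up the matching row of rest.
res <- rest[agg, on = .(ID, Name, Type, Date)]

# Step 4: restore the original column order.
setcolorder(res, names(dt))
res
```

One caveat of this design: if several rows in a group share the maximum Date, the join returns all of them, each carrying the group total.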

Or without a join:

dt1 <- dt[, c(Amount = sum(.SD[["Amount"]]), .SD[which.max(Date), 
  setdiff(names(.SD), "Amount"), with = FALSE]), .(ID, Name, Type)]

setcolorder(dt1, names(dt))
dt1
#   ID Name   Type       Date Amount                   Remark
#1:  1 AAAA  First 2010-02-03    300                  want ya
#2:  2 BBBB  First 2015-03-10    250                         
#3:  2  CCC Second 2010-01-25    700              OK Right123
#4:  2  CCC  Third 2016-06-25   1200 Stackoverflow is awesome

If there are more "Amount" columns to be summed:

nm1 <- grep("Amount\\d*", names(dt), value = TRUE)
# wrap max(Date) in list() so that c() builds a list rather than dispatching to c.Date
setcolorder(dt[, setdiff(names(dt), nm1), with = FALSE][dt[, c(list(Date = max(Date)),
       lapply(.SD, sum)), by = .(ID, Name, Type), .SDcols = nm1],
      on = .(ID, Name, Type, Date)], names(dt))[]
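
To make the multi-column case concrete, here is a small sketch on made-up data. The table dt2 and its Amount1 and Amount2 columns are hypothetical and exist only for this illustration. The pattern is anchored with ^ and $ so that it matches only Amount, Amount1, Amount2 and so on.

```r
library(data.table)

# Hypothetical data: Amount1 and Amount2 are invented for this example.
dt2 <- data.table(
  ID      = c(1L, 1L, 2L),
  Name    = c("AAAA", "AAAA", "BBBB"),
  Type    = c("First", "First", "First"),
  Date    = as.Date(c("2009-07-20", "2010-02-03", "2015-03-10")),
  Amount  = c(100, 200, 250),
  Amount1 = c(1, 2, 3),
  Amount2 = c(10, 20, 30),
  Remark  = c("Not want", "want ya", "")
)

# All columns to be summed.
nm1 <- grep("^Amount\\d*$", names(dt2), value = TRUE)

# Latest Date plus the sum of every Amount* column, per group.
# list(Date = ...) keeps c() from treating the whole result as a Date vector.
agg <- dt2[, c(list(Date = max(Date)), lapply(.SD, sum)),
           by = .(ID, Name, Type), .SDcols = nm1]

# Join the non-summed columns back on the group keys and the latest Date.
res <- dt2[, setdiff(names(dt2), nm1), with = FALSE][
  agg, on = .(ID, Name, Type, Date)]
setcolorder(res, names(dt2))
res
```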

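【Comments】:

  • What should I do if I have more columns that need to be summed, say three (Amount, Amount1, Amount2)?
  • @PeterChen In that case, use dt[, c(list(Date = max(Date)), lapply(.SD, sum)), by = .(ID, Name, Type), .SDcols = AmountCols] as the second link of the chain in the first solution, and change the setdiff call so that it drops all of the "Amount" columns.

【Solution 2】: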
> df
   ID Name   Type       Date Amount                   Remark
1:  1 AAAA  First 03-02-2010    200                  want ya
2:  2  CCC  Third 09-04-2015    500                         
3:  2 BBBB  First 10-03-2015    250                         
4:  1 AAAA  First 20-07-2009    100                 Not want
5:  2  CCC Second 23-02-2009    300                     good
6:  2  CCC Second 25-01-2010    400              OK Right123
7:  2  CCC  Third 25-06-2016    700 Stackoverflow is awesome

> df2=df[,.(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]
> df2
   ID Name   Type       Date Amount
1:  2 BBBB  First 10-03-2015    250
2:  1 AAAA  First 20-07-2009    300
3:  2  CCC Second 25-01-2010    700
4:  2  CCC  Third 25-06-2016   1200


> df[df2,]
   ID Name   Type       Date Amount                   Remark i.ID i.Name i.Type i.Amount
1:  2 BBBB  First 10-03-2015    250                             2   BBBB  First      250
2:  1 AAAA  First 20-07-2009    100                 Not want    1   AAAA  First      300
3:  2  CCC Second 25-01-2010    400              OK Right123    2    CCC Second      700
4:  2  CCC  Third 25-06-2016    700 Stackoverflow is awesome    2    CCC  Third     1200


> df3=df[df2,c("ID","Name","Type","Date","Remark","i.Amount")]
> df3
   ID Name   Type       Date                   Remark i.Amount
1:  2 BBBB  First 10-03-2015                               250
2:  1 AAAA  First 20-07-2009                 Not want      300
3:  2  CCC Second 25-01-2010              OK Right123      700
4:  2  CCC  Third 25-06-2016 Stackoverflow is awesome     1200

【Comments】:

  • There is a problem with your answer; the result is not correct. The approach is right, though.
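
The comment does not say what is wrong. One likely cause, visible in the output above, is that Date is stored as dd-mm-yyyy text, so max() compares strings and picks 20-07-2009 over 03-02-2010 for the AAAA group. A minimal sketch of the same approach with the dates parsed first and the join columns named explicitly (both changes are assumptions about what the answer intended):

```r
library(data.table)

# Text comparison picks the wrong "latest" date for dd-mm-yyyy strings.
max(c("03-02-2010", "20-07-2009"))

df <- data.table(
  ID     = c(1L, 2L, 2L, 1L, 2L, 2L, 2L),
  Name   = c("AAAA", "CCC", "BBBB", "AAAA", "CCC", "CCC", "CCC"),
  Type   = c("First", "Third", "First", "First", "Second", "Second", "Third"),
  Date   = c("03-02-2010", "09-04-2015", "10-03-2015", "20-07-2009",
             "23-02-2009", "25-01-2010", "25-06-2016"),
  Amount = c(200, 500, 250, 100, 300, 400, 700),
  Remark = c("want ya", "", "", "Not want", "good", "OK Right123",
             "Stackoverflow is awesome")
)

# Parse the strings so that max() compares real dates.
df[, Date := as.Date(Date, format = "%d-%m-%Y")]

df2 <- df[, .(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]

# Join on the group columns and Date, taking the summed Amount from df2.
df3 <- df[df2, on = .(ID, Name, Type, Date),
          .(ID, Name, Type, Date, Amount = i.Amount, Remark)]
df3
```

Naming the columns in on = also removes the dependence on whatever key df happened to have when df[df2, ] was run.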