【Question title】:Finding index corresponding to the maximum value
【Posted】:2017-01-16 01:54:05
【Question】:

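I found the thread "Find rows in dataframe with maximum values grouped by values in another column", which already discusses one solution. I am using that solution repeatedly to find the row index of the maximum value. However, my solution is ugly: it is very procedural rather than vectorized.

Here is my dummy data: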

dput(Data)

structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999, 
1999, 2000, 2000, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2002, 
2002, 2002, 2002), Ship_Year = c(1997, 1998, 1999, 2000, 2001, 
2002, NA, 1997, NA, 1997, 1998, 1999, 2000, 2001, 2002, NA, 1997, 
1998, 1999, 2000), Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62, 
627206.75998, 531087.43, 122167.02, 143855.55, 0, 0, 0, 0, 53650.389998, 
17708416.3198, 98196.4, 31389, 0), Units = c(37, 1, 8, 5, 8, 
8, 730, 99, 91, 195, 259, 4, 1, 3, 3, 53, 3844, 142, 63, 27)), .Names = c("Order_Year", 
"Ship_Year", "Yen", "Units"), row.names = c(NA, 20L), class = "data.frame")

For each Order_Year, I want to find the Ship_Year that has the largest Yen and the Ship_Year that has the largest Units.

Here is what I tried:

library(dplyr)  # provides %>% and left_join()

# Row with the largest Yen within each Order_Year
a <- do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Yen), ]))
rownames(a) <- NULL
a$Yen <- NULL
a$Units <- NULL
# a has the Ship_Year for which Yen is max for a given Order_Year
names(a)[2] <- "by.Yen"

# Now find the Ship_Year with the largest Units
b <- do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Units), ]))
rownames(b) <- NULL
b$Yen <- NULL
b$Units <- NULL
# b has the Ship_Year for which Units is max for a given Order_Year
names(b)[2] <- "by.Qty"

# Combine both results; note that the name `c` shadows base::c()
c <- a %>% left_join(b, by = "Order_Year")

The expected output is:

c
  Order_Year by.Yen by.Qty
1       1999     NA     NA
2       2000   1997   1997
3       2001   1998   1998
4       2002   1997   1997

Although I get the expected output, the approach above is very clumsy. Is there a better way to handle this?

【Comments】:

    Tags: r


    【Solution 1】:

    which.max works well with dplyr grouping:

    library(dplyr)
    
    Data %>% group_by(Order_Year) %>% 
        summarise(by.Yen = Ship_Year[which.max(Yen)], 
                  by.Units = Ship_Year[which.max(Units)])
    
    ## # A tibble: 4 × 3
    ##   Order_Year by.Yen by.Units
    ##        <dbl>  <dbl>    <dbl>
    ## 1       1999     NA       NA
    ## 2       2000   1997     1997
    ## 3       2001   1998     1998
    ## 4       2002   1997     1997
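
The NA for 1999 is not a bug: in that group the row with the largest Yen (and the largest Units) has a missing Ship_Year. A minimal sketch of this, using the 1999 rows copied from the question's data; the `known` filter at the end is a hypothetical variant, not part of the answer:

```r
# Rows for Order_Year 1999, copied from the question's data
ship_year <- c(1997, 1998, 1999, 2000, 2001, 2002, NA)
yen       <- c(202598.2, 0, 0, 0, 0, 0, 2365901.62)

idx <- which.max(yen)  # position of the first maximum of yen
idx                    # 7
ship_year[idx]         # NA: the row with the largest Yen has no Ship_Year

# Hypothetical variant: drop rows with a missing Ship_Year before ranking
known <- !is.na(ship_year)
ship_year[known][which.max(yen[known])]  # 1997
```

Whether the NA should be kept or filtered out depends on what a missing Ship_Year means in the data; the expected output in the question keeps it.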
    

    【Comments】:

      【Solution 2】:

      Using base R

      a1 <- with(df1,
                 by(data    = df1,
                    INDICES = Order_Year, 
                    FUN     = function(x) list(Yen   = x$Ship_Year[which.max(x$Yen)],
                                               Units = x$Ship_Year[which.max(x$Units)])))
      
      do.call("rbind", lapply(a1, function(x) data.frame(x)))
      #       Yen Units
      # 1999   NA    NA
      # 2000 1997  1997
      # 2001 1998  1998
      # 2002 1997  1997
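
As a rough alternative sketch (not the answer's code), the same per-group lookup can be written with split() and vapply(), which returns a data frame directly. The data below is a small subset of the question's data (Order_Year 2000 and 2001 only), and the names `df`, `groups` and `res` are made up for illustration:

```r
# Subset of the question's data: Order_Year 2000 and 2001 only
df <- data.frame(
  Order_Year = c(2000, 2000, 2001, 2001, 2001),
  Ship_Year  = c(1997, NA, 1997, 1998, NA),
  Yen        = c(627206.75998, 531087.43, 122167.02, 143855.55, 53650.389998),
  Units      = c(99, 91, 195, 259, 53)
)

# One data frame per Order_Year
groups <- split(df, df$Order_Year)

# For each group, pick the Ship_Year at the position of the maximum
res <- data.frame(
  Order_Year = as.numeric(names(groups)),
  by.Yen     = vapply(groups, function(g) g$Ship_Year[which.max(g$Yen)],
                      numeric(1), USE.NAMES = FALSE),
  by.Units   = vapply(groups, function(g) g$Ship_Year[which.max(g$Units)],
                      numeric(1), USE.NAMES = FALSE)
)
res
```

vapply() with `numeric(1)` fails loudly if a group ever yields something other than a single number, for example an empty group.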
      

      Data:

      df1 <- structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999, 1999,
                                           2000, 2000, 2001, 2001, 2001, 2001, 2001,
                                           2001, 2001, 2002, 2002, 2002, 2002),
                            Ship_Year = c(1997, 1998, 1999, 2000, 2001, 2002, NA, 
                                          1997, NA, 1997, 1998, 1999, 2000, 2001, 
                                          2002, NA, 1997, 1998, 1999, 2000),
                            Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62, 627206.75998, 
                                    531087.43, 122167.02, 143855.55, 0, 0, 0, 0,
                                    53650.389998, 17708416.3198, 98196.4, 31389, 0), 
                            Units = c(37, 1, 8, 5, 8, 8, 730, 99, 91, 195, 259, 4,
                                      1, 3, 3, 53, 3844, 142, 63, 27)), 
                       .Names = c("Order_Year", "Ship_Year", "Yen", "Units"), 
                       row.names = c(NA, 20L),
                       class = "data.frame")
      

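      【Comments】:

      • The edited answer gives the correct solution. The earlier aggregate function would not always work; the fact that its output matched your expected result was a coincidence. Hope this helps, without needing any packages.
      【Solution 3】:

      We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(Data)). Grouped by 'Order_Year', we get the index of the maximum of 'Yen' and of 'Units' with match, then subset the corresponding 'Ship_Year' values with that index to return the summarised output.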

      library(data.table)
      setDT(Data)[,.(by.Yen = Ship_Year[match(max(Yen), Yen)],
              by.Units = Ship_Year[match(max(Units), Units)]) , Order_Year]
      #   Order_Year by.Yen by.Units
      #1:       1999     NA       NA
      #2:       2000   1997     1997
      #3:       2001   1998     1998
      #4:       2002   1997     1997
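
For reference, match(max(x), x) and which.max(x) both return the position of the first maximum when x has no missing values, which is the case for Yen and Units here. They differ when x contains NA. A small sketch with made-up vectors `x` and `y`:

```r
x <- c(3, 7, 7, 1)
which.max(x)      # 2: first position of the maximum
match(max(x), x)  # 2: same result when there are no NAs

y <- c(3, NA, 7)
which.max(y)                    # 3: NAs are skipped
max(y)                          # NA
match(max(y), y)                # 2: the NA maximum matches the NA element
match(max(y, na.rm = TRUE), y)  # 3: agrees with which.max() again
```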
      

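      If there are many columns, we can specify the columns of interest in .SDcols. Grouped by 'Order_Year', we loop over the Subset of Data.table (.SD) to get the index of the maximum of each column, unlist the list output, subset 'Ship_Year' with that index, convert the result to a list (as.list), and set the column names to 'by.Yen' and 'by.Units'.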

      setnames(setDT(Data)[,  as.list(Ship_Year[unlist(lapply(.SD, 
        which.max))]), Order_Year, .SDcols = c("Yen", "Units")], 
                      2:3, c("by.Yen", "by.Units"))[]
      #    Order_Year by.Yen by.Units
      #1:       1999     NA       NA
      #2:       2000   1997     1997
      #3:       2001   1998     1998
      #4:       2002   1997     1997
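
To make the steps easier to follow, here is a sketch of what happens inside a single group, written in plain base R. The list `sd_2001` stands in for .SD in the Order_Year 2001 group and is copied from the question's data; the variable names are made up for illustration:

```r
# One group (Order_Year 2001) from the question's data, standing in for .SD
sd_2001 <- list(
  Yen   = c(122167.02, 143855.55, 0, 0, 0, 0, 53650.389998),
  Units = c(195, 259, 4, 1, 3, 3, 53)
)
ship_year <- c(1997, 1998, 1999, 2000, 2001, 2002, NA)

# Step 1: position of the maximum in each column, flattened to a vector
idx <- unlist(lapply(sd_2001, which.max))  # Yen = 2, Units = 2

# Step 2: look up Ship_Year at those positions; as.list() makes each
# value its own element, which data.table turns into its own column
out <- as.list(ship_year[idx])

# Step 3: name the elements, as setnames() does on the full result
names(out) <- c("by.Yen", "by.Units")
out
```

In the data.table call these steps run once per Order_Year, and the per-group lists are stacked into the final table.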
      

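      【Comments】:

      • @akrun- Thank you very much for your help. Would you mind explaining the steps? I tried running your code but could not really understand it.
      • @watchtower I updated the explanation. Hope it helps.
      • Thank you for your help. I would like to accept Alistaire's answer because it is simple. I hope you understand.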