Get the minimum grouped by unique combination of two columns: answers

【Question title】: Get minimum grouped by unique combination of two columns
【Posted】: 2015-09-29 10:08:12
【Question】:

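What I want to achieve in R is the following: given a table (a data frame in my case), I want to get the lowest price for each unique combination of two columns.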

For example, given the following table:

+-----+-----------+-------+----------+----------+
| Key | Feature1  | Price | Feature2 | Feature3 |
+-----+-----------+-------+----------+----------+
| AAA |         1 |   100 | whatever | whatever |
| AAA |         1 |   150 | whatever | whatever |
| AAA |         1 |   200 | whatever | whatever |
| AAA |         2 |   110 | whatever | whatever |
| AAA |         2 |   120 | whatever | whatever |
| BBB |         1 |   100 | whatever | whatever |
+-----+-----------+-------+----------+----------+

I would like a result that looks like this:

+-----+-----------+-------+----------+----------+
| Key | Feature1  | Price | Feature2 | Feature3 |
+-----+-----------+-------+----------+----------+
| AAA |         1 |   100 | whatever | whatever |
| AAA |         2 |   110 | whatever | whatever |
| BBB |         1 |   100 | whatever | whatever |
+-----+-----------+-------+----------+----------+

So I am working on a solution along the following lines:

s <- lapply(split(data, list(data$Key, data$Feature1)), function(chunk) { 
        chunk[which.min(chunk$Price),]})

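But the result is a 1 x n matrix, so I need to unsplit the result. Also, it seems slow. How can I improve this logic? I have seen solutions pointing towards the data.table package. Should I rewrite it using that package?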
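The list produced by lapply can be stacked back into one data frame with do.call(rbind, ...). The following is a minimal base R sketch, assuming the example table is rebuilt as a data frame named data; the drop = TRUE argument and the names chunks and result are choices made for this illustration, not part of the original post.

```r
# Rebuild the example table from the question.
data <- data.frame(
  Key      = c("AAA", "AAA", "AAA", "AAA", "AAA", "BBB"),
  Feature1 = c(1, 1, 1, 2, 2, 1),
  Price    = c(100, 150, 200, 110, 120, 100),
  Feature2 = "whatever",
  Feature3 = "whatever",
  stringsAsFactors = FALSE
)

# drop = TRUE skips Key/Feature1 combinations that never occur (BBB with 2).
chunks <- split(data, list(data$Key, data$Feature1), drop = TRUE)

# Keep the cheapest row of every chunk, with all of its columns.
s <- lapply(chunks, function(chunk) chunk[which.min(chunk$Price), ])

# Stack the one-row data frames back into a single data frame.
result <- do.call(rbind, s)
rownames(result) <- NULL
```

This keeps the original split/lapply logic and only adds the recombination step; it does not address the speed concern.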

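Update

Great answers guys, thank you! However, my original data frame contains more columns (Feature2 ...), and I need all of them back after filtering. Rows that do not have the lowest price (for a Key/Feature1 combination) can be discarded, so I am not interested in their Feature2 / Feature3 values.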

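【Comments】:

What logic should be used to pick the values of the other columns? For example, if Feature2 has different values for the same Key/Feature1 combination, which value must appear in the output?
The values belonging to the lowest price. So this needs to work as a row filter: the "whatever" values of AAA-1, AAA-2 and BBB-1. The remaining rows can be discarded.

Tags: r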

【Solution 1】:

Since you mentioned the data.table package, here is a solution that uses it:

library(data.table)
setDT(df)[,.(Price=min(Price)),.(Key, Feature1)] #initial question
setDT(df)[,.SD[which.min(Price)],.(Key, Feature1)] #updated question

df is your example data.frame.

Update: tested with the mtcars data

df<-mtcars
library(data.table)
setDT(df)[,.SD[which.min(mpg)],by=am]
   am  mpg cyl disp  hp drat   wt  qsec vs gear carb
1:  1 15.0   8  301 335 3.54 3.57 14.60  0    5    8
2:  0 10.4   8  472 205 2.93 5.25 17.98  0    3    4
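For comparison, the same row filter can be applied to the example table from the question. This is a sketch only, assuming the table is rebuilt as a data.table named dt (a name chosen here). The second variant, which collects row numbers with .I and subsets once, is a commonly suggested alternative when there are many groups; it is not part of the original answer.

```r
library(data.table)

# Rebuild the example table from the question.
dt <- data.table(
  Key      = c("AAA", "AAA", "AAA", "AAA", "AAA", "BBB"),
  Feature1 = c(1, 1, 1, 2, 2, 1),
  Price    = c(100, 150, 200, 110, 120, 100),
  Feature2 = "whatever",
  Feature3 = "whatever"
)

# Row filter from the answer: cheapest row per group, all columns kept.
cheapest <- dt[, .SD[which.min(Price)], by = .(Key, Feature1)]

# Alternative: collect the row number of the cheapest row in each group,
# then subset the table once with those row numbers.
idx <- dt[, .I[which.min(Price)], by = .(Key, Feature1)]$V1
cheapest_by_index <- dt[idx]
```

Both variants return one row per Key/Feature1 combination; when several rows share the lowest price, which.min keeps only the first of them.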

【Discussion】:

【Solution 2】:

You can use the dplyr package:

library(dplyr)

data %>% group_by(Key, Feature1) %>%
         slice(which.min(Price))
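A short sketch of the pipeline above on the example table, to show that slice() filters rows and therefore keeps every column. The data frame is rebuilt here as data; the slice_min() variant is an addition for illustration (it assumes dplyr 1.0.0 or later) and is not part of the original answer.

```r
library(dplyr)

# Rebuild the example table from the question.
data <- data.frame(
  Key      = c("AAA", "AAA", "AAA", "AAA", "AAA", "BBB"),
  Feature1 = c(1, 1, 1, 2, 2, 1),
  Price    = c(100, 150, 200, 110, 120, 100),
  Feature2 = "whatever",
  Feature3 = "whatever",
  stringsAsFactors = FALSE
)

# The pipeline from the answer; ungroup() returns a plain tibble.
cheapest <- data %>%
  group_by(Key, Feature1) %>%
  slice(which.min(Price)) %>%
  ungroup()

# Same filter with slice_min(); with_ties = FALSE keeps a single row per
# group even when several rows share the lowest price.
cheapest_alt <- data %>%
  group_by(Key, Feature1) %>%
  slice_min(Price, n = 1, with_ties = FALSE) %>%
  ungroup()
```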

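【Discussion】:

Works great, but I need all columns back in the result. I simplified the example a bit. In reality the data contains more columns, which I need in the result.

【Solution 3】:

Using base R's aggregate: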

> aggregate(Price~Key+Feature1, min, data=data)
  Key Feature1 Price
1 AAA        1   100
2 BBB        1   100
3 AAA        2   110

See this post for other options.

【Discussion】:

【Solution 4】:

A base R solution is aggregate(Price ~ Key + Feature1, data, FUN = min)
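Since aggregate returns only Key, Feature1 and Price, one way to recover the remaining columns in base R is to merge the aggregated minima back onto the original data. This is a sketch under that assumption, not part of the original answer; the names mins and result are chosen here.

```r
# Rebuild the example table from the question.
data <- data.frame(
  Key      = c("AAA", "AAA", "AAA", "AAA", "AAA", "BBB"),
  Feature1 = c(1, 1, 1, 2, 2, 1),
  Price    = c(100, 150, 200, 110, 120, 100),
  Feature2 = "whatever",
  Feature3 = "whatever",
  stringsAsFactors = FALSE
)

# Lowest price per Key/Feature1 combination (three columns only).
mins <- aggregate(Price ~ Key + Feature1, data = data, FUN = min)

# merge() joins on the columns the two data frames share (Key, Feature1,
# Price), so only the rows carrying the group minimum survive.
result <- merge(mins, data)
```

Unlike which.min, this keeps every row that ties for the lowest price within a group.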

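【Discussion】:

Very elegant, but I need all columns back in the result. I simplified the example a bit. In reality the data contains more columns, which I need in the result.
Do you mean that you want the minimum returned within the original data frame? If so, use ave(data$Price, data$Key, data$Feature1, FUN = min).
No, please see the updated question. I only want the rows with the lowest value (for each unique combination of Key + Feature1), but with all the original values. I tried your code and it returns only 3 columns: Key, Feature1 and Price, but I also need all the other original columns.
Ah, I see. jeremycg's dplyr solution looks good. A data.table version that would do the same thing is setDT(data)[, lapply(.SD, min), by = list(Key, Feature1)], provided you did data <- data.table(data).
@user227710 I see your update now. Your solution works.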
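The ave() suggestion from the comments can also serve as a row filter: compute the group minimum for every row, then keep the rows whose price equals it. A minimal base R sketch, assuming the example table is rebuilt as data; combining the two keys with interaction(..., drop = TRUE) is a choice made here so that combinations that never occur are not passed to min().

```r
# Rebuild the example table from the question.
data <- data.frame(
  Key      = c("AAA", "AAA", "AAA", "AAA", "AAA", "BBB"),
  Feature1 = c(1, 1, 1, 2, 2, 1),
  Price    = c(100, 150, 200, 110, 120, 100),
  Feature2 = "whatever",
  Feature3 = "whatever",
  stringsAsFactors = FALSE
)

# One grouping factor per Key/Feature1 pair, without unused combinations.
grp <- interaction(data$Key, data$Feature1, drop = TRUE)

# Group minimum repeated for every row of the original data.
group_min <- ave(data$Price, grp, FUN = min)

# Keep the rows whose price equals their group minimum; all columns survive.
result <- data[data$Price == group_min, ]
```

As with the merge approach, rows that tie for the lowest price in a group are all kept.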