Creating a subset of a data table in R based on another column condition: answers

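【Question title】: Creating Subset of data table in R based on another column condition
【Posted】: 2018-04-05 06:05:51
【Question】:

I want to create a subset of the following candyData in R: group the data by Brand and, for each unique Brand, find and print the maximum value for Candy A and for Candy B. For example, in the new data the Brand value Nestle should appear twice, once with Candy A and once with Candy B, with the corresponding maximum value in the third column, and likewise for every other Brand. Thanks, please help.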

candyData <- read.table(
text = "
Brand       Candy           value
Nestle      A               12
Nestle      B               34
Nestle      A               32
Hershey's   A               55
Hershey's   B               14
Hershey's   B               19
Mars        B               24
Nestle      B               26
Nestle      A               28
Hershey's   B               23
Hershey's   B               23
Hershey's   A               65
Mars        A               23
Mars        B               34",
header = TRUE,
stringsAsFactors = FALSE)

【Comments】:

You could try this: candyData %>% dplyr::group_by(Brand, Candy) %>% dplyr::summarise(maxValue = max(value))

Tags: r dplyr data.table plyr

【Solution 1】:

Try this:

library(dplyr)
candyData %>% 
  group_by(Brand, Candy) %>% 
  summarise(max=max(value))

The output will be:

# A tibble: 6 x 3
# Groups:   Brand [?]
  Brand     Candy   max
  <chr>     <chr> <dbl>
1 Hershey's A       65.
2 Hershey's B       23.
3 Mars      A       23.
4 Mars      B       34.
5 Nestle    A       32.
6 Nestle    B       34.

【Solution 2】:

aggregate(value ~ ., candyData, max)

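This groups candyData by Brand and Candy (the . on the right-hand side stands for every column other than value, and those are the only two) and returns the max of value for each group.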
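
To make the role of the . concrete, here is a minimal sketch that writes the grouping columns out explicitly and compares the two calls. It assumes the sample data from the question, rebuilt with data.frame() so the block runs on its own; the names candy, byDot and byName are illustrative and not part of the original answer.

```r
# Sample data from the question, rebuilt inline (illustrative name: candy)
candy <- data.frame(
  Brand = c("Nestle", "Nestle", "Nestle", "Hershey's", "Hershey's", "Hershey's", "Mars",
            "Nestle", "Nestle", "Hershey's", "Hershey's", "Hershey's", "Mars", "Mars"),
  Candy = c("A", "B", "A", "A", "B", "B", "B", "B", "A", "B", "B", "A", "A", "B"),
  value = c(12, 34, 32, 55, 14, 19, 24, 26, 28, 23, 23, 65, 23, 34),
  stringsAsFactors = FALSE
)

# "." expands to every column other than the response (value)
byDot <- aggregate(value ~ ., candy, max)

# The same grouping with the columns written out
byName <- aggregate(value ~ Brand + Candy, candy, max)

# Both calls should describe the same six Brand/Candy groups
all.equal(byDot, byName)
```

Writing the columns out is the safer form if the data frame later gains extra columns, since . would then silently group by those as well.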

【Discussion】:

Perfect answer. aggregate is definitely worth learning. Thanks @Jordi
This is the approach I would suggest (+1)

【Solution 3】:

Here are a few more solutions:

cd <- read.table(
    text = "
    Brand       Candy           value
    Nestle      A               12
    Nestle      B               34
    Nestle      A               32
    Hershey's   A               55
    Hershey's   B               14
    Hershey's   B               19
    Mars        B               24
    Nestle      B               26
    Nestle      A               28
    Hershey's   B               23
    Hershey's   B               23
    Hershey's   A               65
    Mars        A               23
    Mars        B               34",
    header = TRUE,
    stringsAsFactors = FALSE)

#using split + lapply or equivalently, by
c(by(cd$value, paste(cd$Brand, cd$Candy), max))

#using tapply i.e. apply to each group
tapply(cd$value, paste(cd$Brand, cd$Candy), max)

#using data.table
library(data.table)
setDT(cd)[, .(Max=max(value)), by=.(Brand, Candy)]

#using sqldf
library(sqldf)
sqldf("select Brand, Candy, max(value) as Max from cd group by Brand, Candy")
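
The by() and tapply() calls above return a named vector keyed by the pasted "Brand Candy" string. As a hedged alternative sketch, passing the two grouping columns to tapply() as a list gives a Brand-by-Candy matrix instead, and as.data.frame(as.table(...)) converts that matrix back to long form. The data is rebuilt inline so the block is self-contained; the names candy, maxMatrix and maxLong are illustrative and do not come from the original answer.

```r
# Sample data from the question, rebuilt inline (illustrative name: candy)
candy <- data.frame(
  Brand = c("Nestle", "Nestle", "Nestle", "Hershey's", "Hershey's", "Hershey's", "Mars",
            "Nestle", "Nestle", "Hershey's", "Hershey's", "Hershey's", "Mars", "Mars"),
  Candy = c("A", "B", "A", "A", "B", "B", "B", "B", "A", "B", "B", "A", "A", "B"),
  value = c(12, 34, 32, 55, 14, 19, 24, 26, 28, 23, 23, 65, 23, 34),
  stringsAsFactors = FALSE
)

# A list of grouping vectors yields one array dimension per vector:
# rows are brands, columns are candy types
maxMatrix <- tapply(candy$value, list(Brand = candy$Brand, Candy = candy$Candy), max)

# Convert the matrix back to one row per Brand/Candy pair
maxLong <- as.data.frame(as.table(maxMatrix), responseName = "Max",
                         stringsAsFactors = FALSE)

maxMatrix
maxLong
```

The matrix form is convenient for lookups such as maxMatrix["Mars", "B"], while the long form matches the shape asked for in the question.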

【Solution 4】:

My answer is far less elegant than the dplyr ones, but here is a solution that uses only base R.

# Split the data into one data frame per brand
splittedData <- split(candyData, candyData$Brand)
# Result frame; its single NA placeholder row is overwritten by the first insert
resultDf <- data.frame(matrix(ncol = 3))
colnames(resultDf) <- c("Brand", "Candy", "maxValue")
insertIndex <- 1
for (dfIndex in seq_along(splittedData)) {
  tempDf <- splittedData[[dfIndex]]
  # Distinct candy types that occur within this brand
  tableDf <- data.frame(table(tempDf$Candy))
  tableDf[, 1] <- as.character(tableDf[, 1])
  for (i in seq_len(nrow(tableDf))) {
    resultDf[insertIndex, 1] <- tempDf$Brand[1]
    resultDf[insertIndex, 2] <- tableDf[i, 1]
    # Maximum value for this brand/candy pair
    resultDf[insertIndex, 3] <- max(tempDf$value[tempDf$Candy == tableDf[i, 1]])
    insertIndex <- insertIndex + 1
  }
}

The output is a new data frame, resultDf:

  Brand     Candy maxValue
1 Hershey's     A       65
2 Hershey's     B       23
3      Mars     A       23
4      Mars     B       34
5    Nestle     A       32
6    Nestle     B       34
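
The same split-then-take-the-max idea can be written without explicit index bookkeeping. The following is only a sketch of one possible loop-free variant, not the answerer's code: it rebuilds the sample data inline, and the names candy, perBrand, brandDf and maxima are illustrative.

```r
# Sample data from the question, rebuilt inline (illustrative name: candy)
candy <- data.frame(
  Brand = c("Nestle", "Nestle", "Nestle", "Hershey's", "Hershey's", "Hershey's", "Mars",
            "Nestle", "Nestle", "Hershey's", "Hershey's", "Hershey's", "Mars", "Mars"),
  Candy = c("A", "B", "A", "A", "B", "B", "B", "B", "A", "B", "B", "A", "A", "B"),
  value = c(12, 34, 32, 55, 14, 19, 24, 26, 28, 23, 23, 65, 23, 34),
  stringsAsFactors = FALSE
)

# For each brand, split its values by candy type and take the maximum of each piece
perBrand <- lapply(split(candy, candy$Brand), function(brandDf) {
  maxima <- vapply(split(brandDf$value, brandDf$Candy), max, numeric(1))
  data.frame(Brand = brandDf$Brand[1],
             Candy = names(maxima),
             maxValue = unname(maxima),
             stringsAsFactors = FALSE)
})

# Stack the per-brand pieces into one data frame and drop the generated row names
resultDf <- do.call(rbind, perBrand)
rownames(resultDf) <- NULL
resultDf
```

Building each brand's rows in one step avoids growing resultDf one cell at a time, which tends to matter more as the number of groups increases.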

【Solution 5】:

Using the sample data provided and data.table:

library(data.table)
setDT(candyData)
candyData[,.(Max = max(value)), keyby = .(Brand,Candy)]

which gives

       Brand Candy Max
1: Hershey's     A  65
2: Hershey's     B  23
3:      Mars     A  23
4:      Mars     B  34
5:    Nestle     A  32
6:    Nestle     B  34
