INNER JOIN MAX condition type: answers

【Question title】: INNER JOIN MAX condition type
【Posted】: 2016-01-15 23:29:29
【Question description】:

I have two data frames:

info
Fname  Lname
Henry      H
 Rose      R
Jacob      T
 John      O
 Fred      Y
Simon      S
  Gay      T

and

students
Fname  Lname  Age  Height  Subject Result
Henry      H   12      15 Math;Sci      P
 Rose      R   11      18 Math;Sci      P
Jacob      T   11      15 Math;Sci      P
Henry      H   11      14 Math;Sci      P
 John      O   12      13 Math;Sci      P
 John      O   13      16 Math;Sci      F
 Fred      Y   11      16      Sci      P
Simon      S   12      10 Eng;Math      P
  Gay      T   12      11 Math;Sci      F
 Rose      R   15      18 Math;Sci      P
 Fred      Y   12      16 Math;Sci      P

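I want to do a JOIN that takes every name from info and looks up its related metadata in students, but keeps only the row with the highest Age when several rows share the same Fname and Lname. My output should look like this: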

Final
Fname Lname Age Height  Subject Result
Henry     H  12     15 Math;Sci      P
 Rose     R  15     18 Math;Sci      P
Jacob     T  11     15 Math;Sci      P
 John     O  13     16 Math;Sci      F
 Fred     Y  12     16 Math;Sci      P
Simon     S  12     10 Eng;Math      P
  Gay     T  12     11 Math;Sci      F

I tried sqldf, but have had no luck so far; I just can't get the identifiers right. Is there any other way to get this output?

【Question discussion】:

Possible duplicate of "How to join (merge) data frames (inner, outer, left, right)?"

Tags: r inner-join

【Solution 1】:

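Here is a perhaps less elegant way, using base R.

First, merge the frames on the names (although doing so makes little sense in this example, since info is really just a list of the names that already exist in the students frame).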

merged_df <- merge(students,info,by=c("Fname","Lname"))

Finally, aggregate, grouping here by the names only. You can add any other categorical or factor variables to the grouping.

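merged_df_max <- aggregate(
                merged_df['Age'],
                by=list(Fname = merged_df$Fname,
                        Lname = merged_df$Lname),
                FUN=max, na.rm=TRUE)

## add back details to the merged df
## (only Age is aggregated: taking max of Age and Height separately could
## pair values from different rows, and the merge would then find no match)
result <- merge(merged_df_max,students,by=c("Fname","Lname","Age"))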
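Another base R way to keep, for each name, the whole row with the highest Age is to compare every row's Age with its group maximum. The following is only a minimal sketch, not the answerer's method: it uses ave() on the students data from the question, and the names key, group_max and oldest are illustrative. If two rows of the same name tie on the highest Age, both are kept.

```r
# Rebuild the students table with the values given in the question.
students <- read.csv(text = "Fname,Lname,Age,Height,Subject,Result
Henry,H,12,15,Math;Sci,P
Rose,R,11,18,Math;Sci,P
Jacob,T,11,15,Math;Sci,P
Henry,H,11,14,Math;Sci,P
John,O,12,13,Math;Sci,P
John,O,13,16,Math;Sci,F
Fred,Y,11,16,Sci,P
Simon,S,12,10,Eng;Math,P
Gay,T,12,11,Math;Sci,F
Rose,R,15,18,Math;Sci,P
Fred,Y,12,16,Math;Sci,P
", stringsAsFactors = FALSE)

# One grouping key per person (illustrative helper variable).
key <- paste(students$Fname, students$Lname)

# ave() returns, for every row, the maximum Age within that row's group.
group_max <- ave(students$Age, key, FUN = max)

# Keep the rows whose Age equals their group's maximum.
oldest <- students[students$Age == group_max, ]
print(oldest, row.names = FALSE)
```

Because entire rows are selected, Height, Subject and Result always come from the same record as the maximum Age.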

The students and info data frames used above can be created from inline text like this:

## load data
lines <-"
Fname,Lname,Age,Height,Subject,Result
Henry,H,12,15,Math;Sci,P
Rose,R,11,18,Math;Sci,P
Jacob,T,11,15,Math;Sci,P
Henry,H,11,14,Math;Sci,P
John,O,12,13,Math;Sci,P
John,O,13,16,Math;Sci,F
Fred,Y,11,16,Sci,P
Simon,S,12,10,Eng;Math,P
Gay,T,12,11,Math;Sci,F
Rose,R,15,18,Math;Sci,P
Fred,Y,12,16,Math;Sci,P
"

lines2 <-"
Fname,Lname
Henry,H
Rose,R
Jacob,T
John,O
Fred,Y
Simon,S
Gay,T
"

con <- textConnection(lines)
students <- read.csv(con,sep=',')
con2 <- textConnection(lines2)
info <- read.csv(con2,sep=',')
close(con)
close(con2)

【Discussion】:

Please don't use attach.
@Pascal Sorry! It won't happen again.
Thanks for your help.

【Solution 2】:

Try this:

library(sqldf)
sqldf("select Fname, Lname, max(Age) Age, Height, Subject, Result 
       from info left join students using (Fname, Lname)
       group by Fname, Lname")

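We use a left join in case there are students in info for whom students has no data. In the question, info and students contain the same students, so we could omit the word left from the query and still get the same result. Also note that, because exactly the same set of students appears in both info and students, we don't need info at all. The following query is the same as the last one except for the from line, and it gives the same answer for the data provided: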

sqldf("select Fname, Lname, max(Age) Age, Height, Subject, Result 
       from students
       group by Fname, Lname")
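To see what the left join changes when info holds a name that students lacks, here is a small sketch. It deliberately swaps in base R's merge() rather than sqldf so that it runs without extra packages, and the student "Zed Z" is a made-up example that does not appear in the question.

```r
# Hypothetical data: "Zed Z" is listed in info but has no rows in students.
info <- data.frame(Fname = c("Henry", "Zed"),
                   Lname = c("H", "Z"))
students <- data.frame(Fname  = c("Henry", "Henry"),
                       Lname  = c("H", "H"),
                       Age    = c(12, 11),
                       Height = c(15, 14))

# all.x = TRUE keeps every row of info (left join); unmatched names get NA.
left <- merge(info, students, by = c("Fname", "Lname"), all.x = TRUE)

# The default keeps only names present in both frames (inner join).
inner <- merge(info, students, by = c("Fname", "Lname"))

print(left)
print(inner)
```

With the left join, Zed stays in the result with NA in Age and Height; with the inner join, Zed disappears.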

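Note: for reproducibility, the following constructs the info and students data frames. In the future, please provide this yourself when asking questions on SO.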

Lines_info <- "
Fname  Lname
Henry      H
 Rose      R
Jacob      T
 John      O
 Fred      Y
Simon      S
  Gay      T
"
Lines_students <- "
Fname  Lname  Age  Height  Subject Result
Henry      H   12      15 Math;Sci      P
 Rose      R   11      18 Math;Sci      P
Jacob      T   11      15 Math;Sci      P
Henry      H   11      14 Math;Sci      P
 John      O   12      13 Math;Sci      P
 John      O   13      16 Math;Sci      F
 Fred      Y   11      16      Sci      P
Simon      S   12      10 Eng;Math      P
  Gay      T   12      11 Math;Sci      F
 Rose      R   15      18 Math;Sci      P
 Fred      Y   12      16 Math;Sci      P
"

info <- read.table(text = Lines_info, header = TRUE)
students <- read.table(text = Lines_students, header = TRUE)

【Discussion】:

Thank you very much. I will keep your advice in mind.

【Solution 3】:

Using dplyr:

library(dplyr)

info %>% left_join(students) %>%
    group_by(Fname, Lname) %>%
    filter(Age == max(Age))
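Note that filter(Age == max(Age)) keeps every row tied for the highest Age within a name. If exactly one row per name is needed, one option is slice_max() with with_ties = FALSE. This is a sketch assuming dplyr 1.0.0 or later, and the tied data below is made up for illustration rather than taken from the question.

```r
library(dplyr)

# Hypothetical data: Henry has two records sharing the same highest Age.
students <- data.frame(Fname  = c("Henry", "Henry", "Rose"),
                       Lname  = c("H", "H", "R"),
                       Age    = c(12, 12, 15),
                       Height = c(15, 14, 18))
info <- data.frame(Fname = c("Henry", "Rose"),
                   Lname = c("H", "R"))

joined <- left_join(info, students, by = c("Fname", "Lname"))

# Keeps both of Henry's tied rows.
all_ties <- joined %>%
    group_by(Fname, Lname) %>%
    filter(Age == max(Age)) %>%
    ungroup()

# Keeps a single row per name even when Ages tie.
one_each <- joined %>%
    group_by(Fname, Lname) %>%
    slice_max(Age, n = 1, with_ties = FALSE) %>%
    ungroup()

print(all_ties)
print(one_each)
```

Which of the tied rows survives depends on row order, so sort the data first if a specific tie-break (for example, the tallest) is wanted.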

【Discussion】: