【Question title】: Bind data frames on longer identifiers in R
【Posted】: 2014-05-28 03:25:03
【Question description】:

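I have two data frames that share unique identifiers, but the number of observations per identifier differs between them. I would like to build a single data frame from both: for each common identifier, take the observations from whichever frame has more observations for it. For example: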

f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1,1,2,3,3,3))
f2 <- data.frame(x = c("a","b", "b", "c", "c"), y = c(4,5,5,6,6))

I would like the merge to be driven by whichever frame has the longer run of each x, so that it produces:

x y 
a 1
a 1
b 5
b 5
c 3
c 3 
c 3

Any and all thoughts would be great.

【Question comments】:

Tags: r merge frame

【Solution 1】:

Here is a solution using split

dd<-rbind(cbind(f1, s="f1"), cbind(f2, s="f2"))

keep<-unsplit(lapply(split(dd$s, dd$x), FUN=function(x) {
    y<-table(x)
    x == names(y[which.max(y)])
}), dd$x)

dd <- dd[keep,]

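Normally I would prefer the ave function here, but because the data type changes from a factor to a logical, I essentially copied the idea behind ave and used split (with unsplit) directly.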
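For comparison, here is a minimal sketch of how ave could still be used if it is applied to numeric row counts rather than to the factor itself. It is not part of the original answer; the names cnt, best and res are made up for illustration.

```r
# Example data from the question
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))

# Stack both frames and tag each row with its source frame
dd <- rbind(cbind(f1, s = "f1"), cbind(f2, s = "f2"))

# Number of rows per (identifier, source) pair, repeated on every row
cnt <- ave(seq_len(nrow(dd)), dd$x, dd$s, FUN = length)

# Largest per-source count for each identifier
best <- ave(cnt, dd$x, FUN = max)

# Keep the rows whose source frame has the most rows for that identifier
res <- dd[cnt == best, c("x", "y")]
res <- res[order(res$x), ]
```

One difference worth noting: when both frames have the same number of rows for an identifier, this sketch keeps the rows from both, whereas the which.max approach keeps only the first source.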

【Comments】:

【Solution 2】:

A `dplyr` solution

library(dplyr)

First we combine the data:

Use rbind() and introduce a new variable, ref, that records which frame each observation came from:

both <- rbind( f1, f2 )
both$ref <- rep( c( "f1", "f2" ) , c( nrow(f1), nrow(f2) ) )

Then count the observations:

Create another new variable holding the number of observations for each combination of ref and x:

both_with_counts <- both %>% 
                         group_by( ref ,x ) %>% 
                         mutate( counts = n() )
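As a quick sanity check on the counting step, a base R cross-tabulation of the stacked data shows the same per-frame, per-identifier row counts. This is only an illustrative sketch, not part of the original answer; the name count_table is made up here.

```r
# Example data from the question
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))

# Stack the frames and record the source of each row, as in the answer
both <- rbind(f1, f2)
both$ref <- rep(c("f1", "f2"), c(nrow(f1), nrow(f2)))

# Rows per (ref, x) combination
# f1: a = 2, b = 1, c = 3; f2: a = 1, b = 2, c = 2
count_table <- table(ref = both$ref, x = both$x)
print(count_table)
```

For each column of this table, the row with the larger value is the frame whose observations should be kept.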

Then filter for the maximum count:

both_with_counts %>% group_by( x ) %>% filter( counts==max(counts) )

Note: you can also keep only the x and y columns with select(x,y)...

This gives:

## Source: local data frame [7 x 4]
## Groups: x
## 
##  x y ref counts
##  1 a 1  f1 2
##  2 a 1  f1 2
##  3 c 3  f1 3
##  4 c 3  f1 3
##  5 c 3  f1 3
##  6 b 5  f2 2
##  7 b 5  f2 2

Now, as a single pipeline:

what_I_want <- 
  rbind(cbind(f1,ref = "f1"),cbind(f2,ref = "f2")) %>%
  group_by(ref,x) %>% 
  mutate(counts = n()) %>%
  group_by( x ) %>% 
  filter( counts==max(counts) ) %>%
  select( x, y )

So:

> what_I_want
# Source: local data frame [7 x 2]
# Groups: x
# 
# x y
# 1 a 1
# 2 a 1
# 3 c 3
# 4 c 3
# 5 c 3
# 6 b 5
# 7 b 5

【Comments】:

【Solution 3】:

Not an elegant answer, but it still gives the desired result. Hope this helps.

f1table <- data.frame(table(f1$x))
colnames(f1table) <- c("x","freq")
f1new <- merge(f1,f1table)

f2table <- data.frame(table(f2$x))
colnames(f2table) <- c("x","freq")
f2new <- merge(f2,f2table)

table <- rbind(f1table, f2table)
table <- table[with(table, order(x,-freq)), ]
table <- table[!duplicated(table$x), ]

data <-rbind(f1new, f2new)
merge(data, table, by=c("x","freq"))[,c(1,3)]
  x y
1 a 1
2 a 1
3 b 5
4 b 5
5 c 3
6 c 3
7 c 3
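The same count-and-compare idea can be sketched more compactly in base R by comparing the two frequency tables directly. This is a hedged alternative, not the answerer's code: pick_longer is a hypothetical helper name, and sending ties to f1 is an assumption, since the question does not say what should happen when the counts are equal.

```r
# Hypothetical helper: for each identifier, keep the rows from whichever
# frame has more rows for it (ties go to f1, an assumption).
pick_longer <- function(f1, f2, id = "x") {
  ids <- union(f1[[id]], f2[[id]])
  # Counts per identifier on a shared set of levels (0 if absent from a frame)
  n1 <- as.vector(table(factor(f1[[id]], levels = ids)))
  n2 <- as.vector(table(factor(f2[[id]], levels = ids)))
  from_f1 <- ids[n1 >= n2]
  out <- rbind(f1[f1[[id]] %in% from_f1, ], f2[!(f2[[id]] %in% from_f1), ])
  out <- out[order(out[[id]]), ]
  rownames(out) <- NULL
  out
}

# Example data from the question
f1 <- data.frame(x = c("a", "a", "b", "c", "c", "c"), y = c(1, 1, 2, 3, 3, 3))
f2 <- data.frame(x = c("a", "b", "b", "c", "c"), y = c(4, 5, 5, 6, 6))

res <- pick_longer(f1, f2)
print(res)
```

Because the counts are computed on the union of identifiers, an identifier that appears in only one frame is still kept. It also avoids naming variables table and data, which shadow base R functions.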

【Comments】:
