[Question title]: How to fuzzyjoin several dataframes in one go using IRanges
[Posted]: 2021-02-28 13:40:21
[Question]:

I need to join several dataframes based on inexact matching, which can be done with the fuzzyjoin and IRanges packages:

Data:

df1 <- data.frame(
  line = 1:4,
  start = c(75,100,170,240),
  end = c(100,150,190,300)
)
df2 <- data.frame(
  v2 = c("A","B","C","D","E","F","G","H","I","J","K","F"),
  start = c(0,10,30,90,120,130,154,161,175,199,205,300),
  end = c(10,20,50,110,130,140,160,165,180,250,300,305)
)

df3 <- data.frame(
  v3 = c("a","b","c","d","e","f"),
  start = c(5,90,200,333,1000,1500),
  end = c(75,171,210,400,1001,1600)
)

Here I want to join df2 and df3 to df1 based on the intervals between start and end. What I can do is go step by step, i.e. one join after another:

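library(fuzzyjoin)
library(dplyr)  # provides %>%, select() and rename() used below

# install package "IRanges" (a Bioconductor package):
if (!requireNamespace("BiocManager", quietly = TRUE))
   install.packages("BiocManager")

BiocManager::install("IRanges")
# IRanges only needs to be installed; interval_left_join() calls it internally,
# so attaching BiocManager (or IRanges) with library() is not required here.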

# First join:
df12 <- interval_left_join(x = df1,
                            y = df2,
                            by = c("start", "end")) %>%
  select(-c(start.y, end.y)) %>%
  rename(start = start.x, end = end.x)

# Second join:
df123 <- interval_left_join(x = df12,
                             y = df3,
                             by = c("start", "end")) %>%
  select(-c(start.y, end.y)) %>%
  rename(start = start.x, end = end.x)

Result:

df123  
  line start end v2   v3
1    1    75 100  D    a
2    1    75 100  D    b
3    2   100 150  D    b
4    2   100 150  E    b
5    2   100 150  F    b
6    3   170 190  I    b
7    4   240 300  J <NA>
8    4   240 300  K <NA>
9    4   240 300  F <NA>

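This all works fine, but in my actual data I have many more dataframes to join, and joining them one by one is impractical and error-prone. How can I perform the join for all dataframes in one go?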

Tags: r fuzzyjoin


[Solution 1]:

Put the dataframes in a list and use Reduce to join them.

library(fuzzyjoin)
library(dplyr)

join_two_dataframes <- function(df1, df2) {
  interval_left_join(x = df1,
                     y = df2,
                     by = c("start", "end")) %>%
    select(-c(start.y, end.y)) %>%
    rename(start = start.x, end = end.x)
}

list_df <- list(df1, df2, df3)
Reduce(join_two_dataframes, list_df)

#  line start end v2   v3
#1    1    75 100  D    a
#2    1    75 100  D    b
#3    2   100 150  D    b
#4    2   100 150  E    b
#5    2   100 150  F    b
#6    3   170 190  I    b
#7    4   240 300  J <NA>
#8    4   240 300  K <NA>
#9    4   240 300  F <NA>
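
As a sanity check on the v2 column, the matches above are consistent with a closed-interval overlap test (two intervals match when each one starts no later than the other ends). The sketch below reproduces that test in base R only; the helper name `overlaps` is made up for illustration, and this is an assumption about the matching rule inferred from the output, not fuzzyjoin's actual implementation.

```r
# Same df1 and df2 as in the question
df1 <- data.frame(
  line = 1:4,
  start = c(75, 100, 170, 240),
  end = c(100, 150, 190, 300)
)
df2 <- data.frame(
  v2 = c("A","B","C","D","E","F","G","H","I","J","K","F"),
  start = c(0, 10, 30, 90, 120, 130, 154, 161, 175, 199, 205, 300),
  end = c(10, 20, 50, 110, 130, 140, 160, 165, 180, 250, 300, 305),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where the closed interval [s1, e1]
# shares at least one point with [s2, e2]
overlaps <- function(s1, e1, s2, e2) s1 <= e2 & s2 <= e1

# For each row of df1, collect the v2 labels of all overlapping df2 rows
matches <- lapply(seq_len(nrow(df1)), function(i) {
  df2$v2[overlaps(df1$start[i], df1$end[i], df2$start, df2$end)]
})
names(matches) <- df1$line
matches
```

Note that line 4 (240 to 300) picks up the second F (300 to 305) because the two intervals share the endpoint 300.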

[Comments]:

  • Super cool. What exactly does Reduce do?
  • Reduce first joins df1 with df2, then joins that result with df3, and so on for every dataframe in list_df.
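
To make the order of operations visible, here is a small base R sketch that swaps the join for a toy function which only records what was combined with what. The function `combine` and the string labels are stand-ins invented for illustration; they are not part of the answer's code.

```r
# Toy stand-in for join_two_dataframes: shows what gets combined, and in which order
combine <- function(acc, nxt) paste0("(", acc, " + ", nxt, ")")

labels <- list("df1", "df2", "df3")

# Reduce folds from the left: combine(combine("df1", "df2"), "df3")
result <- Reduce(combine, labels)
print(result)

# accumulate = TRUE keeps every intermediate result, which helps when debugging
# a long chain of joins
steps <- Reduce(combine, labels, accumulate = TRUE)
print(steps)
```

With real dataframes, the first argument of the function is always the running result and the second is the next dataframe in the list, which is why the join function in the answer tidies up the start and end columns after every step.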