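【Question title】: Merge data frames based on multiple columns and thresholds
【Posted】: 2019-11-05 16:49:35
【Question】:

I have two data.frames with multiple common columns (here: date, city, ctry, and (other_)number).

I would now like to merge them on the above columns, but tolerate some level of difference: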

threshold.numbers <- 3
threshold.date <- 5  # in days

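If the difference between the date entries is > threshold.date (in days), or the difference between the numbers is > threshold.numbers, I don't want the rows to be merged. Similarly, if the entry in city is a substring of the other df's entry in the city column, I want the rows to be merged. [If anyone has a better idea for testing the similarity of actual city names, I'd be happy to hear it.] The merged row should keep the first df's date, city and ctry entries, but both (other_)number columns and all other columns of both dfs.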
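On the open question about city-name similarity: one possible complement to the substring test is an edit distance. The sketch below uses base R's `adist()` (package utils) and scales the distance by the longer name. The helper name `city_similarity` is made up for illustration, and any cutoff applied to its result would be an assumption to tune on real data.

```r
# Hypothetical helper: similarity in [0, 1] derived from the generalized
# Levenshtein distance returned by utils::adist().
city_similarity <- function(a, b) {
  a <- tolower(as.character(a))
  b <- tolower(as.character(b))
  # adist() returns a matrix, so compare element by element
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

city_similarity(c("Berlin", "Paris", "Bern"),
                c("Berlin", "East-Paris", "Zurich"))
```

Note that "Paris" versus "East-Paris" only scores 0.5 here, so an edit distance catches misspellings rather than prefixes such as "East-" or "near "; the two tests answer different questions.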

Consider the following example:

df1 <- data.frame(date = c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17",
                           "1999-06-30", "1999-03-16", "1999-07-16", "2001-08-29", "2002-07-30"),
                  city = c("Berlin", "Paris", "London", "Rome", "Bern",
                           "Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"),
                  ctry = c("Germany", "France", "UK", "Italy", "Switzerland",
                           "Denmark", "Poland", "Russia", "Tunisia", "Austria"),
                  number = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
                  col = c("apple", "banana", "pear", "banana", "lemon", "cucumber", "apple", "peach", "cherry", "cherry"))


df2 <- data.frame(date = c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17", # all identical to df1
                           "1999-06-29", "1999-03-14", "1999-07-17", # all 1-2 days different
                           "2000-01-29", "2002-07-01"), # all very different (> 2 weeks)
                  city = c("Berlin", "East-Paris", "near London", "Rome", # same or slight differences
                           "Zurich", # completely different
                           "Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"), # same
                  ctry = c("Germany", "France", "UK", "Italy", "Switzerland", # all the same 
                           "Denmark", "Poland", "Russia", "Tunisia", "Austria"),
                  other_number = c(13, 17, 3100, 45, 51, 61, 780, 85, 90, 101), # slightly different to very different
                  other_col = c("yellow", "green", "blue", "red", "purple", "orange", "blue", "red", "black", "beige"))

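Now, I would like to merge the data.frames and receive a df in which rows are merged if the above conditions are met.

(The first column is only for convenience: the leading number indicates the original row, and the character after it shows whether the row was merged (.) or came from df1 (1) or df2 (2).)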

          date        city        ctry number       col other_number     other_col          #comment
 1.  2003-08-29      Berlin     Germany     10     apple              13        yellow      # matched on date, city, number
 2.  1999-06-12       Paris      France     20    banana              17         green      # matched on date, city similar, number - other_number == threshold.numbers
 31  2000-08-29      London          UK     30      pear            <NA>          <NA>      # not matched: number - other_number > threshold.numbers
 32  2000-08-29 near London         UK    <NA>      <NA>            3100          blue      #
 41  1999-02-24        Rome       Italy     40    banana            <NA>          <NA>      # not matched: number - other_number > threshold.numbers
 42  1999-02-24        Rome       Italy   <NA>      <NA>              45           red      #
 51  2001-04-17        Bern Switzerland     50     lemon            <NA>          <NA>      # not matched: cities different (dates okay, numbers okay)
 52  2001-04-17      Zurich Switzerland   <NA>      <NA>              51        purple      #
 6.  1999-06-30  Copenhagen     Denmark     60  cucumber              61        orange      # matched: date difference < threshold.date (cities okay, numbers okay)
 71  1999-03-16      Warsaw      Poland     70     apple            <NA>          <NA>      # not matched: number - other_number > threshold.numbers (dates okay)
 72  1999-03-14      Warsaw      Poland   <NA>      <NA>             780          blue      # 
 81  1999-07-16      Moscow      Russia     80     peach            <NA>          <NA>      # not matched: number - other_number > threshold.numbers (dates okay)
 82  1999-07-17      Moscow      Russia   <NA>      <NA>              85           red      #
 91  2001-08-29       Tunis     Tunisia     90    cherry            <NA>          <NA>      # not matched: date difference > threshold.date (cities okay, numbers okay)
 92  2000-01-29       Tunis     Tunisia   <NA>      <NA>              90         black      #
101  2002-07-30      Vienna     Austria    100    cherry            <NA>          <NA>      # not matched: date difference > threshold.date (cities okay, numbers okay)
102  2002-07-01      Vienna     Austria   <NA>      <NA>             101         beige      #

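I tried different ways of merging but could not implement the thresholds.

EDIT: Apologies for the unclear wording. I would like to keep all rows and receive an indicator of whether a row is matched, unmatched and from df1, or unmatched and from df2.

The pseudocode is: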

  if there is a case where abs("date_df2" - "date_df1") <= threshold.date:
    if "ctry_df2" == "ctry_df1":
      if "city_df2" ~ "city_df1":
        if abs("number_df2" - "number_df1") <= threshold.numbers:
          merge and go to next row in df2
  else:
    add row to df1
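The pseudocode above can be read as a predicate on one pair of rows. A minimal sketch in R, assuming that `~` means "one city name contains the other"; the helper name `rows_match` and the two single-row data frames are illustrative only:

```r
threshold.numbers <- 3
threshold.date <- 5  # in days

# Hypothetical helper: TRUE if row r1 (from df1) and row r2 (from df2)
# satisfy every condition of the pseudocode.
rows_match <- function(r1, r2) {
  c1 <- as.character(r1$city)
  c2 <- as.character(r2$city)
  date_ok <- abs(as.numeric(as.Date(r2$date) - as.Date(r1$date))) <= threshold.date
  ctry_ok <- as.character(r2$ctry) == as.character(r1$ctry)
  # literal (non-regex) containment test in both directions
  city_ok <- grepl(c1, c2, fixed = TRUE) || grepl(c2, c1, fixed = TRUE)
  num_ok  <- abs(r2$other_number - r1$number) <= threshold.numbers
  date_ok && ctry_ok && city_ok && num_ok
}

a <- data.frame(date = "1999-06-12", city = "Paris", ctry = "France", number = 20)
b <- data.frame(date = "1999-06-12", city = "East-Paris", ctry = "France", other_number = 17)
rows_match(a, b)
```

Applying such a predicate to every pair of rows is quadratic in the number of rows, which is why the answers below first narrow the candidates (for example by merging on ctry).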

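【Comments】:

  • Is the last data frame you printed the output you want to obtain? That is, should there be 17 rows at the end? Or only the 3 marked with a .?
  • I would actually like to keep all rows, but with an indicator if they were matched. Sorry if that was unclear; I edited the question accordingly.
  • So this means you want 10 rows, as in the original?
  • I added pseudocode to make it clearer; does this help?
  • If data.frame is not your only option, I strongly recommend data.table.

Tags: r dataframe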


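【Solution 1】:

I first convert the city names to character vectors, since (if I understand correctly) you want to match city names that are contained in those of df2.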

df1$city<-as.character(df1$city)
df2$city<-as.character(df2$city)

Then merge by country:

df = merge(df1, df2, by = ("ctry"))

> df
          ctry     date.x     city.x number      col     date.y      city.y other_number other_col
1      Austria 2002-07-30     Vienna    100   cherry 2002-07-01      Vienna          101     beige
2      Denmark 1999-06-30 Copenhagen     60 cucumber 1999-06-29  Copenhagen           61    orange
3       France 1999-06-12      Paris     20   banana 1999-06-12  East-Paris           17     green
4      Germany 2003-08-29     Berlin     10    apple 2003-08-29      Berlin           13    yellow
5        Italy 1999-02-24       Rome     40   banana 1999-02-24        Rome           45       red
6       Poland 1999-03-16     Warsaw     70    apple 1999-03-14      Warsaw          780      blue
7       Russia 1999-07-16     Moscow     80    peach 1999-07-17      Moscow           85       red
8  Switzerland 2001-04-17       Bern     50    lemon 2001-04-17      Zurich           51    purple
9      Tunisia 2001-08-29      Tunis     90   cherry 2000-01-29       Tunis           90     black
10          UK 2000-08-29     London     30     pear 2000-08-29 near London         3100      blue

The stringr library lets you check whether city.x is contained in city.y (see the last column):

library(stringr)
df$city_keep<-str_detect(df$city.y,df$city.x) # this returns logical vector if city.x is contained in city.y (works one way)
> df
          ctry     date.x     city.x number      col     date.y      city.y other_number other_col city_keep
1      Austria 2002-07-30     Vienna    100   cherry 2002-07-01      Vienna          101     beige      TRUE
2      Denmark 1999-06-30 Copenhagen     60 cucumber 1999-06-29  Copenhagen           61    orange      TRUE
3       France 1999-06-12      Paris     20   banana 1999-06-12  East-Paris           17     green      TRUE
4      Germany 2003-08-29     Berlin     10    apple 2003-08-29      Berlin           13    yellow      TRUE
5        Italy 1999-02-24       Rome     40   banana 1999-02-24        Rome           45       red      TRUE
6       Poland 1999-03-16     Warsaw     70    apple 1999-03-14      Warsaw          780      blue      TRUE
7       Russia 1999-07-16     Moscow     80    peach 1999-07-17      Moscow           85       red      TRUE
8  Switzerland 2001-04-17       Bern     50    lemon 2001-04-17      Zurich           51    purple     FALSE
9      Tunisia 2001-08-29      Tunis     90   cherry 2000-01-29       Tunis           90     black      TRUE
10          UK 2000-08-29     London     30     pear 2000-08-29 near London         3100      blue      TRUE
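`str_detect()` interprets its second argument as a regular expression and, as the code comment says, only tests one direction. A minimal base-R sketch of a literal, two-way containment test follows; the helper name `city_contains` and the sample names (including the made-up "St.Gallen"/"StXGallen" pair) are assumptions for illustration.

```r
# Hypothetical helper: TRUE where one name literally contains the other.
city_contains <- function(x, y) {
  x <- as.character(x)
  y <- as.character(y)
  # fixed = TRUE treats the pattern as plain text, so "." is not a wildcard
  one_way <- function(p, s) {
    mapply(grepl, p, s, MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
  }
  one_way(x, y) | one_way(y, x)
}

city_contains(c("Paris", "near London", "Bern", "St.Gallen"),
              c("East-Paris", "London", "Zurich", "StXGallen"))
```

The last pair would count as a match under regular-expression semantics, because `.` matches any character; with `fixed = TRUE` it does not.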

Then you can get the difference in days between the dates (subtracting Date objects rather than comparing $yday, which ignores the year):

df$dayDiff<-abs(as.numeric(as.Date(df$date.x) - as.Date(df$date.y)))

And the difference between the numbers:

df$numDiff<-abs(df$number - df$other_number)

The resulting data frame looks like this:

> df
          ctry     date.x     city.x number      col     date.y      city.y other_number other_col city_keep dayDiff numDiff
1      Austria 2002-07-30     Vienna    100   cherry 2002-07-01      Vienna          101     beige      TRUE      29       1
2      Denmark 1999-06-30 Copenhagen     60 cucumber 1999-06-29  Copenhagen           61    orange      TRUE       1       1
3       France 1999-06-12      Paris     20   banana 1999-06-12  East-Paris           17     green      TRUE       0       3
4      Germany 2003-08-29     Berlin     10    apple 2003-08-29      Berlin           13    yellow      TRUE       0       3
5        Italy 1999-02-24       Rome     40   banana 1999-02-24        Rome           45       red      TRUE       0       5
6       Poland 1999-03-16     Warsaw     70    apple 1999-03-14      Warsaw          780      blue      TRUE       2     710
7       Russia 1999-07-16     Moscow     80    peach 1999-07-17      Moscow           85       red      TRUE       1       5
8  Switzerland 2001-04-17       Bern     50    lemon 2001-04-17      Zurich           51    purple     FALSE       0       1
9      Tunisia 2001-08-29      Tunis     90   cherry 2000-01-29       Tunis           90     black      TRUE     578       0
10          UK 2000-08-29     London     30     pear 2000-08-29 near London         3100      blue      TRUE       0    3070
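As a sanity check on day differences that span more than one year: subtracting `Date` objects counts whole calendar days, whereas comparing `$yday` values only compares positions within the year. A small sketch using the two dates of the Tunisia row (the variable names are illustrative):

```r
d1 <- as.Date("2001-08-29")
d2 <- as.Date("2000-01-29")

# Calendar difference in days
cal_diff <- abs(as.numeric(d1 - d2))

# Day-of-year difference: the year component is ignored
yday_diff <- abs(as.POSIXlt(d1)$yday - as.POSIXlt(d2)$yday)

cal_diff   # 578
yday_diff  # 212
```

Both values exceed the 5-day threshold in this example, but two dates exactly one year apart would give a day-of-year difference of 0 and pass the filter by mistake.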

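But we want to drop the rows where city.x was not found in city.y, where the day difference is greater than 5, or where the number difference is greater than 3: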

df<-df[df$dayDiff<=5 & df$numDiff<=3 & df$city_keep==TRUE,]

> df
     ctry     date.x     city.x number      col     date.y     city.y other_number other_col city_keep dayDiff numDiff
2 Denmark 1999-06-30 Copenhagen     60 cucumber 1999-06-29 Copenhagen           61    orange      TRUE       1       1
3  France 1999-06-12      Paris     20   banana 1999-06-12 East-Paris           17     green      TRUE       0       3
4 Germany 2003-08-29     Berlin     10    apple 2003-08-29     Berlin           13    yellow      TRUE       0       3

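What remains are the three rows that carry a dot in the first column of the question's table.

Now we can drop the three columns we created, along with the date and city columns that came from df2: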

> df<-subset(df, select=-c(city.y, date.y, city_keep, dayDiff, numDiff))
> df
     ctry     date.x     city.x number      col other_number other_col
2 Denmark 1999-06-30 Copenhagen     60 cucumber           61    orange
3  France 1999-06-12      Paris     20   banana           17     green
4 Germany 2003-08-29     Berlin     10    apple           13    yellow

【Comments】:

    【Solution 2】:

    Step 1: Merge the data on "city" and "ctry":

    df = merge(df1, df2, by = c("city", "ctry"))
    

    Step 2: Remove rows if the difference between the date entries is > threshold.date (in days):

    date_diff = abs(as.numeric(difftime(strptime(df$date.x, format = "%Y-%m-%d"),
                                        strptime(df$date.y, format = "%Y-%m-%d"), units="days")))
    index_remove = date_diff > threshold.date
    df = df[!index_remove,]  # negate the logical index; -index_remove would be read as row numbers
    

    Step 3: Remove rows if the difference between the numbers is > threshold.numbers:

    number_diff = abs(df$number - df$other_number) 
    index_remove = number_diff > threshold.numbers
    df = df[!index_remove,]  # negate the logical index; -index_remove would be read as row numbers
    

    The data should be merged before the conditions are applied, in case the rows do not line up.
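When rows are dropped with a logical vector, the vector has to be negated with `!`; a unary minus turns it into the numbers 0 and -1, which R then treats as row positions. A small sketch on made-up data (the names `x`, `remove`, `kept` and `wrong` are illustrative):

```r
x <- data.frame(id = 1:4, diff = c(0, 9, 2, 7))
remove <- x$diff > 5      # FALSE TRUE FALSE TRUE

kept  <- x[!remove, ]     # keeps ids 1 and 3, as intended
wrong <- x[-remove, ]     # -remove is c(0, -1, 0, -1): only row 1 is dropped

kept$id
wrong$id
```

If no element of `remove` were TRUE, `x[-remove, ]` would index with zeros only and return an empty data frame.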

    【Comments】:

      【Solution 3】:

      An option using data.table (explanations inline):

      library(data.table)
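      setDT(df1)
      setDT(df2)
      #dates must be of class Date for the +/- threshold.date arithmetic below
      df1[, date := as.Date(date)]
      df2[, date := as.Date(date)]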
      
      #dupe columns and create ranges for non-equi joins
      df1[, c("n", "ln", "un", "d", "ld", "ud") := .(
          number, number - threshold.numbers, number + threshold.numbers,
          date, date - threshold.date, date + threshold.date)]
      df2[, c("n", "ln", "un", "d", "ld", "ud") := .(
          other_number, other_number - threshold.numbers, other_number + threshold.numbers,
          date, date - threshold.date, date + threshold.date)]
      
      #perform non-equi join using ctry, num, dates in both ways
      res <- rbindlist(list(
          df1[df2, on=.(ctry, n>=ln, n<=un, d>=ld, d<=ud),
              .(date1=x.date, date2=i.date, city1=x.city, city2=i.city, ctry1=x.ctry, ctry2=i.ctry, number, col, other_number, other_col)],
          df2[df1, on=.(ctry, n>=ln, n<=un, d>=ld, d<=ud),
              .(date1=i.date, date2=x.date, city1=i.city, city2=x.city, ctry1=i.ctry, ctry2=x.ctry, number, col, other_number, other_col)]),
          use.names=TRUE, fill=TRUE)
      
      #determine if cities are substrings of one and another
      res[, city_match := {
          i <- mapply(grepl, city1, city2) | mapply(grepl, city2, city1)
          replace(i, is.na(i), TRUE)
      }]
      
      #just like SQL coalesce (there is a version in dev in rdatatable github)
      coalesce <- function(...) Reduce(function(x, y) fifelse(!is.na(y), y, x), list(...))
      
      #for rows that are matching or no matches to be found
      ans1 <- unique(res[(city_match), .(date=coalesce(date1, date2),
          city=coalesce(city1, city2),
          ctry=coalesce(ctry1, ctry2),
          number, col, other_number, other_col)])
      
      #for rows that are close in terms of dates and numbers but are diff cities
      ans2 <- res[(!city_match), .(date=c(.BY$date1, .BY$date2),
              city=c(.BY$city1, .BY$city2),
              ctry=c(.BY$ctry1, .BY$ctry2),
              number=c(.BY$number, NA),
              col=c(.BY$col, NA),
              other_number=c(NA, .BY$other_number),
              other_col=c(NA, .BY$other_col)),
          names(res)][, seq_along(names(res)) := NULL]
      
      #final desired output
      setorder(rbindlist(list(ans1, ans2)), date, city, number, na.last=TRUE)[]
      

      Output:

                date        city        ctry number      col other_number other_col
       1: 1999-02-24        Rome       Italy     40   banana           NA      <NA>
       2: 1999-02-24        Rome       Italy     NA     <NA>           45       red
       3: 1999-03-14      Warsaw      Poland     NA     <NA>          780      blue
       4: 1999-03-16      Warsaw      Poland     70    apple           NA      <NA>
       5: 1999-06-12  East-Paris      France     20   banana           17     green
       6: 1999-06-29  Copenhagen     Denmark     60 cucumber           61    orange
       7: 1999-07-16      Moscow      Russia     80    peach           NA      <NA>
       8: 1999-07-17      Moscow      Russia     NA     <NA>           85       red
       9: 2000-01-29       Tunis     Tunisia     NA     <NA>           90     black
      10: 2000-08-29      London          UK     30     pear           NA      <NA>
      11: 2000-08-29 near London          UK     NA     <NA>         3100      blue
      12: 2001-04-17        Bern Switzerland     50    lemon           NA      <NA>
      13: 2001-04-17      Zurich Switzerland     NA     <NA>           51    purple
      14: 2001-08-29       Tunis     Tunisia     90   cherry           NA      <NA>
      15: 2002-07-01      Vienna     Austria     NA     <NA>          101     beige
      16: 2002-07-30      Vienna     Austria    100   cherry           NA      <NA>
      17: 2003-08-29      Berlin     Germany     10    apple           13    yellow
      

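      【Comments】:

        【Solution 4】:

        You can test the city match with grepl, and ctry simply with ==. For the rows that match up to this point, you can compute the date difference by converting to dates with as.Date and comparing against a difftime. The difference in number is handled the same way.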

        i1 <- seq_len(nrow(df1)) #Store all rows 
        i2 <- seq_len(nrow(df2))
        res <- do.call(rbind, sapply(seq_len(nrow(df1)), function(i) { #Loop over all rows in df1
          t1 <- which(df1$ctry[i] == df2$ctry) #Match ctry
          t2 <- grepl(df1$city[i], df2$city[t1]) | sapply(df2$city[t1], grepl, df1$city[i]) #Match city
          t1 <- t1[t2 & abs(as.Date(df1$date[i]) - as.Date(df2$date[t1[t2]])) <=
            as.difftime(threshold.date, units = "days") & #Test for date difference
            abs(df1$number[i] - df2$other_number[t1[t2]]) <= threshold.numbers] #Test for number difference
          if(length(t1) > 0) { #Match found
            i1 <<- i1[i1!=i] #Remove row as it was found
            i2 <<- i2[i2!=t1]
            cbind(df1[i,], df2[t1,c("other_number","other_col")], match=".") 
          }
        }))
        rbind(res
            , cbind(df1[i1,], other_number=NA, other_col=NA, match="1")
            , cbind(df2[i2,1:3], number=NA, col=NA, other_number=df2[i2,4]
                    , other_col=df2[i2,5], match="2"))
        #          date        city        ctry number      col other_number other_col match
        #1   2003-08-29      Berlin     Germany     10    apple           13    yellow     .
        #2   1999-06-12       Paris      France     20   banana           17     green     .
        #6   1999-06-30  Copenhagen     Denmark     60 cucumber           61    orange     .
        #3   2000-08-29      London          UK     30     pear           NA      <NA>     1
        #4   1999-02-24        Rome       Italy     40   banana           NA      <NA>     1
        #5   2001-04-17        Bern Switzerland     50    lemon           NA      <NA>     1
        #7   1999-03-16      Warsaw      Poland     70    apple           NA      <NA>     1
        #8   1999-07-16      Moscow      Russia     80    peach           NA      <NA>     1
        #9   2001-08-29       Tunis     Tunisia     90   cherry           NA      <NA>     1
        #10  2002-07-30      Vienna     Austria    100   cherry           NA      <NA>     1
        #31  2000-08-29 near London          UK     NA     <NA>         3100      blue     2
        #41  1999-02-24        Rome       Italy     NA     <NA>           45       red     2
        #51  2001-04-17      Zurich Switzerland     NA     <NA>           51    purple     2
        #71  1999-03-14      Warsaw      Poland     NA     <NA>          780      blue     2
        #81  1999-07-17      Moscow      Russia     NA     <NA>           85       red     2
        #91  2000-01-29       Tunis     Tunisia     NA     <NA>           90     black     2
        #101 2002-07-01      Vienna     Austria     NA     <NA>          101     beige     2
        

        【Comments】:

          【Solution 5】:

          We can use {powerjoin}:

          library(powerjoin)
          
          power_full_join(
            df1, 
            df2, 
            by = ~ 
                # join if one city name contains the other
              (mapply(grepl, .x$city, .y$city) | mapply(grepl, .y$city, .x$city)) &
                # and dates are close enough
                abs(difftime(.x$date, .y$date, units = "days")) <= threshold.date &
                # and numbers are close enough
                abs(.x$number - .y$other_number) <= threshold.numbers,
            conflict = dplyr::coalesce)
          
          #>    number      col other_number other_col       date        city        ctry
          #> 1      10    apple           13    yellow 2003-08-29      Berlin     Germany
          #> 2      20   banana           17     green 1999-06-12       Paris      France
          #> 3      60 cucumber           61    orange 1999-06-30  Copenhagen     Denmark
          #> 4      30     pear           NA      <NA> 2000-08-29      London          UK
          #> 5      40   banana           NA      <NA> 1999-02-24        Rome       Italy
          #> 6      50    lemon           NA      <NA> 2001-04-17        Bern Switzerland
          #> 7      70    apple           NA      <NA> 1999-03-16      Warsaw      Poland
          #> 8      80    peach           NA      <NA> 1999-07-16      Moscow      Russia
          #> 9      90   cherry           NA      <NA> 2001-08-29       Tunis     Tunisia
          #> 10    100   cherry           NA      <NA> 2002-07-30      Vienna     Austria
          #> 11     NA     <NA>         3100      blue 2000-08-29 near London          UK
          #> 12     NA     <NA>           45       red 1999-02-24        Rome       Italy
          #> 13     NA     <NA>           51    purple 2001-04-17      Zurich Switzerland
          #> 14     NA     <NA>          780      blue 1999-03-14      Warsaw      Poland
          #> 15     NA     <NA>           85       red 1999-07-17      Moscow      Russia
          #> 16     NA     <NA>           90     black 2000-01-29       Tunis     Tunisia
          #> 17     NA     <NA>          101     beige 2002-07-01      Vienna     Austria
          

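          Created on 2022-04-14 by the reprex package (v2.0.1)

          【Comments】:

            【Solution 6】:

            Here is a flexible approach that lets you specify any collection of merge criteria you choose.

            Prep work

            I made sure that all strings in df1 and df2 are character strings, not factors (as noted in several of the other answers). I also wrapped the dates in as.Date to make them real dates.

            Specifying merge criteria

            Create a list of lists. Each element of the main list is one criterion; the members of a criterion are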

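            • final.col.name: the name of the column we want in the final table
            • col.name.1: the name of the column in df1
            • col.name.2: the name of the column in df2
            • exact: boolean; should we do exact matching on this column?
            • threshold: the threshold (if we are not doing exact matching)
            • match.function: a function that returns whether or not the rows match (for special cases such as using grepl for string matching; note that this function must be vectorized)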
            merge.criteria = list(
              list(final.col.name = "date",
                   col.name.1 = "date",
                   col.name.2 = "date",
                   exact = F,
                   threshold = 5),
              list(final.col.name = "city",
                   col.name.1 = "city",
                   col.name.2 = "city",
                   exact = F,
                   match.function = function(x, y) {
                     return(mapply(grepl, x, y) |
                              mapply(grepl, y, x))
                   }),
              list(final.col.name = "ctry",
                   col.name.1 = "ctry",
                   col.name.2 = "ctry",
                   exact = T),
              list(final.col.name = "number",
                   col.name.1 = "number",
                   col.name.2 = "other_number",
                   exact = F,
                   threshold = 3)
            )
            

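            The merge function

            This function takes three arguments: the two data frames we want to merge, and the list of match criteria. It proceeds as follows:

            1. Iterate over the match criteria and determine which pairs of rows do or do not meet all of them. (Inspired by @GKi's answer, it works with row indices instead of performing a full outer join, which may use less memory for large datasets.)
            2. Create a skeleton data frame containing only the rows we want (merged rows where there was a match, unmerged rows for the unmatched records).
            3. Iterate over the columns of the original data frames and use them to populate the desired columns in the new data frame. (Do this first for the columns that appear in the match criteria, then for any remaining columns.)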
            library(dplyr)
            merge.data.frames = function(df1, df2, merge.criteria) {
              # Create a data frame with all possible pairs of rows from df1 and rows from
              # df2.
              row.decisions = expand.grid(df1.row = 1:nrow(df1), df2.row = 1:nrow(df2))
              # Iterate over the criteria in merge.criteria.  For each criterion, flag row
              # pairs that don't meet the criterion.
              row.decisions$merge = T
              for(criterion in merge.criteria) {
                # If we're looking for an exact match, test for equality.
                if(criterion$exact) {
                  row.decisions$merge = row.decisions$merge &
                    df1[row.decisions$df1.row,criterion$col.name.1] == df2[row.decisions$df2.row,criterion$col.name.2]
                }
                # If we're doing a threshhold test, test for difference.
                else if(!is.null(criterion$threshold)) {
                  row.decisions$merge = row.decisions$merge &
                    abs(df1[row.decisions$df1.row,criterion$col.name.1] - df2[row.decisions$df2.row,criterion$col.name.2]) <= criterion$threshold
                }
                # If the user provided a function, use that.
                else if(!is.null(criterion$match.function)) {
                  row.decisions$merge = row.decisions$merge &
                    criterion$match.function(df1[row.decisions$df1.row,criterion$col.name.1],
                                             df2[row.decisions$df2.row,criterion$col.name.2])
                }
              }
              # Create the new dataframe.  Just row numbers of the source dfs to start.
              new.df = bind_rows(
                # Merged rows.
                row.decisions %>% filter(merge) %>% select(-merge),
                # Rows from df1 only.
                row.decisions %>% group_by(df1.row) %>% summarize(matches = sum(merge)) %>% filter(matches == 0) %>% select(df1.row),
                # Rows from df2 only.
                row.decisions %>% group_by(df2.row) %>% summarize(matches = sum(merge)) %>% filter(matches == 0) %>% select(df2.row)
              )
              # Iterate over the merge criteria and add columns that were used for matching
              # (from df1 if available; otherwise from df2).
              for(criterion in merge.criteria) {
                new.df[criterion$final.col.name] = coalesce(df1[new.df$df1.row,criterion$col.name.1],
                                                            df2[new.df$df2.row,criterion$col.name.2])
              }
              # Now add all the columns from either data frame that weren't used for
              # matching.
              for(other.col in setdiff(colnames(df1),
                                       sapply(merge.criteria, function(x) x$col.name.1))) {
                new.df[other.col] = df1[new.df$df1.row,other.col]
              }
              for(other.col in setdiff(colnames(df2),
                                       sapply(merge.criteria, function(x) x$col.name.2))) {
                new.df[other.col] = df2[new.df$df2.row,other.col]
              }
              # Return the result.
              return(new.df)
            }
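The pair-enumeration idea from the first part of the function can be seen in isolation on toy data. This is only a sketch in base R: the data frames `a` and `b` and the single numeric criterion are made up for illustration.

```r
a <- data.frame(number = c(10, 40, 70))
b <- data.frame(other_number = c(13, 45, 780))

# All 3 x 3 candidate pairs of row indices
pairs <- expand.grid(a.row = seq_len(nrow(a)), b.row = seq_len(nrow(b)))

# Flag the pairs whose numbers differ by at most the threshold
pairs$merge <- abs(a$number[pairs$a.row] - b$other_number[pairs$b.row]) <= 3

matched     <- pairs[pairs$merge, ]                          # pairs to merge
unmatched_a <- setdiff(seq_len(nrow(a)), matched$a.row)      # rows only in a
unmatched_b <- setdiff(seq_len(nrow(b)), matched$b.row)      # rows only in b
```

The grid has nrow(a) * nrow(b) rows, so for large inputs it may be worth restricting the candidate pairs first, for example to rows that share the same ctry.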
            

            Apply the function, and we're done

            df = merge.data.frames(df1, df2, merge.criteria)
            

            【Comments】:
