【Question title】:Merge two data.tables where all rows in dt2 are combined with each row in dt1
【Posted】:2017-06-29 23:23:49
【Question】:

My data looks like the example below, except that dt1 has 29 million rows and dt2 has only 15 rows (not 15 million).

dt1 <- data.table(ID=1:4,City=c("Charlotte","DC","Salem","Boston"))
dt2 <- data.table(Birds=c("Saker","Peregrine","Barbary","Prarie","Golden","Coopers","Canary","Finch"),BirdType=c("Falcon","Falcon","Falcon","Falcon","Eagle","Hawk","Breakfast","Breakfast"))

These print as:

> dt1
   ID      City
1:  1 Charlotte
2:  2        DC
3:  3     Salem
4:  4    Boston

> dt2
       Birds  BirdType
1:     Saker    Falcon
2: Peregrine    Falcon
3:   Barbary    Falcon
4:    Prarie    Falcon
5:    Golden     Eagle
6:   Coopers      Hawk
7:    Canary Breakfast
8:     Finch Breakfast

I want to merge the two data.tables so that each row of dt1 is combined with every row of dt2, giving a data.table with 32 rows that looks like this:

> dtMerged
    ID      City     Birds  BirdType
 1:  1 Charlotte     Saker    Falcon
 2:  1 Charlotte Peregrine    Falcon
 3:  1 Charlotte   Barbary    Falcon
 4:  1 Charlotte    Prarie    Falcon
 5:  1 Charlotte    Golden     Eagle
 6:  1 Charlotte   Coopers      Hawk
 7:  1 Charlotte    Canary Breakfast
 8:  1 Charlotte     Finch Breakfast
 9:  2        DC     Saker    Falcon
10:  2        DC Peregrine    Falcon
11:  2        DC   Barbary    Falcon
12:  2        DC    Prarie    Falcon
13:  2        DC    Golden     Eagle
14:  2        DC   Coopers      Hawk
15:  2        DC    Canary Breakfast
16:  2        DC     Finch Breakfast
17:  3     Salem     Saker    Falcon
18:  3     Salem Peregrine    Falcon
etc...

Any ideas on how best to achieve this would be greatly appreciated. I am using data.table version 1.10.4 on a Windows 7 PC. Thanks.

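【Comments】:

  • You can use CJ to do a cross join, i.e. CJ(do.call(paste, c(dt1, sep=",")), do.call(paste, c(dt2, sep=",")))[, unlist(lapply(.SD, tstrsplit, split = ","), recursive = FALSE)]
  • Thanks @akrun. A cross join is the way to go.
  • dt1[, as.list(dt2), by=names(dt1)] also seems to work. Oh, or perhaps do it the other way around, since dt2 has far fewer rows in your real use case. Also, if each bird name is unique, you could keep the two smaller tables and build a new table containing only Birds and ID, which saves some memory: CJ(ID = dt1$ID, BirdName = dt2$Birds). You can then look up the city from the ID and the bird type from the name as needed.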
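
The two suggestions in the last comment can be sketched as follows with the dt1 and dt2 from the question. This is only a minimal illustration, not a benchmark. The names wide, pairs and full are made up for this example, and the lookups assume that ID is unique in dt1 and Birds is unique in dt2.

```r
library(data.table)

dt1 <- data.table(ID = 1:4, City = c("Charlotte", "DC", "Salem", "Boston"))
dt2 <- data.table(
  Birds    = c("Saker", "Peregrine", "Barbary", "Prarie",
               "Golden", "Coopers", "Canary", "Finch"),
  BirdType = c("Falcon", "Falcon", "Falcon", "Falcon",
               "Eagle", "Hawk", "Breakfast", "Breakfast")
)

# Option 1: group by every column of dt1 and return all of dt2 per group.
# Assumes the rows of dt1 are distinct, so each row forms its own group.
wide <- dt1[, as.list(dt2), by = names(dt1)]

# Option 2: store only the (ID, BirdName) pairs; sorted = FALSE keeps input order.
pairs <- CJ(ID = dt1$ID, BirdName = dt2$Birds, sorted = FALSE)

# Expand on demand by looking values up in the two small tables.
full <- copy(pairs)
full[, City := dt1$City[match(ID, dt1$ID)]]
full[, BirdType := dt2$BirdType[match(BirdName, dt2$Birds)]]
```

At the sizes given in the question (29 million rows by 15 rows), either form has 435 million rows, so the pairs table may be the more practical one to keep in memory, with the lookups applied only to the subset being worked on.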

Tags: r data.table


【Solution 1】:

As @akrun commented, a cross join looks like one way to solve this. To implement it, I borrowed CJ.dt, a very neat function by @jangorecki from this Stack Overflow post, to get the desired result:

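library(data.table)

# Cross join of two data.tables via a temporary constant column k
CJ.dt = function(X,Y) {
  stopifnot(is.data.table(X),is.data.table(Y))
  k = NULL                 # declare k locally (commonly done to silence R CMD check notes)
  X = X[, c(k=1, .SD)]     # prepend k = 1 to every row of X
  setkey(X, k)
  Y = Y[, c(k=1, .SD)]     # same for Y
  setkey(Y, NULL)          # Y stays unkeyed; its first column (k) is matched to X's key
  X[Y, allow.cartesian=TRUE][, k := NULL][]   # every row matches every row; then drop k
}

new_df <- CJ.dt(dt1, dt2)
setorder(new_df, ID)       # the join result is ordered by dt2 row, so sort by ID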

The full output after reordering looks like this:

> new_df
    ID      City     Birds  BirdType
 1:  1 Charlotte     Saker    Falcon
 2:  1 Charlotte Peregrine    Falcon
 3:  1 Charlotte   Barbary    Falcon
 4:  1 Charlotte    Prarie    Falcon
 5:  1 Charlotte    Golden     Eagle
 6:  1 Charlotte   Coopers      Hawk
 7:  1 Charlotte    Canary Breakfast
 8:  1 Charlotte     Finch Breakfast
 9:  2        DC     Saker    Falcon
10:  2        DC Peregrine    Falcon
11:  2        DC   Barbary    Falcon
12:  2        DC    Prarie    Falcon
13:  2        DC    Golden     Eagle
14:  2        DC   Coopers      Hawk
15:  2        DC    Canary Breakfast
16:  2        DC     Finch Breakfast
17:  3     Salem     Saker    Falcon
18:  3     Salem Peregrine    Falcon
19:  3     Salem   Barbary    Falcon
20:  3     Salem    Prarie    Falcon
21:  3     Salem    Golden     Eagle
22:  3     Salem   Coopers      Hawk
23:  3     Salem    Canary Breakfast
24:  3     Salem     Finch Breakfast
25:  4    Boston     Saker    Falcon
26:  4    Boston Peregrine    Falcon
27:  4    Boston   Barbary    Falcon
28:  4    Boston    Prarie    Falcon
29:  4    Boston    Golden     Eagle
30:  4    Boston   Coopers      Hawk
31:  4    Boston    Canary Breakfast
32:  4    Boston     Finch Breakfast
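
As a cross-check, the same 32 rows can be built without a temporary key column by repeating row indices. The sketch below is an alternative to CJ.dt, not the method from the linked post; the function name cross_join_idx is hypothetical and used only for illustration.

```r
library(data.table)

dt1 <- data.table(ID = 1:4, City = c("Charlotte", "DC", "Salem", "Boston"))
dt2 <- data.table(
  Birds    = c("Saker", "Peregrine", "Barbary", "Prarie",
               "Golden", "Coopers", "Canary", "Finch"),
  BirdType = c("Falcon", "Falcon", "Falcon", "Falcon",
               "Eagle", "Hawk", "Breakfast", "Breakfast")
)

# Hypothetical helper: cross join by row index.
cross_join_idx <- function(X, Y) {
  stopifnot(is.data.table(X), is.data.table(Y))
  i <- rep(seq_len(nrow(X)), each  = nrow(Y))  # each X row repeated nrow(Y) times
  j <- rep(seq_len(nrow(Y)), times = nrow(X))  # Y rows cycled once per X row
  cbind(X[i], Y[j])                            # assumes X and Y share no column names
}

res <- cross_join_idx(dt1, dt2)

# A cross join always has nrow(X) * nrow(Y) rows.
stopifnot(nrow(res) == nrow(dt1) * nrow(dt2))
```

Because the X index varies slowest, the rows already come out grouped by ID, so no setorder call is needed here.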

【Comments】:

  • Many thanks @david-c. This worked perfectly and quickly on my larger dataset.