【Question title】: data.table join with date
【Posted】: 2017-10-17 08:20:12
【Question】:

Hello, I am trying to extract some ids by matching on group and on a date that falls within a range.

> d1
    id group       Date
 1:  1     A 2017-07-02
 2:  2     A 2017-07-04
 3:  3     A 2017-05-15
 4:  4     A 2017-08-02
 5:  5     B 2017-12-28
 6:  6     B 2015-07-02
 7:  7     B 2012-07-02
 8:  8     B 2018-07-02
 9:  9     C 2017-07-02
10: 10     C 2017-07-02
11: 11     C 2017-07-02
12: 12     C 2017-07-04
13: 13     D 2017-05-15
14: 14     D 2017-08-02
15: 15     D 2017-12-28
16: 16     D 2015-07-02
17: 17     E 2012-07-02
18: 18     E 2018-07-02
19: 19     E 2017-07-02
20: 20     E 2017-07-02

> d2
   group timestamp1 timestamp2
1:     A 2015-07-01 2017-07-20
2:     A 2020-07-12 2017-07-15
3:     B 2017-05-15 2020-05-22

I want the ids from d1 whose group and Date match a group and date range in d2:

   group timestamp1 timestamp2 id
1:     A 2017-07-02 2017-07-02  1
2:     A 2017-07-04 2017-07-04  2
3:     A 2017-05-15 2017-05-15  3
4:     B 2017-12-28 2017-12-28  5
5:     B 2018-07-02 2018-07-02  8

I checked "How to perform join over date ranges using data.table?" and I think it is the solution, but I can't get it to work.

Date, timestamp1 and timestamp2 are POSIXct.

Please help :)

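【Comments】:

  • Except for id == 6, all values in d1 seem to fall within the range of values in d2, so your desired output is not very clear.
  • I want something like group == group & Date >= timestamp1 & Date <= timestamp2, I hope that is clearer.
  • I understand what you want, and you can achieve it with something like d2[d1, on = .(group, timestamp1 <= Date, timestamp2 >= Date), nomatch = 0L, mult = "first"] (assuming you have proper Date classes), but your desired output doesn't make sense.
  • Sorry, I just failed to set up a good date range in my example. It is working, thanks.
  • Then you don't have any matches. Either the date ranges don't overlap or the columns are not of class Date. Try to make a proper reproducible example.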
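
The last comment names two things to check when the join returns nothing: the column classes and the ranges themselves. A minimal sketch of both checks, assuming d2 is rebuilt by hand from the question's printout (the names col_classes and inverted are only illustrative, and tz = "UTC" is an assumption made for reproducibility):

```r
library(data.table)

# d2 as printed in the question, rebuilt by hand (tz = "UTC" is an assumption)
d2 <- data.table(
  group      = c("A", "A", "B"),
  timestamp1 = as.POSIXct(c("2015-07-01", "2020-07-12", "2017-05-15"), tz = "UTC"),
  timestamp2 = as.POSIXct(c("2017-07-20", "2017-07-15", "2020-05-22"), tz = "UTC")
)

# First class of every column; the two range columns should be date-time classes
col_classes <- sapply(d2, function(x) class(x)[1L])
print(col_classes)

# Rows whose range is inverted (start after end) can never match any date
inverted <- d2[timestamp1 > timestamp2]
print(inverted)
```

With the question's data, the second A row starts in 2020 and ends in 2017, so it cannot contribute any match.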

Tags: r date join data.table


【Solution 1】:

The OP has requested a non-equi inner join using data.table:

library(data.table)
d2[d1, on = .(group, timestamp1 <= Date, timestamp2 >= Date), nomatch = 0L]
   group timestamp1 timestamp2 id
1:     A 2017-07-02 2017-07-02  1
2:     A 2017-07-04 2017-07-04  2
3:     A 2017-05-15 2017-05-15  3
4:     B 2017-12-28 2017-12-28  5
5:     B 2018-07-02 2018-07-02  8
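
In the result above, timestamp1 and timestamp2 both show the matched Date from d1 rather than the range bounds from d2, because a non-equi join reports the join columns with the values from i. If the original bounds are wanted, one option is to select columns explicitly in j with the x. and i. prefixes. This is a sketch on a hand-built subset of the question's data (groups A and B only, tz = "UTC" assumed), not part of the original answer:

```r
library(data.table)

# Groups A and B of d1 from the question, rebuilt by hand
d1 <- data.table(
  id    = 1:8,
  group = rep(c("A", "B"), each = 4L),
  Date  = as.POSIXct(c("2017-07-02", "2017-07-04", "2017-05-15", "2017-08-02",
                       "2017-12-28", "2015-07-02", "2012-07-02", "2018-07-02"),
                     tz = "UTC")
)
d2 <- data.table(
  group      = c("A", "A", "B"),
  timestamp1 = as.POSIXct(c("2015-07-01", "2020-07-12", "2017-05-15"), tz = "UTC"),
  timestamp2 = as.POSIXct(c("2017-07-20", "2017-07-15", "2020-05-22"), tz = "UTC")
)

# Same non-equi inner join, but j picks the columns explicitly:
# i.Date is the date from d1, x.timestamp1 / x.timestamp2 are the bounds from d2
res <- d2[d1,
          on = .(group, timestamp1 <= Date, timestamp2 >= Date),
          nomatch = 0L,
          .(id, group, Date = i.Date,
            timestamp1 = x.timestamp1, timestamp2 = x.timestamp2)]
print(res)
```

The matched ids are the same as in the output above; only the displayed columns differ.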

Data

library(data.table)
d1 <- fread(
"rn id group       Date
 1:  1     A 2017-07-02
 2:  2     A 2017-07-04
 3:  3     A 2017-05-15
 4:  4     A 2017-08-02
 5:  5     B 2017-12-28
 6:  6     B 2015-07-02
 7:  7     B 2012-07-02
 8:  8     B 2018-07-02
 9:  9     C 2017-07-02
10: 10     C 2017-07-02
11: 11     C 2017-07-02
12: 12     C 2017-07-04
13: 13     D 2017-05-15
14: 14     D 2017-08-02
15: 15     D 2017-12-28
16: 16     D 2015-07-02
17: 17     E 2012-07-02
18: 18     E 2018-07-02
19: 19     E 2017-07-02
20: 20     E 2017-07-02", drop = 1L)[
  , Date := as.POSIXct(Date)]

d2 <- fread(
  "rn    group timestamp1 timestamp2
1:     A 2015-07-01 2017-07-20
2:     A 2020-07-12 2017-07-15
3:     B 2017-05-15 2020-05-22", drop = 1L)
cols = c("timestamp1", "timestamp2")
d2[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]

【Comments】:

  • I just noticed that David Arenburg had already posted the same approach in a comment. Therefore, this answer is posted as community wiki.
【Solution 2】:

Using data.table:

library(data.table)

# Convert the date columns to class Date
df2$timestamp1 <- as.Date(df2$timestamp1, format = "%Y-%m-%d")
df2$timestamp2 <- as.Date(df2$timestamp2, format = "%Y-%m-%d")
df1$Date <- as.Date(df1$Date, format = "%Y-%m-%d")

# Keyed equi-join on group, then keep rows whose Date lies strictly inside the range
df1T <- data.table(df1, key = "group")
df2T <- data.table(df2, key = "group")
df3f <- df1T[df2T]
df3f[df3f$timestamp1 < df3f$Date & df3f$Date < df3f$timestamp2 , ]

   id group       Date timestamp1 timestamp2
1:  1     A 2017-07-02 2015-07-01 2017-07-20
2:  2     A 2017-07-04 2015-07-01 2017-07-20
3:  3     A 2017-05-15 2015-07-01 2017-07-20
4:  5     B 2017-12-28 2017-05-15 2020-05-22
5:  8     B 2018-07-02 2017-05-15 2020-05-22

You can also use a left join and a filter in dplyr, like this:

library(dplyr)

df3 <- df2%>%left_join(df1, by  = "group")%>%
  mutate(timestamp1 = as.Date(timestamp1, format = "%Y-%m-%d"),
         timestamp2 = as.Date(timestamp2, format = "%Y-%m-%d"),
         Date = as.Date(Date, format = "%Y-%m-%d"))%>%
         filter(timestamp1<Date&Date<timestamp2)%>%print()

  group timestamp1 timestamp2 id       Date
1     A 2015-07-01 2017-07-20  1 2017-07-02
2     A 2015-07-01 2017-07-20  2 2017-07-04
3     A 2015-07-01 2017-07-20  3 2017-05-15
4     B 2017-05-15 2020-05-22  5 2017-12-28
5     B 2017-05-15 2020-05-22  8 2018-07-02

【Comments】:

  • Since version 1.9.6 (on CRAN 19 Sep 2015), setting a key is no longer required when joining data.tables if the on argument is used.
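
To illustrate the comment above, here is a minimal sketch of an ad hoc join with on = and no key set. The two small tables and the label column are hypothetical and exist only for this example:

```r
library(data.table)

# Hypothetical toy tables; neither has a key
x <- data.table(id = 1:4, group = c("A", "A", "B", "C"))
y <- data.table(group = c("A", "B"), label = c("first", "second"))

# Join directly on the group column; no setkey() or key = argument needed
res <- x[y, on = "group"]
print(res)

# x is still unkeyed after the join
print(key(x))
```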
【Solution 3】:

Another option, with sqldf:

library(sqldf)

sqldf("select df1.id, df1.Date, df2.* from df1
      inner join df2 on df1.'group' = df2.'group'
      where df1.Date between df2.timestamp1 and df2.timestamp2")

Result:

  id       Date group timestamp1 timestamp2
1  1 2017-07-02     A 2015-07-01 2017-07-20
2  2 2017-07-04     A 2015-07-01 2017-07-20
3  3 2017-05-15     A 2015-07-01 2017-07-20
4  5 2017-12-28     B 2017-05-15 2020-05-22
5  8 2018-07-02     B 2017-05-15 2020-05-22

Data:

df1 = read.table(text = "    id group       Date
                  1:  1     A 2017-07-02
                 2:  2     A 2017-07-04
                 3:  3     A 2017-05-15
                 4:  4     A 2017-08-02
                 5:  5     B 2017-12-28
                 6:  6     B 2015-07-02
                 7:  7     B 2012-07-02
                 8:  8     B 2018-07-02
                 9:  9     C 2017-07-02
                 10: 10     C 2017-07-02
                 11: 11     C 2017-07-02
                 12: 12     C 2017-07-04
                 13: 13     D 2017-05-15
                 14: 14     D 2017-08-02
                 15: 15     D 2017-12-28
                 16: 16     D 2015-07-02
                 17: 17     E 2012-07-02
                 18: 18     E 2018-07-02
                 19: 19     E 2017-07-02
                 20: 20     E 2017-07-02", header = TRUE, row.names = 1)

df2 = read.table(text = "   group timestamp1 timestamp2
1:     A 2015-07-01 2017-07-20
2:     A 2020-07-12 2017-07-15
3:     B 2017-05-15 2020-05-22", header = TRUE, row.names = 1)
