【Question title】: Find overlaps in data table
【Posted】: 2016-10-25 14:14:15
【Question】:

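I have some data with ID, date, and integer values. Each integer is associated with an ID and start-date combination, and each ID has multiple dates.

I want to create an indicator column that:

1) Tells me whether an ID's integers sum to >= 14, or whether the ID has 4 separate integers within a 12-month period.

There is a similar question here, but my categories are a bit more complex: Create new column based on condition that exists within a rolling date

Any help is much appreciated!

Here is a dput of some of the data: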

structure(list(ID = c("90939293", "90963328", "90092983", 
"90032926", "90944838", "90092983", "90062392", "90224939", "90202398", 
"90926203", "90936043", "90329263", "90944838", "90232033", "90980903", 
"90924463", "90299292", "90933383", "90209349", "90092983", "90022988", 
"90022293", "90933383", "90092983", "90299240", "90963033", "90004923", 
"90292998", "90986096", "90980903", "90336692", "90933383", "90022988", 
"90069992", "90062392", "90209248", "90924463", "90092983", "90933383", 
"90022293", "90062392", "90004923", "90233269", "90329263", "90229202", 
"90309943", "90299292", "90036820", "90329263", "90232033", "90329263", 
"90336692", "90963033", "90224939", "90924463", "90069992", "90092983", 
"90934923", "90926203", "90222333", "90092983", "90299292", "90202398", 
"90004923", "90233269", "90926203", "90222333", "90224939", "90232033", 
"90933383", "90022293", "90022988", "90934923", "90069992", "90329263", 
"90209349", "90022293", "90309943", "90299240", "90022293", "90336692", 
"90020334", "90933383", "90290384", "90224939", "90980903", "90299240", 
"90299292", "90202398", "90022346"), Date = structure(c(15972, 
16009, 16010, 16010, 16007, 16010, 16006, 16010, 16007, 16008, 
15997, 16007, 16007, 16002, 16008, 16006, 16006, 16006, 16009, 
16010, 16006, 16006, 16006, 16010, 15995, 16008, 16008, 16010, 
16009, 16008, 16010, 16006, 16006, 16009, 16006, 16006, 16006, 
16010, 16006, 16006, 16006, 16008, 16009, 16007, 16010, 16007, 
16006, 16009, 16007, 16002, 16007, 16010, 16008, 16010, 16006, 
16009, 16010, 15936, 16008, 16008, 16010, 16006, 16007, 16008, 
16009, 16008, 16008, 16010, 16002, 16006, 16006, 16006, 15936, 
16009, 16007, 16009, 16006, 16007, 15995, 16006, 16010, 16006, 
16006, 16010, 16010, 16008, 15995, 16006, 16007, 16008), class = "Date"), 
    Integer = c(39, 2, 1, 1, 4, 1, 5, 1, 4, 3, 14, 4, 4, 9, 
    3, 5, 5, 5, 2, 1, 5, 5, 5, 1, 16, 3, 3, 1, 2, 3, 1, 5, 5, 
    2, 5, 5, 5, 1, 5, 5, 5, 3, 2, 4, 1, 4, 5, 2, 4, 9, 4, 1, 
    3, 1, 5, 2, 1, 75, 3, 3, 1, 5, 4, 3, 2, 3, 3, 1, 9, 5, 5, 
    5, 75, 2, 4, 2, 5, 4, 16, 5, 1, 5, 5, 1, 1, 3, 16, 5, 4, 
    3)), .Names = c("ID", "Date", "Integer"
), row.names = c("200086", "200066", "200050", "200064", "200078", 
"200050.1", "200069", "200082", "200083", "200053", "200056", 
"200055", "200078.1", "200079", "200051", "200089", "200052", 
"200057", "200061", "200050.2", "200060", "200080", "200057.1", 
"200050.3", "200068", "200071", "200070", "200059", "200062", 
"200051.1", "200067", "200057.2", "200060.1", "200072", "200069.1", 
"200073", "200089.1", "200050.4", "200057.3", "200080.1", "200069.2", 
"200070.1", "200081", "200054", "200063", "200075", "200052.1", 
"200074", "200054.1", "200079.1", "200055.1", "200067.1", "200071.1", 
"200082.1", "200089.2", "200072.1", "200050.5", "200084", "200053.1", 
"200088", "200050.6", "200052.2", "200083.1", "200070.2", "200081.1", 
"200053.2", "200088.1", "200082.2", "200079.2", "200057.4", "200080.2", 
"200060.2", "200084.1", "200072.2", "200055.2", "200061.1", "200080.3", 
"200075.1", "200068.1", "200080.4", "200067.2", "200065", "200057.5", 
"200090", "200082.3", "200051.2", "200068.2", "200052.3", "200083.2", 
"200076"), class = "data.frame")

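【Comments】:

  • "Each ID has multiple dates": any(duplicated(df$X1)) disagrees for your example data. Your IDs (I assume the first column, called X1 in your example) are unique. Or do you mean that some dates have multiple IDs? Either way, make a small example rather than 100 rows.
  • This is unclear: "tell me whether an ID has a sum of 14 integers or 4 separate integers within 12 months". What does "a sum of 14 integers" mean? 1+2+3+4+1+2+3+4+1+2+3+4+7+99 is a sum of 14 integers. Is that not what you mean?
  • I think you may be asking too much here, which discourages partial answers, so unless someone solves all of your problems you won't get any answer. I suggest you delete this post and create several; the first would be how to find which IDs have Integer column values that sum to 14.
  • @Spacedman Hi, I have updated the question to reflect your comments.
  • This example can't be correct, because no ID has instances with multiple unique dates. If it's important enough for a bounty, why not take the time to write out the question and example clearly so that we can help you?

Tags: r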


【Solution 1】:

With your input as "x":

library(data.table)

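# convert x to a data.table by reference and sort it by Date
setDT(x, key = "Date")

# test 1: does the ID's Integer total reach 14?
x[, `:=` (
  test1 = sum(Integer) >= 14
), by = ID]

# test 2: number of distinct Integer values per Date
# (end = Date - 365 is carried along as the intended window start;
#  the grouping itself is by exact Date, not by a rolling 12-month window)
y = x[, .(
  count12 = uniqueN(Integer)
  ), by = .(start = Date, end = Date - 365)]

# combine both tests into a single flag
z = merge(x, y, by.x = "Date", by.y = "start")
z[, end := NULL]
z[, flag := test1 | count12 == 4]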
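
For comparison, a trailing 12-month window per ID can be sketched with a data.table non-equi self-join. This is not the answer author's method, only an illustration. It assumes that "4 separate integers" means at least 4 rows for the same ID within 365 days, and the table `toy` and the columns `win_start` and `count12` are made-up names for the example.

```r
library(data.table)

# Hypothetical toy data (not the asker's): three IDs with dated entries
toy <- data.table(
  ID      = c("a", "a", "a", "a", "b", "b", "c"),
  Date    = as.Date(c("2013-01-10", "2013-04-01", "2013-08-15", "2013-12-20",
                      "2012-01-01", "2013-06-01", "2013-03-03")),
  Integer = c(1, 2, 3, 4, 5, 20, 3)
)

# Condition 1: the ID's Integer total is at least 14
toy[, test1 := sum(Integer) >= 14, by = ID]

# Condition 2: for each row, count the rows of the same ID whose Date
# falls in [Date - 365, Date] (assumption: 365 days stands in for 12 months)
toy[, win_start := Date - 365]
counts <- toy[toy,
              on = .(ID, Date >= win_start, Date <= Date),
              .N, by = .EACHI]
toy[, count12 := counts$N]

# Flag the whole ID if either condition holds anywhere in its history
toy[, flag := test1 | any(count12 >= 4), by = ID]
print(toy)
```

Here ID "a" is flagged by the window count, ID "b" by its total, and ID "c" by neither. Non-equi joins need data.table 1.9.8 or later.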

【Comments】:

    【Solution 2】:

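    Here's what you asked for. Finding the IDs whose Integer values sum to at least 14 is as simple as grouping by ID and checking whether each ID's Integer column sums to >= 14, or in dplyr: df %>% group_by(ID) %>% mutate(conditional = sum(Integer) >= 14). Finding (at least?) 4 separate integers for an ID within 12 months is apparently harder. My solution follows this answer to compute the window counts.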
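
As a small illustration of the grouped-sum check described above, here is a sketch on made-up data. The data frame `toy` and the helper column `total` are hypothetical and only there to make the result easy to inspect.

```r
library(dplyr)

# Hypothetical toy data: ID "a" sums to 14, ID "b" sums to 6
toy <- data.frame(
  ID      = c("a", "a", "b", "b", "b"),
  Integer = c(9, 5, 1, 2, 3)
)

flagged <- toy %>%
  group_by(ID) %>%
  mutate(total = sum(Integer),        # per-ID total, repeated on every row
         conditional = total >= 14) %>%
  ungroup()

print(flagged)
```

Every row of an ID gets the same `conditional` value, because the sum is computed once per group and recycled.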

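    One caveat: because roll_sum works by rolling over a number of rows, the solution I use relies on each ID having only one row per day. Your example data frame actually contains multiple entries for the same ID and date, but they appear to be duplicates, so I removed them. If they are not duplicates and the repeated values need to count towards the sum(Integer) >= 14 condition, you can summarize them beforehand instead of dropping them (e.g. df %>% group_by(ID, Date) %>% summarize(Integer = sum(Integer))), so that each ID has only one entry per date.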
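
The two ways of getting to one row per ID and date can be compared on a small made-up example. This is only a sketch: `toy`, `deduped`, and `summed` are hypothetical names, and the `.groups` argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical toy data: ID "a" has two rows on the same date
toy <- data.frame(
  ID      = c("a", "a", "a", "b"),
  Date    = as.Date(c("2013-10-28", "2013-10-28", "2013-10-30", "2013-10-29")),
  Integer = c(4, 4, 9, 3)
)

# Option 1: treat repeated (ID, Date) rows as duplicates and keep one of each
deduped <- toy %>%
  distinct(ID, Date, .keep_all = TRUE)

# Option 2: treat them as separate events and add their values up
summed <- toy %>%
  group_by(ID, Date) %>%
  summarise(Integer = sum(Integer), .groups = "drop")

print(deduped)
print(summed)
```

Both results have one row per ID and date, but the total for ID "a" is 13 after deduplication and 17 after summing, which can change the outcome of the >= 14 check.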

    library(dplyr)
    library(tidyr)
    library(RcppRoll)

    df_tmp <- df
    df <- df_tmp %>%
      distinct(ID, Date, .keep_all = TRUE) %>% # keep one row per ID and Date, dropping duplicate rows
      complete(ID,
               Date = seq(from = min(Date) - 365, to = max(Date), by = 1),
               fill = list(Integer = 0)) %>% # add a zero-valued row for every ID on every date, starting a year before the first observation
      arrange(ID, Date) %>%
      group_by(ID) %>%
      mutate(roll_count = roll_sum(x = Integer != 0, n = 365, fill = 0, align = "right"), # rolling count of non-zero entries, using n = 365 rows as a stand-in for 12 months
             conditional = sum(Integer) >= 14 | roll_count >= 4) %>% # element-wise |, because roll_count is a vector
      ungroup() %>%
      right_join(df_tmp, by = c("ID", "Date", "Integer")) # right_join with the original data to remove the dummy dates
    

    Hope this helps!

    【Comments】:
