Approach
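Here is a data.table solution, as preferred:
"I would prefer a data.table solution, but any solution is greatly appreciated!"
While dplyr and fuzzyjoin might appear more elegant, they may also prove less efficient on sufficiently large datasets.
Credit to ThomasIsCoding for beating me to the punch on another question, with an answer that harnesses igraph to index networks in a graph. Here, the networks are the separate "chains" (Wanted groups) made up of "links" (data.frame rows), which are connected by their "closeness" (between their Start_Dates and End_Dates). Such an approach seems necessary to model the transitive relationship ℛ you requested:
"I'm trying to create chains of 'close' links so that I can map the movements of A over time."
Care is also taken to preserve the symmetry of ℛ (see Further Reading).
Per the same request
"So ideally, I would want to flag cases where one observation's start date (2016-01-01) is being 'fuzzy grouped' with two different end dates (2015-01-02 and 2016-12-31), and vice versa."
and your further clarification
"...I would want another column indicating [flag]."
I have also included a Flag column, which flags every row whose Start_Date is matched by the End_Dates of at least flag_at other rows, or vice versa.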
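As a minimal sketch of the "closeness" link described above, the pairing of a Start_Date with a nearby End_Date can be written as a data.table non-equi join. The toy tables `starts` and `ends` are hypothetical, and a fixed 365-day window is assumed here in place of a calendar-aware 1-year period, purely to keep the sketch short.

```r
library(data.table)

# Hypothetical toy data: two rows with end dates, two rows with start dates.
ends <- data.table(ID = c(1L, 4L),
                   End_Date = as.Date(c("2019-12-29", "2015-01-09")))
starts <- data.table(ID = c(1L, 3L),
                     Start_Date = as.Date(c("2015-01-01", "2019-10-17")))

# Assumption: a 365-day window approximates the 1-year tolerance.
ends[, `:=`(End_Low = End_Date - 365, End_High = End_Date + 365)]

# Pair each start date with every end date whose window contains it.
pairs <- ends[starts,
              .(from = i.ID, to = x.ID),
              on = .(End_Low <= Start_Date, End_High >= Start_Date),
              nomatch = NULL]
pairs
```

Each resulting row is a directed pairing, from the row that owns the Start_Date to the row that owns the matching End_Date.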
Solution
Using your sample data.frame, reproduced here as my_data_frame
# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
class = "Date"),
End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
class = "Date")),
row.names = c(NA, -7L),
class = "data.frame")
we apply data.table and igraph (among other packages) as follows:
library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)
# ...
# Code to generate your data.frame 'my_data_frame'.
# ...
# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()
# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)
# Set the minimum number of matches for a row to be flagged: 2.
flag_at <- 2
#####################################
# BEGIN: Start Indexing the Groups. #
#####################################
# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
########################################################
# STEP 1: Link the Rows That Are "Close" to Each Other #
########################################################
# Prepare data.table for JOIN, by adding appropriate helper columns.
.[, `:=`(# Uniquely identify each row (by row number).
ID = .I,
# Boundary columns for tolerance threshold.
End_Low = End_Date - tolerance,
End_High = End_Date + tolerance)] %>%
# JOIN rows to each other, to obtain pairings.
.[my_data_table,
# Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
# close enough to (within the boundary columns for) the 'End_Date' of y.
.(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
# JOIN criteria:
on = .(# Only pair rows having the same name.
Name,
# Only pair rows whose start and end dates are within the tolerance
# threshold of each other.
End_Low <= Start_Date,
End_High >= Start_Date),
# Make it an OUTER JOIN, to include those rows without a match.
nomatch = NA] %>%
# Prepare pairings for network analysis.
.[# Ensure no row is reflexively paired with itself.
# NOTE: This keeps the graph clean by trimming extraneous loops, and it
# prevents an "orphan" row from contributing to its own tally of matches.
!(x.ID == y.ID) %in% TRUE,
# Simplify the dataset to only the pairings (by ID) of linked rows.
.(from = x.ID, to = y.ID)]
#############################
# PAUSE: Count the Matches. #
#############################
# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
# Include again the missing IDs for y that were never matched by the JOIN.
.[my_data_table[, .(ID)], on = .(to = ID)] %>%
# For each row y, count every other row x where x R y.
.[, .(Matches = sum(!is.na(from))), by = to] %>%
# Extract the count column.
.$Matches
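# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%
# Include again any IDs for x that were dropped because they paired only with themselves.
.[my_data_table[, .(ID)], on = .(from = ID)] %>%
# For each row x, count every other row y where x R y.
.[, .(Matches = sum(!is.na(to))), by = from] %>%
# Extract the count column.
.$Matches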
#########################################
# RESUME: Continue Indexing the Groups. #
#########################################
# Resume indexing:
output <- output %>%
# Ignore nonmatches (NAs) which are annoying to process into a graph.
.[from != to, ] %>%
###############################################################
# STEP 2: Index the Separate "Chains" Formed By Those "Links" #
###############################################################
# Convert pairings (by ID) of linked rows into an undirected graph.
igraph::graph_from_data_frame(directed = FALSE) %>%
# Find all groups (subgraphs) of transitively linked IDs.
igraph::components() %>%
# Pair each ID with its group index.
igraph::membership() %>%
# Tabulate those pairings...
utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
# ...in a properly named data.table.
data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
#####################################################
# STEP 3: Match the Original Rows to their "Chains" #
#####################################################
# LEFT JOIN (on ID) to match each original row to its group index (if any).
.[my_data_table, on = .(ID)] %>%
# Transform output into final form.
.[# Sort into original order.
order(ID),
.(# Select existing columns.
Name, Start_Date, End_Date,
# Rename column having the group indices.
Wanted = Group_Index,
# Calculate column(s) to flag rows with sufficient matches.
Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]
# View results.
output
Results
The resulting output is the following data.table:
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
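Keep in mind that the Flags are all FALSE only because your data lacks any Start_Date matched by (at least) two End_Dates, and likewise any End_Date matched by (at least) two Start_Dates.
Hypothetically, if we lowered flag_at to 1, then the output would Flag every row having even a single match (in either direction):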
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 TRUE
4: A 2012-04-16 2015-01-09 1 TRUE
5: A 2002-06-01 2006-02-01 2 TRUE
6: A 2005-12-24 <NA> 2 TRUE
7: B 2018-01-23 <NA> NA FALSE
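The effect of the threshold can be sketched in isolation. The match counts below were worked out by hand from the sample data and should be treated as an assumption; `flag_rows` is a hypothetical helper, not part of the workflow above.

```r
# Match counts per row (rows 1 to 7), assumed from the sample data.
Start_Matched <- c(1, 0, 1, 0, 0, 1, 0)
End_Matched   <- c(1, 0, 0, 1, 1, 0, 0)

# Hypothetical helper: flag rows with at least 'flag_at' matches
# in either direction.
flag_rows <- function(flag_at) {
  (Start_Matched >= flag_at) | (End_Matched >= flag_at)
}

flag_rows(2)  # no row reaches two matches in either direction
flag_rows(1)  # every row with at least one match is flagged
```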
Warning
Because certain data.table operations modify by reference (or "in place"), the value of my_data_table changes throughout the workflow. After Step 1, my_data_table becomes
Name Start_Date End_Date ID End_Low End_High
1: A 2015-01-01 2019-12-29 1 2018-12-29 2020-12-29
2: A 2017-03-25 <NA> 2 <NA> <NA>
3: A 2019-10-17 <NA> 3 <NA> <NA>
4: A 2012-04-16 2015-01-09 4 2014-01-09 2016-01-09
5: A 2002-06-01 2006-02-01 5 2005-02-01 2007-02-01
6: A 2005-12-24 <NA> 6 <NA> <NA>
7: B 2018-01-23 <NA> 7 <NA> <NA>
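which departs from the structure of the my_data_frame it originally copied.
Since dplyr (among other packages) assigns by value rather than by reference, a dplyr solution would sidestep this issue entirely.
As it stands, however, you must take care when modifying the workflow, because the version of my_data_table that existed before Step 1 cannot be recovered afterwards.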
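One way to guard against this, sketched here with a hypothetical toy table rather than the actual workflow, is to take a deep copy with data.table::copy() before any in-place modification:

```r
library(data.table)

dt <- data.table(x = 1:3)
alias <- dt              # Plain assignment: both names refer to the same table.
snapshot <- copy(dt)     # Deep copy: unaffected by later in-place changes.

dt[, y := x * 2L]        # Adds a column by reference.

names(alias)     # includes "y", since 'alias' is the same object as 'dt'
names(snapshot)  # still only "x"
```

Under the same assumption, calling data.table::copy(my_data_table) before Step 1 should keep the original version available.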
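Further Reading
Although the JOINing of data.tables is explicitly directional, with a "right side" and a "left side", this model manages to preserve the relational symmetry you describe here
"If ... [either] one's 'Start_Date' is +- 1 year within the other observation's 'End_Date', they are classified as being in the same group."
via the use of an undirected graph.
When the JOIN relates the 1st row a (having a Start_Date of 2015-01-01) to the 4th row b (having an End_Date of 2015-01-09), we gather that the Start_Date of a is "close enough" to (within 1 year of) the End_Date of b. So we say mathematically that a ℛ b, or
a "is in the same group as" b.
However, the converse b ℛ a will not necessarily appear in the JOINed data, because the Start_Date of b might not land so conveniently near the End_Date of a. That is, the JOINed data will not necessarily indicate that
b "is in the same group as" a.
In the latter case, a strictly directed graph ("digraph") would not capture the common membership of a and b in the same group. You can observe this jarring difference by setting directed = TRUE in the first line of Step 2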
igraph::graph_from_data_frame(directed = TRUE) %>%
and by further setting mode = "strong" in the next line
igraph::components(mode = "strong") %>%
which yields these disassociated results:
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 4 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 3 FALSE
4: A 2012-04-16 2015-01-09 5 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 1 FALSE
7: B 2018-01-23 <NA> NA FALSE
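By contrast, the rows can be properly grouped by using an undirected graph (directed = FALSE), or by using a more lenient criterion (mode = "weak"). Either approach effectively simulates the presence of b ℛ a whenever a ℛ b is present in the JOINed data.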
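The contrast between the two modes can be sketched on a small directed graph. The three pairings below are assumed to be the ones the sample data yields (worked out by hand), so treat them as illustrative rather than as output of the workflow.

```r
library(igraph)

# Directed pairings (from -> to), assumed from the sample data.
edges <- data.frame(from = c(1, 3, 6), to = c(4, 1, 5))
g <- graph_from_data_frame(edges, directed = TRUE)

# Strong connectivity requires a directed path in both directions.
strong <- components(g, mode = "strong")
# Weak connectivity ignores edge direction.
weak <- components(g, mode = "weak")

strong$no  # every vertex ends up in its own group
weak$no    # rows 1, 3, 4 share one group; rows 5, 6 share another
```

With no cycles among the pairings, the strong mode cannot merge any rows, which is why each matched row received a distinct index in the table above.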
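This symmetric property is especially important when modeling the behavior you describe here:
"...one observation's start date (2016-01-01) is being 'fuzzy grouped' with two different end dates (2015-01-02 and 2016-12-31)..."
In such a situation, you want the model to recognize that any two rows b and c must be in the same group (b ℛ c) whenever their End_Dates match the same Start_Date of some other row a: a ℛ b and a ℛ c.
So suppose we know that a ℛ b and a ℛ c. Because our model preserves symmetry, we can also say from a ℛ b that b ℛ a. Since we now know that b ℛ a and a ℛ c, transitivity implies that b ℛ c. Thus, our model recognizes that b ℛ c whenever a ℛ b and a ℛ c! Similar logic suffices for the "vice versa" case.
We can verify this result by using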
my_data_frame <- my_data_frame %>%
rbind(list(Name = "A",
Start_Date = as.Date("2010-01-01"),
End_Date = as.Date("2015-01-05")))
to append an 8th row to my_data_frame prior to the workflow:
Name Start_Date End_Date
1 A 2015-01-01 2019-12-29
# ⋮ ⋮ ⋮ ⋮
4 A 2012-04-16 2015-01-09
# ⋮ ⋮ ⋮ ⋮
8 A 2010-01-01 2015-01-05
The 8th row serves as our c, where a is the 1st row and b is the 4th row, as before. Indeed, the output properly classifies b and c as belonging to the same group 1: b ℛ c.
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
8: A 2010-01-01 2015-01-05 1 FALSE
Likewise, the output properly Flags the 1st row, whose Start_Date is now matched by two End_Dates: those in the 4th and 8th rows.
Cheers!