【Question title】: Create group based on fuzzy criteria
【Posted】: 2021-09-22 00:05:40
【Question description】:

I have a data frame that looks like this:

Name   Start_Date   End_Date
A      2015-01-01   2019-12-29
A      2017-03-25   NA
A      2019-10-17   NA
A      2012-04-16   2015-01-09
A      2002-06-01   2006-02-01
A      2005-12-24   NA
B      2018-01-23   NA

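I want to create a column such that two observations are classified into the same group if they have the same Name and the Start_Date of one is within +- 1 year of the End_Date of the other.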

Desired output:

Name   Start_Date   End_Date    Wanted
A      2015-01-01   2019-12-29  1
A      2017-03-25   NA          NA
A      2019-10-17   NA          1
A      2012-04-16   2015-01-09  1
A      2002-06-01   2006-02-01  2
A      2005-12-24   NA          2
B      2018-01-23   NA          NA

I am looking for a solution with data.table, but anything that solves my problem will do.

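Added: row-by-row explanation
Rows:

Row 1: Start date is 8 days before the end date of row 4 (within 1 year).
Row 2: Start date is more than 2 years away from the end date of row 1, so it is not in the same group as row 1. The same goes for rows 4 and 5, so it is not in the same group as those two either.
Row 3: Start date is about 2 months before the end date of row 1 (within 1 year).
Row 4: See row 1.
Row 5: See below.
Row 6: Start date is less than 3 months before the end date of row 5 (within 1 year).
Row 7: There is no other Name B to compare against. It is in its own group.

Hence, rows 1, 3 and 4 are in the same group. Rows 5 and 6 are in the same group. Rows 2 and 7 have no group.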

Edit: I have updated my post so that the Wanted category is consistent whenever an observation does not match any other observation.

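【Question comments】:

If the End_Date is NA, how do you compare the Start_Date with the End_Date?
@Onyambu Is there a more efficient approach than a fuzzy or non-equi join, followed by igraph to generate network indices, and then grouping the dataset by that index?
@EconNoobie Should this relation be transitive? That is, if "A" | 2009-01-01 | 2015-01-02 | ... is "fuzzily grouped" with "A" | 2016-01-01 | NA | ..., which is in turn "fuzzily grouped" with "A" | 2016-12-31 | 2017-01-01 | ..., should they all belong to the same group? In theory, this could produce "chains" of "closely" "linked" rows, where these small differences add up from "link" to "link" and eventually span a decade or more between the first and last "links", even though they are all part of the same "fuzzy group".
@Greg That's a good question. I am trying to create chains of "close" links so that I can map A's movement over time. So ideally, I would like to flag cases where one observation's start date (2016-01-01) is "fuzzily grouped" with two different end dates (2015-01-02 and 2016-12-31), and vice versa.
@Greg I have updated my OP to be consistent and keep them as NA. Thanks for pointing that out!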

Tags: r data.table igraph

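【Solution 1】:

Approach

Here is a solution with data.table, as you preferred:

I would prefer a solution with data.table, but any solution is much appreciated!

While dplyr and fuzzyjoin might appear more elegant, they may also prove less efficient on sufficiently large datasets.

Credit to ThomasIsCoding for beating me to it on another question, with an answer that leverages igraph to index networks in a graph. Here, the networks are the separate "chains" (Wanted groups) made up of "links" (data.frame rows), which are joined by their "closeness" (between their Start_Dates and End_Dates). This approach seems necessary to model the transitive relationship ℛ that you requested in the comments: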

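I am trying to create chains of "close" links so that I can map A's movement over time.

Care is also taken to preserve the symmetry of ℛ (see Further Reading).

Per the same request

So ideally, I would like to flag cases where one observation's start date (2016-01-01) is "fuzzily grouped" with two different end dates (2015-01-02 and 2016-12-31), and vice versa.

and your further clarification

...I would want another column that indicates [flag].

I have also included a Flag column, to flag every row whose Start_Date is matched by the End_Dates of at least flag_at other rows; or vice versa.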
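
As a minimal sketch of the grouping idea in isolation (not the full workflow), suppose the "close" pairings have already been found. The `links` table below is hand-written for illustration: it lists the three pairings that the sample data should yield under a 1-year tolerance (row 1 with row 4, row 3 with row 1, row 6 with row 5). igraph then labels each connected "chain":

```r
library(igraph)

# Hypothetical, hand-written pairings (by row number) for the sample data:
# 'from' is the row whose Start_Date is close to the End_Date of row 'to'.
links <- data.frame(from = c(1, 3, 6), to = c(4, 1, 5))

# Treat the pairings as an undirected graph; vertices are named by row number.
g <- igraph::graph_from_data_frame(links, directed = FALSE)

# Label every connected component ("chain") of the graph.
comp <- igraph::components(g)

# Named vector: names are row numbers, values are component labels.
comp$membership
```

Rows 1, 3 and 4 land in one component even though rows 3 and 4 are never paired directly, which is the transitive behavior being modeled. Rows with no pairing (2 and 7) never enter the graph, so they would receive NA once the labels are joined back to the data.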

Solution

Using your sample data.frame, reproduced here as my_data_frame

# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
                                Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
                                                       class = "Date"),
                                End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
                                                     class = "Date")),
                           row.names = c(NA, -7L),
                           class = "data.frame")

we apply data.table and igraph (among other packages) as follows:

library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)



# ...
# Code to generate your data.frame 'my_data_frame'.
# ...



# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()


# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)

# Set the minimum number of matches for a row to be flagged: 2.
flag_at <- 2



#####################################
# BEGIN: Start Indexing the Groups. #
#####################################

# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
  
  ########################################################
  # STEP 1: Link the Rows That Are "Close" to Each Other #
  ########################################################
  
  # Prepare data.table for JOIN, by adding appropriate helper columns.
  .[, `:=`(# Uniquely identify each row (by row number).
           ID = .I,
           # Boundary columns for tolerance threshold.
           End_Low = End_Date - tolerance,
           End_High = End_Date + tolerance)] %>%
    
  # JOIN rows to each other, to obtain pairings.
  .[my_data_table,
    # Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
    # close enough to (within the boundary columns for) the 'End_Date' of y.
    .(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
      y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
    # JOIN criteria:
    on = .(# Only pair rows having the same name.
           Name,
           # Only pair rows whose start and end dates are within the tolerance
           # threshold of each other.
           End_Low <= Start_Date,
           End_High >= Start_Date),
    # Make it an OUTER JOIN, to include those rows without a match.
    nomatch = NA] %>%
  
  # Prepare pairings for network analysis.
  .[# Ensure no row is reflexively paired with itself.
    #   NOTE: This keeps the graph clean by trimming extraneous loops, and it
    #   prevents an "orphan" row from contributing to its own tally of matches.
    !(x.ID == y.ID) %in% TRUE,
    # Simplify the dataset to only the pairings (by ID) of linked rows.
    .(from = x.ID, to = y.ID)]



#############################
# PAUSE: Count the Matches. #
#############################

# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
  
  # Include again the missing IDs for y that were never matched by the JOIN.
  .[my_data_table[, .(ID)], on = .(to = ID)] %>%
  
  # For each row y, count every other row x where x R y.
  .[, .(Matches = sum(!is.na(from))), by = to] %>%
  
  # Extract the count column.
  .$Matches


# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
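my_data_table$Start_Matched <- output %>%
  
  # Include again any IDs for x that were dropped along with a reflexive pairing,
  # so that the counts always line up with the rows of 'my_data_table'.
  .[my_data_table[, .(ID)], on = .(from = ID)] %>%
  
  # For each row x, count every other row y where x R y.
  .[, .(Matches = sum(!is.na(to))), by = from] %>%
  
  # Extract the count column.
  .$Matches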



#########################################
# RESUME: Continue Indexing the Groups. #
#########################################

# Resume indexing:
output <- output %>%
  
  # Ignore nonmatches (NAs) which are annoying to process into a graph.
  .[from != to, ] %>%
  
  ###############################################################
  # STEP 2: Index the Separate "Chains" Formed By Those "Links" #
  ###############################################################
  
  # Convert pairings (by ID) of linked rows into an undirected graph.
  igraph::graph_from_data_frame(directed = FALSE) %>%
  
  # Find all groups (subgraphs) of transitively linked IDs.
  igraph::components() %>%
  
  # Pair each ID with its group index.
  igraph::membership() %>%
  
  # Tabulate those pairings...
  utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
  
  # ...in a properly named data.table.
  data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
  
  
  
  #####################################################
  # STEP 3: Match the Original Rows to their "Chains" #
  #####################################################
  
  # LEFT JOIN (on ID) to match each original row to its group index (if any).
  .[my_data_table, on = .(ID)] %>%
  
  # Transform output into final form.
  .[# Sort into original order.
    order(ID),
    .(# Select existing columns.
      Name, Start_Date, End_Date,
      # Rename column having the group indices.
      Wanted = Group_Index,
      # Calculate column(s) to flag rows with sufficient matches.
      Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]



# View results.
output
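
The non-equi JOIN in Step 1 may be easier to follow on a tiny table. The sketch below is a simplified stand-in rather than the workflow itself, and it rests on a few assumptions of mine: the names `toy`, `bounds` and `End_ID` are made up, a fixed 365-day window approximates the 1-year tolerance (instead of a lubridate period), and the join is an inner join (`nomatch = NULL`) so unmatched rows simply drop out.

```r
library(data.table)

# Three rows borrowed from the sample data (original rows 1, 2 and 4).
toy <- data.table(
  Name       = c("A", "A", "A"),
  Start_Date = as.Date(c("2015-01-01", "2017-03-25", "2012-04-16")),
  End_Date   = as.Date(c("2019-12-29", NA, "2015-01-09"))
)
toy[, ID := .I]

# Assumption: a fixed 365-day window approximates the 1-year tolerance.
# One window per row that actually has an End_Date.
bounds <- toy[!is.na(End_Date),
              .(Name, End_ID = ID,
                End_Low  = End_Date - 365,
                End_High = End_Date + 365)]

# Pair each Start_Date (from 'toy') with every window (from 'bounds')
# of the same Name that contains it.
pairs <- bounds[toy,
                .(from = i.ID, to = End_ID),
                on = .(Name, End_Low <= Start_Date, End_High >= Start_Date),
                nomatch = NULL]

pairs
```

Here only the first row of `toy` has a Start_Date inside another row's window (that of the third row, whose End_Date is 2015-01-09), so `pairs` should hold that single pairing.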

Result

The resulting output is the following data.table:

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

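Keep in mind that the Flags are all FALSE simply because your data lack any Start_Date matching (at least) two End_Dates, and any End_Date matching (at least) two Start_Dates.

Hypothetically, if we lowered flag_at to 1, then the output would Flag every row with even a single match (in either direction):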

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1  TRUE
4:    A 2012-04-16 2015-01-09      1  TRUE
5:    A 2002-06-01 2006-02-01      2  TRUE
6:    A 2005-12-24       <NA>      2  TRUE
7:    B 2018-01-23       <NA>     NA FALSE
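
As a rough cross-check of these two tables, the match counts can be recomputed in base R without any joins. This is only a sketch under my own assumptions: 365 days stands in for the 1-year tolerance, and the helper names (`is_close`, `start_matched`, `end_matched`, `flag_rows`) are invented for illustration.

```r
# Sample data from the question.
df <- data.frame(
  Name       = c("A", "A", "A", "A", "A", "A", "B"),
  Start_Date = as.Date(c("2015-01-01", "2017-03-25", "2019-10-17", "2012-04-16",
                         "2002-06-01", "2005-12-24", "2018-01-23")),
  End_Date   = as.Date(c("2019-12-29", NA, NA, "2015-01-09",
                         "2006-02-01", NA, NA))
)

# Assumption: 365 days approximates the 1-year tolerance.
tolerance_days <- 365

# gap[i, j]: days from the End_Date of row j to the Start_Date of row i.
gap <- outer(as.numeric(df$Start_Date), as.numeric(df$End_Date), "-")
same_name <- outer(df$Name, df$Name, "==")

# is_close[i, j]: row i's Start_Date is within tolerance of row j's End_Date.
is_close <- same_name & !is.na(gap) & abs(gap) <= tolerance_days
diag(is_close) <- FALSE  # a row never counts as its own match

start_matched <- rowSums(is_close)  # End_Dates matching each row's Start_Date
end_matched   <- colSums(is_close)  # Start_Dates matching each row's End_Date

# Flag rows with at least 'flag_at' matches in either direction.
flag_rows <- function(flag_at) {
  unname((start_matched >= flag_at) | (end_matched >= flag_at))
}

flag_rows(2)
flag_rows(1)
```

With this data, `flag_rows(2)` should be FALSE throughout and `flag_rows(1)` should be TRUE for rows 1, 3, 4, 5 and 6, in agreement with the Flag columns above.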

Warning

Because some data.table operations modify by reference (or "in place"), the value of my_data_table changes throughout the workflow. After Step 1, my_data_table becomes

   Name Start_Date   End_Date ID    End_Low   End_High
1:    A 2015-01-01 2019-12-29  1 2018-12-29 2020-12-29
2:    A 2017-03-25       <NA>  2       <NA>       <NA>
3:    A 2019-10-17       <NA>  3       <NA>       <NA>
4:    A 2012-04-16 2015-01-09  4 2014-01-09 2016-01-09
5:    A 2002-06-01 2006-02-01  5 2005-02-01 2007-02-01
6:    A 2005-12-24       <NA>  6       <NA>       <NA>
7:    B 2018-01-23       <NA>  7       <NA>       <NA>

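which differs in structure from the my_data_frame it was originally copied from.

Since dplyr (among other packages) assigns by value rather than by reference, a dplyr solution would sidestep this issue entirely.

As it stands, however, you must take care when modifying the workflow, because the version of my_data_table that was available before Step 1 cannot be recovered afterwards.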
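
If you would rather keep an untouched version around, one option (not used in the workflow above) is to take a snapshot with data.table::copy() before anything is modified. A small sketch with made-up table names (`original`, `alias`, `snapshot`):

```r
library(data.table)

original <- data.table(ID = 1:3)

# Plain assignment does not copy: both names refer to the same table,
# so adding a column by reference through 'alias' also changes 'original'.
alias <- original
alias[, New_Col := ID * 2]
"New_Col" %in% names(original)      # TRUE

# copy() creates an independent table; changes to it leave 'original' alone.
snapshot <- copy(original)
snapshot[, Another_Col := ID + 1]
"Another_Col" %in% names(original)  # FALSE
```

Applied here, something like `my_data_table_before <- data.table::copy(my_data_table)` ahead of Step 1 would preserve the pre-JOIN structure, at the cost of holding a second copy in memory.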

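Further Reading

Although the JOINing of data.tables is explicitly directional, with a "right side" and a "left side", this model manages to preserve the relational symmetry you described here

if ... [one] of the observations' 'Start_Date' is within +- 1 year of the other observations' 'End_Date', they are classified as being in the same group.

through the use of an undirected graph.

When the JOIN relates the 1st row x (having a Start_Date of 2015-01-01) to the 4th row y (having an End_Date of 2015-01-09), we gather that the Start_Date of x is "close enough" (within 1 year) to the End_Date of y. So we say mathematically that x ℛ y, or

x "is in the same group as" y.

However, the converse y ℛ x will not necessarily appear in the JOINed data, because the Start_Date of y might not land so conveniently near the End_Date of x. That is, the JOINed data will not necessarily indicate that

y "is in the same group as" x.

In the latter case, a strictly directed graph ("digraph") would not capture the common membership of x and y in the same group. You can observe this jarring difference by setting directed = TRUE in the first line of Step 2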

  igraph::graph_from_data_frame(directed = TRUE) %>%

and also setting mode = "strong" in the next line

  igraph::components(mode = "strong") %>%

which yields these disjointed results:

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      4 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      3 FALSE
4:    A 2012-04-16 2015-01-09      5 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      1 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

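By contrast, the rows can be properly grouped by using an undirected graph (directed = FALSE), or by using more lenient criteria (mode = "weak"). Either of these approaches effectively simulates the presence of y ℛ x whenever x ℛ y appears in the JOINed data.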
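
The difference between the two modes can be seen on the pairings alone. In this sketch the `links` table is hand-written for illustration; it holds the three directed pairings that the sample data should yield under a 1-year tolerance:

```r
library(igraph)

# Hypothetical, hand-written pairings (by row number) for the sample data:
# 'from' is the row whose Start_Date is close to the End_Date of row 'to'.
links <- data.frame(from = c(1, 3, 6), to = c(4, 1, 5))

# Keep the direction of each pairing.
g_directed <- igraph::graph_from_data_frame(links, directed = TRUE)

# "strong": two rows share a component only if each can reach the other.
strong <- igraph::components(g_directed, mode = "strong")

# "weak": the direction of the pairings is ignored.
weak <- igraph::components(g_directed, mode = "weak")

strong$no  # number of strongly connected components
weak$no    # number of weakly connected components
```

Because none of the pairings is reciprocated, every row forms its own strongly connected component (five in all), whereas the weak mode recovers the two intended groups.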

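This symmetric property is especially important when modeling the behavior you described in the comments:

...one observation's start date (2016-01-01) is "fuzzily grouped" with two different end dates (2015-01-02 and 2016-12-31)...

In this situation, you want the model to recognize that any two rows y and z must be in the same group (y ℛ z), whenever their End_Dates are matched by the same Start_Date of some other row x: x ℛ y and x ℛ z.

Suppose we know that x ℛ y and x ℛ z. Because our model preserves symmetry, we can also say y ℛ x from x ℛ y. Since we now know that y ℛ x and x ℛ z, transitivity implies that y ℛ z. Thus, our model recognizes y ℛ z whenever x ℛ y and x ℛ z! Similar logic suffices for the "vice versa".

We can verify this result by using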

my_data_frame <- my_data_frame %>%
  rbind(list(Name = "A",
             Start_Date = as.Date("2010-01-01"),
             End_Date = as.Date("2015-01-05")))

to append an 8th row to my_data_frame before the workflow:

    Name Start_Date   End_Date
  1    A 2015-01-01 2019-12-29
# ⋮    ⋮      ⋮           ⋮
  4    A 2012-04-16 2015-01-09
# ⋮    ⋮      ⋮           ⋮
  8    A 2010-01-01 2015-01-05

The 8th row serves as our z, where x is the 1st row and y is the 4th row, as before. Indeed, the output correctly classifies y and z as belonging to the same group 1: y ℛ z.

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE
8:    A 2010-01-01 2015-01-05      1 FALSE

Likewise, the output correctly Flags the 1st row, whose Start_Date is now matched by the two End_Dates in the 4th and 8th rows.

Cheers!

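【Comments】:

Thanks a lot @Greg! This is very helpful and solves most of my problem! To clarify, when I said I wanted to flag cases where one observation's start date matches the end dates of more than two observations, I meant that I wanted another column indicating this. But that was not reflected in the OP and I was not clear about it, so I completely understand!
@EconNoobie I just updated it to suit your needs. Enjoy!
Thank you! I just have one follow-up question. The last one, I promise. How should I flag cases where an End_Date also matches more than two Start_Dates? I am trying to capture those cases under the "Flag" variable.
@EconNoobie I just figured it out, though the workflow is a bit cumbersome.
@EconNoobie Do you want separate flags for (1) multiple matching Start_Dates and (2) multiple matching End_Dates? Or do you want a single flag? If you want a single flag, should it be "raised" when either the Start_Date or (|) the End_Date has multiple matches?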