【Question title】: Group individuals based on direct and indirect relationships
【Posted】: 2021-10-05 09:58:36
【Question description】:

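I would like to group individuals into households based on address and parcel ownership. People belong to the same household if they live at the same address and are linked, directly or indirectly, through the ownership of at least one parcel.

The link between individuals can be direct, i.e. two individuals have a parcel in common. But the link can also be indirect, forming a chain through cross-links: two individuals have a parcel in common, one of them has a parcel in common with another person, and all of them live at the same address.

Here are some examples:

  • If an individual (9) lives alone at his address (C), he forms a household on his own, even if another individual (6) also owns his parcel.
  • If two individuals (12 and 13) live at the same address (F) and own the same parcel (w), they belong to the same household. But if three individuals live at the same address (B) and only two of them (7 and 8) own the same parcel (r), while the third (6) lives at that address (B) but owns another parcel (s), only the two who own the same parcel belong to the same household.
  • If 4 individuals (1, 2, 3 and 4) live at the same address (A), and individuals 1, 2 and 3 are linked through the ownership of several parcels (m, n and o), they belong to the same household, whereas the individual (4) who also lives at that address but owns none of these 3 parcels, only another one (p), does not belong to it.

I have three variables: address ID, owner ID and parcel ID. I would like to obtain a household number. Here is an example table: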

 id_address id_owner id_parcel id_household
          A        1         m            1
          A        1         n            1
          A        2         n            1
          A        2         o            1
          A        3         o            1
          A        4         p            2
          A        5         q            3
          B        6         s            4
          B        7         r            5
          B        8         r            5
          C        9         s            6
          D       10         t            7
          E       11         u            8
          E       11         v            8
          F       12         w            9
          F       13         w            9

My first thought was a loop, but I have 800,000 rows, so it would probably take too long.

Example data, where "id_household" is the variable I want to create:

d <- structure(list(id_address = c("A", "A", "A", "A", "A", "A", "A", 
"B", "B", "B", "C", "D", "E", "E", "F", "F"), id_owner = c(1L, 
1L, 2L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 11L, 12L, 13L
), id_parcel = c("m", "n", "n", "o", "o", "p", "q", "s", "r", 
"r", "s", "t", "u", "v", "w", "w"), id_household = c(1L, 1L, 
1L, 1L, 1L, 2L, 3L, 4L, 5L, 5L, 6L, 7L, 8L, 8L, 9L, 9L)), class = "data.frame", row.names = c(NA, 
-16L))

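【Question comments】:

Tags: r grouping recode


【Solution 1】:

Your problem can be treated as a graph problem, and the igraph package provides great tools for it. The columns 'id_owner' and 'id_parcel' can be treated as an edge list. The components function gives "the cluster id to which each vertex belongs". I use data.table for the general data wrangling.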

library(data.table)
library(igraph)

setDT(d)
d2 = d[ , {
  # create graph. columns id_owner and id_parcel are treated as an edge list.
  g = graph_from_data_frame(.SD)

  # get components of the graph that are directly or indirectly connected
  mem = components(g)$membership

  # grab the memberships and their names (i.e. the vertices) 
  .(id_parcel = names(mem), mem = mem)
 
  # do the above for each id_address
}, by = id_address]

# join the memberships to the original data
# paste with id_address for uniqueness
d[d2, on = .(id_address, id_parcel), id := paste0(id_address, mem)]

# if you want a consecutive integer as 'id', to make it agree with 'id_household'
d[ , id2 := as.integer(as.factor(id))]

Output:

d
#     id_address id_owner id_parcel id_household id id2
#  1:          A        1         m            1 A1   1
#  2:          A        1         n            1 A1   1
#  3:          A        2         n            1 A1   1
#  4:          A        2         o            1 A1   1
#  5:          A        3         o            1 A1   1
#  6:          A        4         p            2 A2   2
#  7:          A        5         q            3 A3   3
#  8:          B        6         s            4 B1   4
#  9:          B        7         r            5 B2   5
# 10:          B        8         r            5 B2   5
# 11:          C        9         s            6 C1   6
# 12:          D       10         t            7 D1   7
# 13:          E       11         u            8 E1   8
# 14:          E       11         v            8 E1   8
# 15:          F       12         w            9 F1   9
# 16:          F       13         w            9 F1   9
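
For readers who want to see the connected-components idea without igraph, here is a minimal base-R sketch. It is not the method used above: `label_households` is a hypothetical helper that repeatedly propagates the smallest row label across rows sharing an owner or a parcel at the same address, until nothing changes. It is only meant to illustrate the grouping rule on the example data, and it may need many passes on long chains.

```r
# Example data from the question (id_household is the expected result)
d <- data.frame(
  id_address = c("A","A","A","A","A","A","A","B","B","B","C","D","E","E","F","F"),
  id_owner = c(1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 12, 13),
  id_parcel = c("m","n","n","o","o","p","q","s","r","r","s","t","u","v","w","w"),
  id_household = c(1, 1, 1, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 9)
)

# Hypothetical helper (not part of any package): label connected components
# by propagating the smallest row label through shared keys
label_households <- function(address, owner, parcel) {
  owner_key  <- paste(address, owner,  sep = "_")
  parcel_key <- paste(address, parcel, sep = "_")
  label <- seq_along(address)
  repeat {
    new <- ave(label, owner_key, FUN = min)  # rows of the same owner at an address
    new <- ave(new, parcel_key, FUN = min)   # rows sharing a parcel at an address
    if (identical(new, label)) break         # labels only decrease, so this terminates
    label <- new
  }
  match(label, unique(label))  # consecutive ids in order of first appearance
}

d$id_hh <- label_households(d$id_address, d$id_owner, d$id_parcel)
all(d$id_hh == d$id_household)
```

On the example data this should reproduce the 'id_household' column. For 800,000 rows the igraph approach is likely the better choice.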

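Here is an alternative that avoids the by operation. On the other hand, it adds a few other steps, so whether it is more efficient depends on the structure of the data.

First, create "compound variables" in which the address is concatenated with the parcel and with the owner, respectively. Then compute the membership. Retrieve the original columns by splitting the vertex names (tstrsplit(names(mem), "_", fixed = TRUE)). Finally, join the membership to the original data.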

d[ , `:=`(
  address_parcel = paste(id_address, id_parcel, sep = "_"),
  address_owner = paste(id_address, id_owner, sep = "_"))]

d2 = d[ , {
  g = graph_from_data_frame(.SD[ , .(address_owner, address_parcel)])
  mem = components(g)$membership
  c(tstrsplit(names(mem), "_", fixed = TRUE), .(mem = mem))
}]

d[d2, on = c(id_address = "V1", id_parcel = "V2"), id_hh := mem]

Output:

d
#     id_address id_owner id_parcel id_household id_hh
#  1:          A        1         m            1     1
#  2:          A        1         n            1     1
#  3:          A        2         n            1     1
#  4:          A        2         o            1     1
#  5:          A        3         o            1     1
#  6:          A        4         p            2     2
#  7:          A        5         q            3     3
#  8:          B        6         s            4     4
#  9:          B        7         r            5     5
# 10:          B        8         r            5     5
# 11:          C        9         s            6     6
# 12:          D       10         t            7     7
# 13:          E       11         u            8     8
# 14:          E       11         v            8     8
# 15:          F       12         w            9     9
# 16:          F       13         w            9     9
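
The compound-variable step can be illustrated with a small base-R sketch. It uses the two rows from the question where parcel "s" appears at different addresses, and `strsplit` as a stand-in for `tstrsplit`. Note the assumption that neither the addresses nor the ids contain "_" themselves, otherwise the split would be ambiguous.

```r
# Parcel "s" is owned at two different addresses (rows taken from the question)
address <- c("B", "C")
owner   <- c(6, 9)
parcel  <- c("s", "s")

# Compound names make the vertex names address-specific
address_owner  <- paste(address, owner,  sep = "_")   # "B_6" "C_9"
address_parcel <- paste(address, parcel, sep = "_")   # "B_s" "C_s"

# Without the prefix, owners 6 and 9 would share the vertex "s";
# with it, they touch different vertices and stay in separate households
shares_plain    <- parcel[1] == parcel[2]
shares_compound <- address_parcel[1] == address_parcel[2]

# Splitting the compound name recovers the original columns
parts <- do.call(rbind, strsplit(address_parcel, "_", fixed = TRUE))
colnames(parts) <- c("id_address", "id_parcel")
parts
```

This is why a single graph over the whole table can replace the per-address graphs: the address is baked into every vertex name, so no edge can cross addresses.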

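Timing the two alternatives on larger data (the original data repeated 1e4 times, with new ids created in each chunk). For this particular data, the second alternative (the one avoiding by) is about 100 times faster.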

# prepare toy data
d1 = as.data.table(d)
n = 1e4
dL = d1[rep(1:.N, n)]

# make unique id within the repeated data frames
dL[ , `:=`(
  id_address = paste(rep(1:n, each = nrow(d1)), sep = ".", id_address),
  id_owner = paste(rep(1:n, each = nrow(d1)), sep = ".", id_owner),
  id_parcel = paste(rep(1:n, each = nrow(d1)), sep = ".", id_parcel)
)]

Alternative 1: by address

dL1 = copy(dL) 

system.time({
d2 = dL1[ , {
  g = graph_from_data_frame(.SD)
  mem = components(g)$membership
  .(id_parcel = names(mem), mem = mem)
}, by = id_address]

dL1[d2, on = .(id_address, id_parcel), id_hh := paste(id_address, mem, sep = "_")]
})

#  user  system elapsed 
# 59.46    7.68   67.11

Alternative 2: compound variables and split

dL2 = copy(dL)

system.time({
dL2[ , `:=`(
  address_parcel = paste(id_address, id_parcel, sep = "_"),
  address_owner = paste(id_address, id_owner, sep = "_"))]

d3 = dL2[ , {
  g = graph_from_data_frame(.SD[ , .(address_owner, address_parcel)])
  mem = components(g)$membership
  c(tstrsplit(names(mem), "_", fixed = TRUE), .(mem = mem))
}]

dL2[d3, on = c(id_address = "V1", id_parcel = "V2"), id_hh := mem]
})

# user  system elapsed 
# 0.47    0.24    0.57

Test for equality:

all.equal(as.integer(as.factor(stringi::stri_pad_left(dL1$id_hh, 9, "0"))),
          dL2$id_hh) 
# TRUE
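
The zero-padding in the equality test is presumably there because factor levels are sorted as strings, so chunk "10" would not sort after chunk "2" without it (the longest id in the toy data, "10000.F_1", has 9 characters). A minimal base-R sketch of the effect, using `strrep` as a stand-in for `stringi::stri_pad_left` and three made-up ids in the same format:

```r
id <- c("1.A_1", "2.A_1", "10.A_1")   # chunk numbers 1, 2, 10

# Factor levels are sorted as strings, so "2..." lands after "10..."
plain <- as.integer(as.factor(id))

# Left-pad with zeros to a common width (base-R stand-in for stri_pad_left)
width <- max(nchar(id))
padded_id <- paste0(strrep("0", width - nchar(id)), id)

# After padding, string order agrees with numeric chunk order
padded <- as.integer(as.factor(padded_id))
padded
```

With padding, the integer codes follow the order of the rows, which is what makes them comparable with the consecutive ids from the second alternative.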

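【Comments】:

  • Your code processed my entire data set of about 800,000 rows in about 7 minutes.
  • @HugoPérilleuxSanchez Please note that I have updated the answer with a non-by alternative which is much faster, at least on toy data made by repeating the original data.
  • It would be interesting if you could test the new alternative on your real data. Does it work? Do you see a similar speed gain? Cheers
  • On my data set it finishes in about 5 seconds. Thank you! I will now try to understand your code!
  • Thanks for your feedback! Please don't hesitate to ask if you need clarification, or if you find any flaws. Cheers.
【Solution 2】:

A friend and colleague solved the problem in Fortran, and it is very fast (less than a second once the file is sorted and the program compiled):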

      program Nohousehold_
      character*1 Passer
      character*13 id_address,id_address1,id_address0,id_addressBL
      character*15 id_owner,id_owner1
      character*23 id_parcel
      character*15 L_owner(1000)
      character*28 L_owner_parcel(1000,1000)
      Integer No_owner, Nbparcel(1000)
      Integer Nohouseholdowner(1000)
      Integer NohouseholdTT
      id_address0='0 0          '
      id_addressBL='             '

      NohouseholdTT=0

      open(1,file='data_a.txt')
      open(2,file='RESULT3.TXT')
      Read(1,19)Passer             
 19   format(a1)

 81   read(1,11)id_address,id_owner,id_parcel
      if(id_address.eq.id_addressBL)go to 81
      id_owner1='ZZZZZZZZZZZZZZZ'
 82   continue
      if(id_address.eq.id_address0)then
        if(id_owner.ne.id_owner1)then
          NohouseholdTT=NohouseholdTT+1
          write(2,22)id_owner,NohouseholdTT
          id_owner1=id_owner
        endif
        read(1,11)id_address,id_owner,id_parcel
        go to 82
       else
        go to 83
      endif
     
 83   continue

      n=1
 11   format(a13,1x,a15,1x,a28)
 22   format(a15,1x,i10,1x,a28)

      id_address1=id_address
      id_owner1=id_owner
      No_owner=1
      L_owner(1)=id_owner
      Nbparcel(1)=1
      L_owner_parcel(1,1)=id_parcel
      Nohouseholdowner(1)=0
      n=1
      nFini=0
 14   continue
      n=n+1
      read(1,11,end=90)id_address,id_owner,id_parcel
      if(id_address.eq.id_address1)then
         if(id_owner.eq.id_owner1)then
            Nbparcel(No_owner)=Nbparcel(No_owner)+1
            L_owner_parcel(No_owner,Nbparcel(No_owner))=id_parcel
          else
            No_owner=No_owner+1
            L_owner(No_owner)=id_owner
            Nbparcel(No_owner)=1
            L_owner_parcel(No_owner,1)=id_parcel
            Nohouseholdowner(No_owner)=0
            id_owner1=id_owner
         endif
       ELSE
 556     continue
         do 1 Noowner1=1,No_owner 
          if(Nohouseholdowner(Noowner1).eq.0)then
            NohouseholdTT=NohouseholdTT+1
            Nohouseholdowner(Noowner1)=NohouseholdTT
            write(2,22)L_owner(Noowner1),NohouseholdTT
          endif
          do 2 Noowner2=(Noowner1+1),No_owner   
            do 201 Nbparcel1=1,Nbparcel(Noowner1)
              do 202 Nbparcel2=1,Nbparcel(Noowner2)
               if(L_owner_parcel(Noowner1,Nbparcel1).eq.L_owner_parcel(Noowner2,Nbparcel2))then
               if(Nohouseholdowner(Noowner2).eq.0)then
                Nohouseholdowner(Noowner2)=Nohouseholdowner(Noowner1)
                write(2,22)L_owner(Noowner2),Nohouseholdowner(Noowner2)
                go to 203
               endif
               endif
 202          continue
 201        continue 
 203        continue
 2        continue
 1       continue

         if(NFini.eq.1)go to 91

         id_address1=id_address
         id_owner1=id_owner
         No_owner=1
         L_owner(1)=id_owner
         Nbparcel(1)=1
         L_owner_parcel(1,1)=id_parcel
         Nohouseholdowner(1)=0

      endif
      go to 14
 90   continue
      NFini=1
      go to 556
 91   continue

      close(1)
      close(2)

      end

【Comments】:
