【Question title】:How to speed up the computation of the intersections between each pair of sets for a large number of pairs
【Posted】:2023-02-05 05:11:00
【Question】:

I have the following data frame:

> str(database)
'data.frame':   8547287 obs. of  4 variables:
 $ cited_id       : num  4.06e+08 5.41e+07 5.31e+07 5.04e+07 3.79e+08 ...
 $ cited_pub_year : num  2014 1989 2002 2002 2015 ...
 $ citing_id      : num  3.34e+08 3.37e+08 4.06e+08 4.19e+08 4.25e+08 ...
 $ citing_pub_year: num  2011 2011 2013 2014 2014 ...

The variables cited_id and citing_id contain the IDs of the objects from which this database was obtained.

Here is a sample of the data frame:

    cited_id cited_pub_year citing_id citing_pub_year
1  405821349           2014 419185055            2011
2  405821349           1989 336621202            2011
3   53148996           2002 406314162            2013
4   53148996           2002 419185055            2014
5  379369076           2015 424901495            2014
6   53148996           2011 441055669            2015
7  405821349           2014 447519383            2015
8  405821349           2015 469644221            2016
9  329268142           2014 470861263            2016
10  45433355           2008  55422577            2008

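For example, ID 405821349 has been cited by 419185055, 336621202, 447519383, and 469644221. For each pair of IDs, I want to compute the intersection of the sets of IDs that cite them. The quantity Pj.k below is the size of that intersection. I tried the following code: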

total_id<-c(database$cited_id,database$citing_id)
total_id<-unique(total_id)


df<-data.frame(data_k=character(),data_j=character(),Pj.k=numeric(),
               stringsAsFactors = F)
                            

for (k in 1:(length(total_id)-1)) {
  data_k<-total_id[k]
  citing_data_k<-database[database$cited_id==data_k,]
  
  for (j in (k+1):length(total_id)) {
    data_j<-total_id[j]
    citing_data_j<-database[database$cited_id==data_j,]
    Pj.k<-length(intersect(citing_data_j$citing_id,citing_data_k$citing_id))
    dfxx=data.frame(data_k=data_k,data_j=data_j,Pj.k=Pj.k,
                    stringsAsFactors = F)
    df<-rbind(df,dfxx)
  }
  
}

However, this takes far too long! How can I speed it up?

【Comments】:

    Tags: r performance intersection


    【Solution 1】:

    Inspired by the answers to "Count combinations of categorical variables, regardless of order, in R?", count the pairs:

    database = read.table(header = TRUE, stringsAsFactors = FALSE, text = 
    "cited_id cited_pub_year citing_id citing_pub_year
    1  405821349           2014 419185055            2011
    2  405821349           1989 336621202            2011
    3   53148996           2002 406314162            2013
    4   53148996           2002 419185055            2014
    5  379369076           2015 424901495            2014
    6   53148996           2011 441055669            2015
    7  405821349           2014 447519383            2015
    8  405821349           2015 469644221            2016
    9  329268142           2014 470861263            2016
    10  45433355           2008  55422577            2008")
    
    database |>
      dplyr::count(pairs = paste(pmin(cited_id, citing_id), 
                                 pmax(cited_id, citing_id)))
    #>                  pairs n
    #> 1  329268142 470861263 1
    #> 2  336621202 405821349 1
    #> 3  379369076 424901495 1
    #> 4  405821349 419185055 1
    #> 5  405821349 447519383 1
    #> 6  405821349 469644221 1
    #> 7    45433355 55422577 1
    #> 8   53148996 406314162 1
    #> 9   53148996 419185055 1
    #> 10  53148996 441055669 1
    

    Depending on what you actually need, you may also find with(database, table(cited_id = cited_id, citing_id = citing_id)) useful.
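
The count above tallies each (cited, citing) pair that occurs in the data. If what is wanted is Pj.k as defined in the question, that is, the number of citing IDs shared by two cited IDs, one possible approach is a self-join on citing_id instead of a double loop. The following is a minimal base R sketch under that assumption; the names edges, pairs, res, cited_id_k, and cited_id_j are illustrative and do not come from the original post. It returns only pairs with a non-empty intersection; every pair it omits has Pj.k = 0.

```r
# Sample data from the question (the leading row numbers become row names).
database <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"cited_id cited_pub_year citing_id citing_pub_year
1  405821349           2014 419185055            2011
2  405821349           1989 336621202            2011
3   53148996           2002 406314162            2013
4   53148996           2002 419185055            2014
5  379369076           2015 424901495            2014
6   53148996           2011 441055669            2015
7  405821349           2014 447519383            2015
8  405821349           2015 469644221            2016
9  329268142           2014 470861263            2016
10  45433355           2008  55422577            2008")

# Keep each (cited, citing) link once so that duplicates are not double counted.
edges <- unique(database[, c("cited_id", "citing_id")])

# Self-join on citing_id: one row per (citing_id, cited_id_k, cited_id_j).
pairs <- merge(edges, edges, by = "citing_id", suffixes = c("_k", "_j"))

# Keep each unordered pair once and drop self-pairs.
pairs <- pairs[pairs$cited_id_k < pairs$cited_id_j, ]

# Count shared citing IDs per pair (assumes at least one such pair exists;
# aggregate() stops with an error on an empty data frame).
res <- aggregate(citing_id ~ cited_id_k + cited_id_j, data = pairs, FUN = length)
names(res)[3] <- "Pj.k"
res
```

On the sample data, the only pair sharing a citing ID is 53148996 and 405821349, both cited by 419185055. The join grows with the square of the number of references each citing ID makes, so on the full 8.5 million rows it may still need a lot of memory; a sparse-matrix cross-product or a data.table join follows the same idea and may scale better.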

    【Comments】:
