【Question title】:Slow loop in R, how to make it faster?
【Posted】:2015-05-19 00:02:35
【Question description】:

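I have a list of emails that I would like to compare with each other, using the longest common substring distance to measure the similarity of the pattern between rows.

data is a data frame containing the emails: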

           V1
1   "01003@163.com"
2   "cloud@coldmail.com"
3   "den_smukk_kiilar@hotmail.com"
4   "Esteban.verduzco@gmail.com"
5   "freiheitmensch@gmail.com"
6   "mitsoanastos@yahoo.com"
7   "ahmedsir744@yahoo.com" 
8   ...

Here is my code:

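library(stringdist)

# Assumed setup (not shown in the original post): result frame and row counter
output3 <- data.frame(email1 = character(), email2 = character(),
                      stringsAsFactors = FALSE)
dfrow <- 1

# Stop at the second-to-last row so that (i + 1):nrow(data) never runs backwards
for (i in seq_len(nrow(data) - 1)) {
  sample <- data[i, ]
  for (j in (i + 1):nrow(data)) {
    # number of different characters: 3 (123.456 == 123.321)
    if (stringdist(data[j, ], sample, method = 'lcs') <= 3) {
      email1 <- as.character(data[i, ])
      email2 <- as.character(data[j, ])
      output3[dfrow, ] <- c(email1, email2)
      dfrow <- dfrow + 1
    }
  }
}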

The "output" is a data frame that lists the similar emails:

         email1          email2
1   "01079@163.com" "01069@163.com"

I have 300k emails, so this will take a very long time...

Is there a better way to do this?

Thanks!

【Question discussion】:

    Tags: r for-loop pattern-matching nested-loops reshape


    【Solution 1】:

    Here is an attempt:

    library(stringdist)
    library(stringi)
    library(dplyr)
    library(tidyr)
    
    # Hypothetical data frame     
    data <- data.frame(V1 = paste0(stri_rand_strings(5, 3, "[a-z]"), 
                                   "@", stri_rand_strings(5, 2, "[a-z]"), ".com"), 
                       stringsAsFactors = FALSE)
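
Before building the full matrix, it may help to check what method = "lcs" returns for a single pair. The sketch below reuses the two addresses from the question's expected output; the variable names a, b and d are only illustrative. The distance counts the characters that cannot be paired up between the two strings, so one substituted character costs 2 (one unpaired character on each side).

```r
library(stringdist)

# Two addresses that differ by a single substituted character ("7" vs "6")
a <- "01079@163.com"
b <- "01069@163.com"

# The lcs distance counts unpaired characters: one in each string here
d <- stringdist(a, b, method = "lcs")
print(d)  # 2
```

With a threshold of 3, this pair is therefore reported as similar.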
    

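    Basically, you build a pairwise matrix of string distances and wrap it in a data frame. You then replace every distance of 3 or less (excluding the zeros on the diagonal) with the corresponding V1 value, and everything else with NA. Finally, you drop the V1 column, which is no longer needed, gather() the data into tidy format, and remove the NAs.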

    data %>%
      data.frame(stringdistmatrix(.$V1, .$V1, useNames = TRUE, method = "lcs"), 
                 row.names = NULL) %>%
    
    #          V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
    #1 wnw@fa.com          0         10         10         10         10
    #2 kty@hm.com         10          0         10         10          8
    #3 brs@wk.com         10         10          0          8          8
    #4 pib@uo.com         10         10          8          0         10
    #5 ryu@iq.com         10          8          8         10          0
    
      # here you need to replace '8' with '3' for your example
      # (mutate_each() and funs() are deprecated in current dplyr; across() is the equivalent)
      mutate(across(-V1, ~ ifelse(.x <= 8 & .x != 0, V1, NA))) %>% 
    
    #          V1 wnw.fa.com kty.hm.com brs.wk.com pib.uo.com ryu.iq.com
    #1 wnw@fa.com         NA       <NA>       <NA>       <NA>       <NA>
    #2 kty@hm.com         NA       <NA>       <NA>       <NA> kty@hm.com
    #3 brs@wk.com         NA       <NA>       <NA> brs@wk.com brs@wk.com
    #4 pib@uo.com         NA       <NA> pib@uo.com       <NA>       <NA>
    #5 ryu@iq.com         NA ryu@iq.com ryu@iq.com       <NA>       <NA>
    
      select(-V1) %>%
      gather(email1, email2) %>%
      na.omit() %>%
      mutate(email1 = stri_replace_first(email1, fixed = ".", "@"))
    

    Which gives:

    #      email1     email2
    #1 kty@hm.com ryu@iq.com
    #2 brs@wk.com pib@uo.com
    #3 brs@wk.com ryu@iq.com
    #4 pib@uo.com brs@wk.com
    #5 ryu@iq.com kty@hm.com
    #6 ryu@iq.com brs@wk.com
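
One caveat on scale: a full pairwise matrix has n² entries, so 300k emails would need 9 × 10^10 distances, roughly 720 GB at 8 bytes each. A minimal sketch of an alternative that never holds the matrix is shown below: it keeps the outer loop but replaces the inner loop with one vectorized stringdist() call per email. The helper name similar_pairs and its max_dist argument are hypothetical, not part of stringdist, and the number of comparisons is still quadratic, so this only addresses memory and per-call overhead.

```r
library(stringdist)

# Hypothetical helper: returns pairs of emails whose lcs distance is <= max_dist
similar_pairs <- function(emails, max_dist = 3) {
  empty <- data.frame(email1 = character(), email2 = character(),
                      stringsAsFactors = FALSE)
  n <- length(emails)
  if (n < 2) return(empty)

  hits <- vector("list", n - 1)
  for (i in seq_len(n - 1)) {
    rest <- emails[(i + 1):n]
    # One vectorized call compares emails[i] against all later entries
    d <- stringdist(emails[i], rest, method = "lcs")
    keep <- d <= max_dist
    if (any(keep)) {
      hits[[i]] <- data.frame(email1 = emails[i], email2 = rest[keep],
                              stringsAsFactors = FALSE)
    }
  }

  # Bind the per-email results once, instead of growing a data frame row by row
  out <- do.call(rbind, hits)
  if (is.null(out)) empty else out
}

emails <- c("01079@163.com", "01069@163.com", "cloud@coldmail.com")
res <- similar_pairs(emails)
print(res)
```

Each pair appears once (email1 always precedes email2 in the input), unlike the matrix approach, which reports both orderings.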
    

    【Discussion】:

    • Thanks Steven, but this solution does not work for a list of 300k emails. It runs out of memory...