“R”ify 代码以节省内存和 CPU 功率和/或如何显示代码执行进度？ [关闭]答案

【问题标题】："R"ify code to save memory and CPU power and / or how to show progress on code execution? [closed]“R”ify 代码以节省内存和 CPU 功率和/或如何显示代码执行进度？ [关闭]
【发布时间】：2016-02-04 01:16:55
【问题描述】：

我编写了一个非常不像 R 的代码来进行一些相对简单的计算，但是虽然代码似乎可以工作，但它的效率非常低，而且我的计算机的 RAM 无法完成。

我有一个名为dat 的data.frame 有四列，包括firm_id (character)、pnum (numeric)、class (character) 和date ( Date)。我总共有 100,000 行：200 个不同的 firm_id、90,000 个唯一的 pnum、大约 31,000 个唯一的 class 和大约一样多的 date。

每个pnum 对单个firm_id 都是唯一的。 pnum 分配给多个 class 元素（因此如果有 5 个 class 元素，则 pnum 在 data.frame 中重复 5 次。date 不会因 pnum 而变化，而是多个 @987654343同一个firm_id的@可以在同一天。有zero NAs。

这是一个简短的dput

dput(dat[1:50])
structure(list(firm_id = c("A&O", "A&O", "A&O", "A&O", "A&O","A&O", "A&O",
                           "A&O", "A&O", "A&O", "A&O", "A&O", "A&O", "A&O", 
                           "A&O", "A&O", "A&O", "A&O", "A&O", "A&O", "A&O", 
                           "A&O", "A&O", "A&O", "A&O", "A&O", "A&O", "A&O",
                           "A&O", "A&O", "A&O", "A&O", "A&O", "A&O", "AAT", 
                           "AAT", "AAT", "AAT", "AAT", "AAT", "AAT", "AAT", 
                           "AAT", "AAT", "AAT", "AAT", "AAT", "AAT", "AAT", "AAT"), 
               pnum = c(5259588, 5259588, 5259588, 5259588, 7049668, 7049668,
                        7049668, 7049668, 7049668, 7049668, 7049668, 7183616, 
                        7183616, 7183616, 7183616, 7183616, 7183616, 7183616,
                        7183616, 7208818, 7208818, 7208818, 7208818, 7208818,
                        7208818, 7208818, 7122882, 7122882, 7122882, 7122882, 
                        7122882, 7122882, 7122882, 7122882, 6413822, 6413822, 
                        6413822, 6413822, 6413822, 6413822, 6413822, 6413822, 6413822,
                        6750507, 6750507, 6750507, 6750507, 6750507, 6750507, 6750507),
               class = c("184/1.5", "222/110", "251/100", "251/324", "257/330",
                         "257/331", "257/401", "257/E29.027", "257/E29.134",
                         "257/E29.136", "257/E29.146", "257/341", "257/342", 
                         "257/343", "257/401", "257/723", "257/E23.052", 
                         "257/E23.079", "257/E25.016", "257/666", "257/676", 
                         "257/E23.024", "257/E23.026", "257/E23.037", "257/E23.044",
                         "257/E23.052", "257/48", "257/666", "257/723", "257/778", 
                         "257/E23.052", "257/E25.013", "257/E25.029", "257/E29.267",
                         "257/331", "257/E29.027", "257/E29.066", "257/E29.133",
                         "257/E29.146", "438/268", "438/270", "438/272", "438/430",
                         "257/302", "257/328", "257/330", "257/396",
                         "257/E29.027", "257/E29.066", "257/E29.133"), 
               date = structure(c(7953, 7953, 7953, 7953, 10463, 10463, 
                                  10463, 10463, 10463, 10463, 10463, 11777,
                                  11777, 11777, 11777, 11777, 11777, 11777,
                                  11777, 12619, 12619, 12619, 12619, 12619, 
                                  12619, 12619, 12724, 12724, 12724, 12724,
                                  12724, 12724, 12724, 12724, 10703, 10703,
                                  10703, 10703, 10703, 10703, 10703, 10703, 
                                  10703, 10703, 10703, 10703, 10703, 10703,
                                  10703, 10703), class = "Date")), 
          .Names = c("firm_id", "pnum", "class", "date"),
          class = c("data.table", "data.frame"), 
          row.names = c(NA, -50L))

代码旨在执行以下操作：

对于每一个firm_id：

查看每个pnum 的class 元素。
将class 元素与相同firm_id 的每个pnum 的class 元素进行比较，前提是dates（与各自的pnum 相关联）之间的差距是被比较的时间小于 5 年。 [用大写字母添加，以避免混淆。下面 Michael 提供的解决方案将焦点 pnum 与公司前五年的投资组合 pnum 进行了比较]

我当前的代码如下所示：（我知道，相信我我知道）

#Step 1: Create a vector of unique firms and a data.frame with all
#  `pnum`, `firm_id`, and `date` but without the `class` data

firms <- (unique(dat$firm_id))
patents <- data.frame(unique(dat$pnum))
patents$id <- dat$firm_id[match(patents$unique.dat.pnum, dat$pnum)]
patents$date <- dat$date[match(patents$unique.dat.pnum, dat$pnum)]
colnames(patents) <- c("pnum", "id", "date")

#Step 2: Set-up variables needed to store the results 

library(gtools)
startrow <- 0
df <- data.frame()

#Step 3: Loop around all firms

for(i in 1:length(firms)){
  startrow <- startrow + length(patents$id[patents$id == firms[i - 1]])
  subdat <- dat[dat$firm_id == firms[i]]
  subpat <- unique(subdat$pnum)
  dt <- data.frame()

#Step 4: Find which of the `pnum` fit within the 5 year time frame
    for(j in 1:length(subpat)){    # Number of unique patents in subdat

      class.now <- subdat$class[subdat$pnum == subpat[j]]
      ref.pat <- unique(subdat$pnum[(subdat$date > (patents$date[startrow + j] - 5*365) & subdat$date < (patents$date[startrow + j]))])
      if (invalid(ref.pat) == T ) ref.pat <- NA
          m <- data.frame(cbind(orig.pat = rep(patents$pnum[startrow + j],length(ref.pat))),ref.pat = NA, jac = NA)

#Step 5: Compare the focal `pnum` with each of the prior ones within the
#  5 year time frame and calculate a Jaccard index

              for(k in 1:length(ref.pat)){
                class.ref <- subdat$class[subdat$pnum == ref.pat[k]]
                m$ref.pat[k] <- ref.pat[k]
                m$jac[k] <- sum(class.now %in% class.ref)/(length(class.now) + length(class.ref) - length(class.now %in% class.ref))}
            dt <- data.frame(rbind(dt, m)) ; rm(m)           
        } 
    df <- data.frame(rbind(df, dt))
    rm(dt) ; print(i)
  }

编辑 1：代码生成一个 data.frame df，第一列 origpat 包含原始 pnum，第二列 ref.pat 包含与 orig.pat 进行比较的专利，并且第三列 jac orig.pat 和 ref.pat 的 Jaccard 索引。

非常欢迎任何关于使这项工作更好的建议！

【问题讨论】：

这个问题本身没有问题，但我投票将其移交给代码审查，在那里它可能会得到更好的接受。
非常有意义。老实说，我不知道有一个单独的代码审查论坛......

标签： r for-loop data.table dplyr

【解决方案1】：

我相信这是正确的，我承认我在您的原始代码和描述中迷失了方向。

# helper function for conciseness below
jac <- function(cn, cr) sum(idx <- cn %in% cr) /
  (length(cn) + length(cr) - length(idx))

setkey(dat, pnum, firm_id) #for faster exclusive subsetting

dat[ , {x<-date[1]; cs <- class #assign these so there's no scoping issue below
#Now that we're within a (firm_id, pnum) subset,
#  we go back to the original table and subset to
#  the _same_ firm but NOT the same patent
#  (note that the current firm and pnum are stored
#   in .BY[[1]] and .BY[[2]], respectively)
dat[firm_id == .BY[[1]] & pnum != .BY[[2]]
      #having subsetted to everything by the same firm
      #  (except things with the same `pnum`), we
      #  check which `pnum` satisfy the within-five-years
      #  criterium; for those that do, we calculate `jac`
      ][abs(date - x) <= 365 * 5,  jac(cs, class)]}, 
by = .(firm_id, pnum)] #we do this for each `firm_id` and `pnum`

【讨论】：

感谢 Michael，稍后会尝试运行它，但我想先了解一件事（感谢编辑！）：结果存储在哪里？我的代码（我忘了描述）的结果创建了一个新的 data.frame df，其中焦点专利和与之比较的专利存储在它们各自的 Jaccard 索引旁边的矩阵中。我的代码目前正在运行，但仅在一个多小时内完成了 8 个firm_id。
@SJDS 你可以简单地分配这个操作的结果 -- prepend result <- dat[...]
@SJDS 你说df 有“焦点”专利和对比专利？我想我误解了，然后（我自己对 Jaccard 索引是什么的天真）。现有代码将每个公司的专利与五年内的所有其他专利作为一个整体（而不是一个接一个）进行比较。我不认为编码的区别太难了。你的目标是一对一？
是的，您的代码刚刚运行完毕，我注意到它确实对焦点专利和过去五年的投资组合进行了比较。我编写了另一个代码（带有 for 循环），它像您一样为投资组合计算它，并存储重叠类的数量、焦点专利中的类数以及投资组合中的类数。令我惊讶的是，我的双 for 循环运行与投资组合建立比较所需的时间要少得多。通过我的数据集运行您的代码大约需要 90 分钟...
pnum 与 pnum 的比较在我看来更加困难，因为计算限制。对于一个 firm_id，dat 中的最高行数刚刚超过 80,000（类总数），对于大约 15,000 个唯一 pnum。因此，pnum by pnum 数据帧呈指数增长。一个好的解决方案是自动将 Jaccard 为零的所有行重新整形为一条只有焦点 pnum 的行，并计算其他 pnum 的数量，Jaccard 为零