【发布时间】:2016-02-04 01:16:55
【问题描述】:
我编写了一个非常不像 R 的代码来进行一些相对简单的计算,但是虽然代码似乎可以工作,但它的效率非常低,而且我的计算机的 RAM 无法完成。
我有一个名为dat 的data.frame 有四列,包括firm_id (character)、pnum (numeric)、class (character) 和date ( Date)。我总共有 100,000 行:200 个不同的 firm_id、90,000 个唯一的 pnum、大约 31,000 个唯一的 class 和大约一样多的 date。
每个pnum 对单个firm_id 都是唯一的。 pnum 分配给多个 class 元素(因此如果有 5 个 class 元素,则 pnum 在 data.frame 中重复 5 次。date 不会因 pnum 而变化,而是多个 @987654343同一个firm_id的@可以在同一天。有zero NAs。
这是一个简短的dput
dput(dat[1:50])
structure(list(firm_id = c("A&O", "A&O", "A&O", "A&O", "A&O","A&O", "A&O",
"A&O", "A&O", "A&O", "A&O", "A&O", "A&O", "A&O",
"A&O", "A&O", "A&O", "A&O", "A&O", "A&O", "A&O",
"A&O", "A&O", "A&O", "A&O", "A&O", "A&O", "A&O",
"A&O", "A&O", "A&O", "A&O", "A&O", "A&O", "AAT",
"AAT", "AAT", "AAT", "AAT", "AAT", "AAT", "AAT",
"AAT", "AAT", "AAT", "AAT", "AAT", "AAT", "AAT", "AAT"),
pnum = c(5259588, 5259588, 5259588, 5259588, 7049668, 7049668,
7049668, 7049668, 7049668, 7049668, 7049668, 7183616,
7183616, 7183616, 7183616, 7183616, 7183616, 7183616,
7183616, 7208818, 7208818, 7208818, 7208818, 7208818,
7208818, 7208818, 7122882, 7122882, 7122882, 7122882,
7122882, 7122882, 7122882, 7122882, 6413822, 6413822,
6413822, 6413822, 6413822, 6413822, 6413822, 6413822, 6413822,
6750507, 6750507, 6750507, 6750507, 6750507, 6750507, 6750507),
class = c("184/1.5", "222/110", "251/100", "251/324", "257/330",
"257/331", "257/401", "257/E29.027", "257/E29.134",
"257/E29.136", "257/E29.146", "257/341", "257/342",
"257/343", "257/401", "257/723", "257/E23.052",
"257/E23.079", "257/E25.016", "257/666", "257/676",
"257/E23.024", "257/E23.026", "257/E23.037", "257/E23.044",
"257/E23.052", "257/48", "257/666", "257/723", "257/778",
"257/E23.052", "257/E25.013", "257/E25.029", "257/E29.267",
"257/331", "257/E29.027", "257/E29.066", "257/E29.133",
"257/E29.146", "438/268", "438/270", "438/272", "438/430",
"257/302", "257/328", "257/330", "257/396",
"257/E29.027", "257/E29.066", "257/E29.133"),
date = structure(c(7953, 7953, 7953, 7953, 10463, 10463,
10463, 10463, 10463, 10463, 10463, 11777,
11777, 11777, 11777, 11777, 11777, 11777,
11777, 12619, 12619, 12619, 12619, 12619,
12619, 12619, 12724, 12724, 12724, 12724,
12724, 12724, 12724, 12724, 10703, 10703,
10703, 10703, 10703, 10703, 10703, 10703,
10703, 10703, 10703, 10703, 10703, 10703,
10703, 10703), class = "Date")),
.Names = c("firm_id", "pnum", "class", "date"),
class = c("data.table", "data.frame"),
row.names = c(NA, -50L))
代码旨在执行以下操作:
对于每一个firm_id:
- 查看每个
pnum的class元素。 - 将
class元素与相同firm_id的每个pnum的class元素进行比较,前提是dates(与各自的pnum相关联)之间的差距是被比较的时间小于 5 年。 [用大写字母添加,以避免混淆。下面 Michael 提供的解决方案将焦点pnum与公司前五年的投资组合pnum进行了比较]
我当前的代码如下所示:(我知道,相信我我知道)
#Step 1: Create a vector of unique firms and a data.frame with all
# `pnum`, `firm_id`, and `date` but without the `class` data
firms <- (unique(dat$firm_id))
patents <- data.frame(unique(dat$pnum))
patents$id <- dat$firm_id[match(patents$unique.dat.pnum, dat$pnum)]
patents$date <- dat$date[match(patents$unique.dat.pnum, dat$pnum)]
colnames(patents) <- c("pnum", "id", "date")
#Step 2: Set-up variables needed to store the results
library(gtools)
startrow <- 0
df <- data.frame()
#Step 3: Loop around all firms
for(i in 1:length(firms)){
startrow <- startrow + length(patents$id[patents$id == firms[i - 1]])
subdat <- dat[dat$firm_id == firms[i]]
subpat <- unique(subdat$pnum)
dt <- data.frame()
#Step 4: Find which of the `pnum` fit within the 5 year time frame
for(j in 1:length(subpat)){ # Number of unique patents in subdat
class.now <- subdat$class[subdat$pnum == subpat[j]]
ref.pat <- unique(subdat$pnum[(subdat$date > (patents$date[startrow + j] - 5*365) & subdat$date < (patents$date[startrow + j]))])
if (invalid(ref.pat) == T ) ref.pat <- NA
m <- data.frame(cbind(orig.pat = rep(patents$pnum[startrow + j],length(ref.pat))),ref.pat = NA, jac = NA)
#Step 5: Compare the focal `pnum` with each of the prior ones within the
# 5 year time frame and calculate a Jaccard index
for(k in 1:length(ref.pat)){
class.ref <- subdat$class[subdat$pnum == ref.pat[k]]
m$ref.pat[k] <- ref.pat[k]
m$jac[k] <- sum(class.now %in% class.ref)/(length(class.now) + length(class.ref) - length(class.now %in% class.ref))}
dt <- data.frame(rbind(dt, m)) ; rm(m)
}
df <- data.frame(rbind(df, dt))
rm(dt) ; print(i)
}
编辑 1:代码生成一个 data.frame df,第一列 origpat 包含原始 pnum,第二列 ref.pat 包含与 orig.pat 进行比较的专利,并且第三列 jac orig.pat 和 ref.pat 的 Jaccard 索引。
非常欢迎任何关于使这项工作更好的建议!
【问题讨论】:
-
这个问题本身没有问题,但我投票将其移交给代码审查,在那里它可能会得到更好的接受。
-
非常有意义。老实说,我不知道有一个单独的代码审查论坛......
标签: r for-loop data.table dplyr