[Posted]: 2019-11-26 02:36:45
[Question]:
I have a large dataset of roughly 20 million observations. For each row, I want to compute the Jaccard index between TitleAbstract.x1 and TitleAbstract.y1.
Here is a sample of 2 observations:
structure(list(Patent = c(6326004L, 6514936L), TitleAbstract.x = c("mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof.",
"mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof."
), cited = c(4261928L, 4261928L), TitleAbstract.y = c("antiviral methods using fragments human rhinovirus receptor (icam-1) ",
"antiviral methods using human rhinovirus receptor (icam-1) method substantially inhibiting initiation spread infection rhinovirus coxsackie virus host cells expressing major human rhinovirus receptor (icam-1), comprising step contacting virus soluble polypeptide comprising hrv binding site domains ii icam-1; polypeptide capable binding virus reducing infectivity thereof; contact conditions permit virus bind polypeptide."
), Jaccard = c(0, 0.00909090909090909)), row.names = c(NA, -2L
), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7f9c8f801778>, sorted = "cited", .Names = c("Patent",
"TitleAbstract.x", "cited", "TitleAbstract.y", "Jaccard"))
Following an earlier post, I used a hand-written equation to compute the Jaccard index and wrapped it in a function that I then run with mapply, but I get the error 'this is not a function'.
Jaccard_Index <- function(x,y)
{
return(mapply(length(intersect(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+")))) / length(union(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+"))))))
}
mapply(Jaccard_Index,df$TitleAbstract.x1,df$TitleAbstract.y1)
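I tried replacing TitleAbstract.x1 and TitleAbstract.y1 with x and y, but I still get the same error.
This may be a beginner's question, but can anyone help me write the correct function?
I also have two further questions:
Q2: How can I use parallel and mcapply to speed this up?
Q3: What are R's limits in terms of memory and speed, and would you recommend a different approach (e.g. running Python from bash) for long-running, memory-intensive jobs?
Edit
I have uploaded the correct dataset; I had to update RStudio to keep the dataset from being truncated.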
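[Comments]:
-
The data you posted is invalid; it has been truncated. It would probably be best to subset it.
-
You named the function Jaccard_Index, but you are not using that name in the mapply() call.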
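
A minimal sketch of how the function could be written, not a definitive answer. In the posted code, the inner mapply() receives a number (the result of the division) as its first argument instead of a function, which is one likely source of the "not a function" error; the body also reads whole df columns rather than the x and y arguments. The version below works on one pair of strings and leaves the row-wise iteration to the outer mapply(). The name jaccard_index, the toy data frame, and n_cores are assumptions made for illustration; the column names follow the posted sample (TitleAbstract.x, TitleAbstract.y). On Q2, parallel::mcmapply() is the forking counterpart of mapply(); it is shown here with two workers on Unix-alikes and falls back to one worker on Windows, where forking is not available.

```r
library(parallel)

# Jaccard index of the word sets of two single strings.
# (hypothetical helper name; x and y are length-one character vectors)
jaccard_index <- function(x, y) {
  a <- unique(strsplit(x, "\\s+")[[1]])
  b <- unique(strsplit(y, "\\s+")[[1]])
  n_union <- length(union(a, b))
  # Guard against division by zero when both word sets are empty.
  if (n_union == 0) return(NA_real_)
  length(intersect(a, b)) / n_union
}

# Toy data standing in for the real table (made-up strings, same column names).
df <- data.frame(
  TitleAbstract.x = c("speed steering control system", "first crank mounted frame"),
  TitleAbstract.y = c("steering control valve", "human rhinovirus receptor"),
  stringsAsFactors = FALSE
)

# Sequential: mapply() passes one element of each column per call.
df$Jaccard <- mapply(jaccard_index, df$TitleAbstract.x, df$TitleAbstract.y,
                     USE.NAMES = FALSE)

# Parallel: same call shape, spread over forked workers.
# n_cores is an assumption; mc.cores > 1 is not supported on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
jaccard_par <- mcmapply(jaccard_index, df$TitleAbstract.x, df$TitleAbstract.y,
                        mc.cores = n_cores, USE.NAMES = FALSE)

print(df$Jaccard)
```

With 20 million rows, it may be worth running this on a small subset first to check timing and memory before scaling up the number of workers.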