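【Question title】: Can I further vectorize this function
【Posted】: 2017-02-14 20:09:33
【Question description】:

I am relatively new to R and to matrix-based scripting languages. I wrote this function to return the indices of every row whose content is similar to the content of any other row. It is a primitive form of spam reduction that I am developing.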

if (!require("RecordLinkage")) install.packages("RecordLinkage")

library("RecordLinkage")

# Takes a column of strings, returns a vector of indices
check_similarity <- function(x) {
  threshold <- 0.8
  values <- NULL
  for(i in 1:length(x)) {
    values <- c(values, which(jarowinkler(x[i], x[-i]) > threshold))
  }
  return(values)
}

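Is there a way I could write this to avoid the for loop entirely?

【Question comments】:

@akrun Updated, cheers.
@d.b No, I am comparing each row against all the other rows: x[i], x[-i].
Maybe try this: m = as.matrix(sapply(x, jarowinkler, x)) > threshold; diag(m) = 0; which(rowSums(m)>0) There is no reproducible data for me to test with, but I think this will work.
@dww Works great, exactly what I was after, cheers. If you post it as an answer, I will mark it as correct.
Note that the main inefficiency in your code is not that you have a for loop, but that you grow a vector inside the for loop. See The R Inferno for an extended discussion. sapply solves this nicely: it puts the values in a preallocated list and then simplifies it for you, but in terms of efficiency you could just as well modify your for loop.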
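
The point in the last comment about growing a vector can be illustrated with a minimal sketch of the loop that preallocates a list instead. Everything named here is an assumption made for illustration: `check_similarity_prealloc` is a hypothetical variant of the question's function, and `edit_similarity` is a stand-in built on base R's `adist()` so that the sketch runs without RecordLinkage; it is not Jaro-Winkler and will score strings differently. Note also that this variant maps each hit back to its position in `x`, whereas `which()` applied to `x[-i]` returns positions within `x[-i]`.

```r
# Stand-in similarity (assumption, not Jaro-Winkler): 1 minus the
# edit distance from adist(), normalized by the longer string's length.
edit_similarity <- function(a, b) {
  1 - as.vector(adist(a, b)) / pmax(nchar(a), nchar(b))
}

# Hypothetical variant of check_similarity() that avoids growing a vector.
check_similarity_prealloc <- function(x, sim_fun = edit_similarity,
                                      threshold = 0.8) {
  hits <- vector("list", length(x))  # one preallocated slot per string
  for (i in seq_along(x)) {
    others <- seq_along(x)[-i]       # positions in x of the other strings
    hits[[i]] <- others[sim_fun(x[i], x[-i]) > threshold]
  }
  unlist(hits)                       # flatten once, after the loop
}
```

Passing `sim_fun = jarowinkler` (with RecordLinkage loaded) should give the comparison the question uses, assuming it accepts a single string and a vector as in the comment above.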

Tags: r for-loop vectorization

【Solution 1】:

We can use sapply to simplify the code a little.

# some test data
x = c('hello', 'hollow', 'cat', 'turtle', 'bottle', 'xxx')

# same cutoff as in the question's function
threshold = 0.8

# create an x by x matrix specifying which strings are alike
m = sapply(x, jarowinkler, x) > threshold

# set diagonal to FALSE: we're not interested in strings being identical to themselves
diag(m) = FALSE

# And find index positions of all strings that are similar to at least one other string
which(rowSums(m) > 0)
# [1] 1 2 4 5

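That is, this returns the index positions of 'hello', 'hollow', 'turtle' and 'bottle', because each of them is similar to another string.

If you prefer, you can use colSums instead of rowSums to get a named vector, but this can get messy if the strings are long: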

which(colSums(m) > 0)
# hello hollow turtle bottle 
#     1      2      4      5
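
For reuse, the steps above could be wrapped in a function that mirrors the question's `check_similarity`. This is only a sketch under stated assumptions: `similar_indices` and its `sim_fun` argument are hypothetical names, and `edit_similarity` is a base R stand-in built on `adist()` so the block runs without RecordLinkage. It is not Jaro-Winkler, so a given threshold will select different strings.

```r
# Stand-in similarity (assumption, not Jaro-Winkler): normalized edit distance.
edit_similarity <- function(a, b) {
  1 - as.vector(adist(a, b)) / pmax(nchar(a), nchar(b))
}

# Hypothetical wrapper around the matrix approach shown in this answer.
similar_indices <- function(x, sim_fun, threshold = 0.8) {
  n <- length(x)
  # column j holds the comparison of x[j] against every string in x
  m <- vapply(x, function(s) sim_fun(s, x) > threshold, logical(n))
  m <- matrix(m, nrow = n)  # keep a matrix shape even when n is 1
  diag(m) <- FALSE          # ignore each string's match with itself
  which(rowSums(m) > 0)     # strings similar to at least one other string
}

x <- c('hello', 'hollow', 'cat', 'turtle', 'bottle', 'xxx')
res <- similar_indices(x, edit_similarity, threshold = 0.6)
```

With RecordLinkage loaded, `similar_indices(x, jarowinkler)` should correspond to the code in this answer, since `jarowinkler` is called there in the same way.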

【Discussion】: