Get unique strings from a vector of similar strings

【Question title】: Get unique string from a vector of similar strings
【Posted】: 2014-08-22 12:41:06
【Question description】:

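I'm not quite sure how to phrase this question. I have just started working with a bunch of tweets and have done some basic cleaning. Some of the tweets now look like this: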

x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")

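Basically, I want to remove duplicates by checking whether the beginning of one string matches another, and keep only the longest string. In this case, my result should be: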

[1] "stackoverflow is a great site"
[2] "omg it is friday and so sunny"
[3] "arggh how annoying"

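because everything else is a truncated duplicate of one of the strings above. I tried the unique() function, but it does not return what I want because it compares the strings over their full length. Any pointers?

I am using R version 3.1.1 on Mac OS X 10.7...

Thanks!

【Question discussion】:

Tags: r string unique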

【Solution 1】:

Here is another option. I have added one string to your example data.

x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"stackoverflow is an OK site",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")

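Filter(function(y) {
    # Truncate every other string to the length of y
    x2 <- sapply(setdiff(x, y), substr, start=1, stop=nchar(y))
    # Keep y only if none of the truncated strings equals it
    ! duplicated(c(y, x2), fromLast=TRUE)[1]
}, x)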


# [1] "stackoverflow is a great site" "stackoverflow is an OK site"   "omg it is friday and so sunny"
# [4] "arggh how annoying"

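【Discussion】:

Thanks, Matthew. I think it works fine. It just looks complicated and I don't understand what is happening in there. How does it avoid treating "stackoverflow is a great site" and "stackoverflow is an OK site" as one string? Sorry, and thanks.
I misunderstood your question. So you only want to match on the first word?
No, no, you didn't misunderstand. It works perfectly. What I meant was: how does it tell "stackoverflow is a great site" apart from "stackoverflow is an ok site"? I was just wondering what your code is doing, but it does solve the problem.
The idea is that, for each string, you truncate all the other strings to the same length. If there is a duplicate, you know that a longer version of the current string exists, so you reject the current string. If there is no duplicate, the current string is the longest, so you keep it.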
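
The explanation above can be spelled out as an explicit loop. This is only an illustrative sketch of the same idea, not the answerer's code; the helper name keep_longest() is made up for this example. Like the Filter() version, it compares each string only against strings that differ from it, so exact duplicates all survive; run unique() first if that matters.

```r
x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

# keep_longest() is a hypothetical helper name used only for this sketch.
keep_longest <- function(x) {
  keep <- logical(length(x))
  for (i in seq_along(x)) {
    y <- x[i]
    others <- setdiff(x, y)                    # every string that differs from y
    truncated <- substr(others, 1, nchar(y))   # cut each one down to y's length
    keep[i] <- !(y %in% truncated)             # no match: nothing longer starts with y
  }
  x[keep]
}

result <- keep_longest(x)
print(result)
```

"stackoverflow is an OK site" is kept because no other string, once cut to its length, is equal to it: the two stackoverflow strings diverge at "a great" versus "an OK".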

【Solution 2】:

Here is my attempt:

library(stringr)
x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
[1] "stackoverflow is a great site" "omg it is friday and so sunny" "arggh how annoying"

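Basically, I exclude any string that is already contained in another string. This may differ slightly from what you described, but it is roughly the same and very simple.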
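
The difference mentioned above matters when a short string occurs in the middle of a longer one. Note also that str_detect() treats its pattern as a regular expression, so tweets containing characters such as "?" or "(" may not match literally. A possible prefix-only variant in base R is sketched below. The helper name is_truncated_prefix() and the extra "friday" element are made up for illustration and are not part of the answer.

```r
# is_truncated_prefix() is a hypothetical helper used only for this sketch.
# It returns TRUE for each string that is the literal beginning of some
# strictly longer string in the same vector.
is_truncated_prefix <- function(x) {
  vapply(seq_along(x), function(i) {
    others <- x[-i]
    any(nchar(others) > nchar(x[i]) &
        substr(others, 1, nchar(x[i])) == x[i])
  }, logical(1))
}

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "friday",                # made-up element: a substring, but not a prefix
       "arggh how annoying")

kept <- x[!is_truncated_prefix(x)]
print(kept)
```

Here "friday" is kept because it is not the beginning of any other string, whereas a contains-anywhere check would drop it. substr() is used rather than startsWith() because the question mentions R 3.1.1, which predates startsWith().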

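【Discussion】:

Thanks, Tony. It seems to work, but for shorter strings: my vector has about 1,000 strings and this stripped it down to 650. I didn't think there were that many duplicates to strip, but I will have to check manually to get an idea.
You can use sort(x[sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]) to view those duplicates.
It looks like I do have a lot of duplicates, so you were right. Thanks, this works, and being able to see what it filtered out is cool.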

【Solution 3】:

@tonytonov's solution is nice, but I suggest using the stringi package :)

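library(stringi)
library(stringr)
library(microbenchmark)

# Literal (fixed) substring matching with stringi
stringi <- function(x){
  x[!sapply(seq_along(x), function(i) any(stri_detect_fixed(x[-i], x[i])))]
}

# Regular-expression matching with stringr, as in Solution 2
stringr <- function(x){
  x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
}

microbenchmark(stringi(x), stringr(x))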
Unit: microseconds
       expr     min       lq   median       uq      max neval
 stringi(x)  52.482  58.1760  64.3275  71.9630  120.374   100
 stringr(x) 538.482 551.0485 564.3445 602.7095 1736.601   100

【Discussion】: