Remove rows with similar (not identical) strings in R

【Question title】: Remove rows with similar (not identical) strings in R
【Posted】: 2021-02-14 00:06:20
【Question description】:

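I have a large number of Word files imported into R as text (each report in one cell), with an ID for each subject.

I then used the distinct function from dplyr to remove the duplicates.

However, some reports are the same apart from slight differences (e.g., a few extra or missing words, extra spaces), so dplyr does not treat them as duplicates. Is there an efficient way to remove "highly similar" items in R?

This creates an example dataset (greatly simplified compared with the original data I am working on):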

d = structure(list(ID = 1:8, text = c("The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.", 
                                      "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.", 
                                      "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.", 
                                      "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.", 
                                      "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
)), class = "data.frame", row.names = c(NA, -8L))

This is the dplyr code that removes exact duplicates. However, you will notice that items 3, 7 and 8 are almost the same:

library(dplyr)

d %>% 
  distinct(text, .keep_all = T) %>% 
  View()

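It looks like there is a like function in dplyr, but I could not find out how to apply it here (it also seems to work only for short strings such as single words): dplyr filter() with SQL-like %wildcard%

There is also a package, tidystringdist, that can compute how similar two strings are, but I could not find a way to apply it here to remove items that are similar but not identical. https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html

Any suggestions or guidance at this point?

Update:

It looks like the stringdist package can solve this problem, as suggested by the users below.

This question on the RStudio website deals with a similar problem, although the desired output is a bit different. I applied their code to my data and was able to identify the similar items. https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2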

library(tidystringdist)
library(tidyverse)

# First remove any exact duplicates (do not pipe into View() here,
# otherwise d is overwritten with View()'s NULL return value):
d <- d %>% 
  distinct(text, .keep_all = TRUE)

# This identifies the similar ones and places them in a vector called match:
match <- d %>% 
  tidy_comb_all(text) %>% 
  tidy_stringdist() %>% 
  filter(soundex == 0) %>% # Set a threshold
  gather(x, match, starts_with("V")) %>% 
  .$match

# create negate function of %in%:

 `%!in%` = Negate(`%in%`)

# this will remove those in the `match` out of `d` :
d2 = d %>% 
  filter(text %!in% match) %>% 
  arrange(text)

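With the code above, d2 has no duplicate or similar items at all, but I would like to keep one copy of each.

Any ideas on how to keep one copy (e.g., only the first occurrence)?

【Comments】:

Depending on how the strings differ, you may have a few different strategies. I suggest you give us a reproducible example (a set of strings, dput output, or even a reprex). You could also convert the strings to lists and check the overlap between them. Whitespace is probably easy to handle with stringr::str_replace_all(your_string, "\\s+", " ").
@AurelianoGuedes Thanks for your reply. The reprex above should serve as a basic example. I can remove extra spaces and new paragraphs, change to lower case, and so on, which will partly help, as you mention. But the problem is that the extra words vary too much between strings to specify one regular expression that captures them all. I searched and was about to write a function to do the job, but I wanted to ask here first because there might be a general-purpose function that I am not aware of.
Here is a related question, but it does not give details (at least not for the average user): stackoverflow.com/questions/6683380/…
@Bahi8482 That question is 8 years old. Admittedly it was asked by a prominent member of the R answerer community, but it has no up-to-date answers about currently available R packages.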
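
The whitespace and case clean-up discussed in the comments above can be sketched in base R as follows. This is only a minimal illustration: `normalize_text` and `d_small` are hypothetical names that do not appear in the thread, and the step catches only differences in case and spacing, not changed words.

```r
# Hypothetical helper (not from the thread): lower-case the text,
# collapse runs of whitespace to a single space, and trim both ends.
normalize_text <- function(x) {
  trimws(gsub("\\s+", " ", tolower(x)))
}

# Small illustrative data set (an assumption, shortened from the question's data).
d_small <- data.frame(
  ID = 1:3,
  text = c("All plastics are polymers but not all polymers are plastic.",
           "all plastics are polymers   but not all polymers are plastic.",
           "Plastics are usually poor conductors of heat and electricity."),
  stringsAsFactors = FALSE
)

# Compare the normalized text, but keep the original text of the first occurrence.
d_clean <- d_small[!duplicated(normalize_text(d_small$text)), ]
print(d_clean$ID)
```

Rows that differ in actual wording still need a string-distance step after this clean-up.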

Tags: r filter dplyr duplicates similarity

【Solution 1】:

library(stringdist)


# Keep the unique report texts as a character vector (the column is named "text")
dd <- d[["text"]][ !duplicated( d[["text"]] ) ]
print(dd)
# --------------
[1] "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method."                                                                                                                                                                              
[2] "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength."                                                                                                                                                                                                          
[3] "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."    
[4] "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains." 
[5] "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."

unname( sapply(dd, stringdist, dd, method="dl") )
#------------------
     [,1] [,2] [,3] [,4] [,5]
[1,]    0  105  231  235  235
[2,]  105    0  234  238  238
[3,]  231  234    0   10    5
[4,]  235  238   10    0   13
[5,]  235  238    5   13    0

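So rows 3, 4 and 5 are all "similar". The distance scales with string length (it can be at most the length of the longer string), so a fixed cutoff is only a rough rule, but for this case an upper limit of 20 seems sufficient. A proper solution would use some ratio of the "distance" to the nchar of the elements of that vector.

This is not offered as a final solution, but as steps 1 and 2 of a 4-step process.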
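
One possible way to finish the remaining steps, shown here only as a sketch and not as the answerer's method, is to walk through the rows in order and keep a row only when its text is not close to any text already kept. The distance is divided by the length of the longer string, as the answer suggests. The names `drop_near_duplicates` and `d_small` and the threshold `max_ratio = 0.15` are assumptions chosen for illustration; the threshold would need tuning on real data.

```r
library(stringdist)

# Hypothetical helper: keep a row only if its text differs by more than
# `max_ratio` from every text kept so far, so the first occurrence wins.
drop_near_duplicates <- function(df, col = "text", max_ratio = 0.1, method = "dl") {
  txt <- df[[col]]
  keep <- logical(length(txt))
  for (i in seq_along(txt)) {
    kept <- txt[keep]
    if (length(kept) == 0) {
      keep[i] <- TRUE
      next
    }
    # Edit distance divided by the length of the longer string, so the
    # threshold does not depend on how long the reports are.
    ratio <- stringdist(txt[i], kept, method = method) /
      pmax(nchar(txt[i]), nchar(kept))
    keep[i] <- all(ratio > max_ratio)
  }
  df[keep, , drop = FALSE]
}

# Small illustrative data set (an assumption, shortened from the question's data).
d_small <- data.frame(
  ID = 1:4,
  text = c("Plastics are usually poor conductors of heat and electricity.",
           "All plastics are polymers but not all polymers are plastic.",
           "All plastics are polymers however not all polymers are plastic.",
           "all plastics are polymers   but not all polymers are plastic."),
  stringsAsFactors = FALSE
)

d2 <- drop_near_duplicates(d_small, max_ratio = 0.15)
print(d2$ID)
```

The loop compares each row with every row kept so far, so the cost grows quadratically with the number of distinct reports; removing exact duplicates first keeps that manageable.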

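【Comments】:

Thank you very much. This is definitely a useful step. I have added code that uses stringdist to identify the similar items. Can you think of a way to remove them from the dataset while keeping the first occurrence?

【Solution 2】:

I believe this package is what you are looking for: fuzzyjoin.

It provides many fuzzy distance functions; basically, two entries count as "similar" if the fuzzy distance between them is small.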
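
As a rough sketch of how fuzzyjoin might be applied here (the answer gives no code, so this is an assumption rather than the answerer's approach), the data frame can be joined to itself with `stringdist_inner_join()`, and every row that has a similar row with a smaller ID can then be dropped. `d_small` and the cutoff `max_dist = 10` are illustrative choices.

```r
library(dplyr)
library(fuzzyjoin)

# Small illustrative data set (an assumption, shortened from the question's data).
d_small <- data.frame(
  ID = 1:4,
  text = c("Plastics are usually poor conductors of heat and electricity.",
           "All plastics are polymers but not all polymers are plastic.",
           "All plastics are polymers however not all polymers are plastic.",
           "all plastics are polymers   but not all polymers are plastic."),
  stringsAsFactors = FALSE
)

# Self-join: every pair of rows whose texts are within max_dist edits.
# Columns shared by both sides get the suffixes .x and .y.
pairs <- stringdist_inner_join(d_small, d_small, by = "text",
                               method = "dl", max_dist = 10)

# IDs of rows that have a similar row earlier in the data.
later_copies <- pairs %>%
  filter(ID.x > ID.y) %>%
  distinct(ID.x) %>%
  pull(ID.x)

# Keep only the first occurrence of each group of similar texts.
d2 <- d_small %>%
  filter(!ID %in% later_copies)
print(d2$ID)
```

Note that this drops a row whenever any earlier row is similar to it, even if that earlier row is itself dropped, so long chains of gradually changing texts collapse to their first member.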

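【Comments】:

Thank you very much. I will look into this package and post some code here if I find a solution.
The fuzzyjoin package is just a "pipe-aware" application of the 'stringdist' package.
@IRTFM Actually, I just came across this question on the RStudio website, community.rstudio.com/t/…, which uses the stringdist package to answer a similar question. If you are familiar with the package, could you provide a code example that solves the problem I describe in the question? Thanks.
I have used stringdist to create a distance matrix for the non-duplicated rows, but I do not see an elegant solution. I can see a clumsy one, so I will post the code I have so far and maybe that will help.
@IRTFM Yes, I agree that fuzzyjoin uses 'stringdist' as the back end to compute these distances, but I think 'fuzzyjoin' may be easier to use in the OP's case. Personally I cannot see a good way to achieve what the OP wants, and I guess 'fuzzyjoin', with all its join functions, may be a better choice than using 'stringdist' directly?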