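【Posted】: 2021-02-14 00:06:20
【Question】:
I have a large number of Word files imported into R as text (each report in one cell), and each subject has an ID.
I then use the distinct function from dplyr to remove the duplicates.
However, some reports are essentially the same but have minor differences (e.g. a few more or fewer words, extra spaces), so dplyr does not treat them as duplicates. Is there an efficient way to remove "highly similar" items in R?
This creates an example dataset (greatly simplified compared with the original data I am working with):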
d = structure(list(ID = 1:8, text = c("The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"all plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
)), class = "data.frame", row.names = c(NA, -8L))
Here is the dplyr code that removes exact duplicates. However, you will notice that items 3, 7 and 8 are almost identical.
library(dplyr)
d %>%
  distinct(text, .keep_all = TRUE) %>%
  View()
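It looks like dplyr has a like function, but I could not find how to apply it here exactly (it also seems to work only for short strings, such as single words): dplyr filter() with SQL-like %wildcard%
There is also a package, tidystringdist, that can compute how similar two strings are, but I could not find a way to apply it here to remove items that are similar but not identical.
https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html
Any suggestions or guidance?
Update:
It looks like the stringdist package can solve this, as suggested by a user below.
This question from the RStudio community site deals with a similar problem, although the desired output is a bit different. I applied their code to my data and was able to identify the similar texts. https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2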
library(tidystringdist)
library(tidyverse)
# First remove exact duplicates (no View() at the end: its NULL return value would overwrite d):
d <- d %>%
  distinct(text, .keep_all = TRUE)
# Identify the similar texts and collect them in a vector called match:
match <- d %>%
  tidy_comb_all(text) %>%
  tidy_stringdist() %>%
  filter(soundex == 0) %>% # Set a threshold
  gather(x, match, starts_with("V")) %>%
  .$match
# Create the negation of %in%:
`%!in%` <- Negate(`%in%`)
# Remove every text that appears in match from d:
d2 <- d %>%
  filter(text %!in% match) %>%
  arrange(text)
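With the code above, d2 contains none of the duplicate/similar texts at all, but I want to keep one copy.
Any ideas on how to keep one copy (e.g. only the first occurrence)?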
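One possible way to keep only the first occurrence is to group near-identical texts and then keep the first row of each group. The sketch below is not the tidystringdist pipeline above: it swaps in base R's adist (Levenshtein distance) with single-linkage clustering so that it runs without extra packages. The toy data frame `reports`, the length scaling and the 0.2 cutoff are all assumptions made for illustration.

```r
# Hypothetical toy data: rows 1 and 3 differ by one word, row 4 differs from row 1 only in case.
reports <- data.frame(
  ID = 1:4,
  text = c(
    "All plastics are polymers but not all polymers are plastic.",
    "Plastics are usually poor conductors of heat and electricity.",
    "All plastics are polymers however not all polymers are plastic.",
    "all plastics are polymers but not all polymers are plastic."
  ),
  stringsAsFactors = FALSE
)

# Pairwise Levenshtein distance, scaled by the length of the longer string,
# so 0 means identical and values near 1 mean very different.
raw <- utils::adist(reports$text)
len <- nchar(reports$text)
norm <- raw / outer(len, len, pmax)

# Single-linkage clustering: texts linked by distances below the cutoff share a group.
cutoff <- 0.2  # assumed threshold; tune it on the real reports
groups <- cutree(hclust(as.dist(norm), method = "single"), h = cutoff)

# Keep the first row of each group, which preserves one copy per set of near-duplicates.
deduped <- reports[!duplicated(groups), ]
deduped$ID
```

On this toy data the rows with ID 1 and 2 remain. Note that adist builds a full pairwise matrix, so for a large number of long reports this may be slow; stringdist::stringdistmatrix could probably be substituted for the distance step.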
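【Comments】:
-
Depending on how the strings differ, there are a few different strategies you could use. I suggest giving us a reproducible example (a set of strings, dput, or even a reprex). You could also convert the strings to lists and check the overlap between them. Whitespace is probably easy to handle with stringr::str_replace_all(your_string, "\\s+", " ").
-
@AurelianoGuedes Thanks for your reply. The reprex above should serve as a basic example. I can remove extra spaces and paragraph breaks, convert to lowercase and so on, and as you mention that will help in part. The problem is that the extra words vary too much from string to string to write one regular expression that catches them all. I have searched and worked on writing a function to do this, but I wanted to ask here first in case a general-purpose function already exists that I am not aware of.
-
Here is a related question, but it does not give much detail (at least for an average user): stackoverflow.com/questions/6683380/…
-
@Bahi8482 That question is 8 years old. Admittedly, it was asked by a prominent member of the R answerer community, but it has no up-to-date answers about currently available R packages.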
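The whitespace and case normalisation discussed in the comments above can be sketched as follows. This is a minimal illustration in base R on a hypothetical data frame `reports`. It only catches differences in spacing and case, so texts with changed words still need a fuzzy comparison.

```r
# Hypothetical toy reports: row 2 differs from row 1 only in spacing and case,
# row 3 has one word changed.
reports <- data.frame(
  ID = 1:3,
  text = c(
    "Plastics are usually poor conductors of heat.",
    "plastics  are usually poor   conductors of heat. ",
    "Plastics are typically poor conductors of heat."
  ),
  stringsAsFactors = FALSE
)

# Build a comparison key: lower-case, collapse runs of whitespace, trim both ends.
key <- trimws(gsub("\\s+", " ", tolower(reports$text)))

# Keep the first row for each distinct key; the original text column is left untouched.
kept <- reports[!duplicated(key), ]
kept$ID
```

Here row 2 is dropped as a duplicate of row 1, while row 3 survives because a word differs.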
Tags: r filter dplyr duplicates similarity