【Question title】: Remove duplicate tweets that are 90% similar
【Posted】: 2020-06-07 01:38:41
【Question description】:

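I have extracted tweets and I want to remove the duplicates. If I use pandas drop_duplicates(inplace=True), it only removes tweets that are 100% identical. Is there a way to remove tweets that differ slightly from each other but are 90% similar?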

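Example
When will this year end? Only misery and bad stuff! I hate 2020!
When will this year end? Only misery and bad things! I hate 2020!

These tweets are almost identical. How can I remove them?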

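【Discussion】:

This should help. Once you manage to get a similarity value (between 0 and 1), add a condition that checks for values above 0.9 and drop those tweets.
Would this help? stackoverflow.com/questions/62106645/…

Tags: python pandas nlp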

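【Solution 1】:

There is no simple answer to your question, but a couple of naive solutions could look like the following.

Method 1

First, you need to define a similarity metric. A common (character-based) string comparison metric is Levenshtein distance, but I suggest looking at the fuzzywuzzy README to find a metric that suits your use case. For this micro-demo I use python-levenshtein rather than the fuzzywuzzy package.
Second, compare every tweet against every other tweet and compute the string similarity between them. Note that this is completely impractical if you are dealing with a large number of tweets, but let's go with it. After comparing the strings, you can filter to get the indexes of the other matching strings.
Using these indexes, we can build a graph of the strings, for which I use the networkx package. This is necessary so that we can extract the connected components of the graph, where each connected component represents a network of similar strings. This is not necessarily correct, because in a deep graph the string at one end need not be very similar to the string at the other end. In practice, though, it turns out to work well.

Setup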

import itertools
import random

import Levenshtein
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    "tweet":["When will this year end? There is only misery and bad stuff! I hate 2020!", 
             "When will this year end? There are miseries and bad stuff only! I hate 2020!", 
             "I am a tweet with no obvious duplicates", 
             "Tweeeeeet!", 
             "Tweeet", 
             "Tweet tweet!"]
})

Logic

def compare(tweet1, threshold=0.7):
    # compare tweets using Levenshtein distance (or whatever string comparison metric) 
    matches = df['tweet'].apply(lambda tweet2: (Levenshtein.ratio(tweet1, tweet2) >= threshold))

    # get positive matches
    matches = matches[matches].index.tolist()

    # chain consecutive matches into edges, e.g. [0, 1, 4] -> [(0, 1), (1, 4)]
    return list(zip(matches[:-1], matches[1:]))

# create graph objects
nodes = df.index.tolist()
edges = [*itertools.chain(*df["tweet"].apply(compare))]

# create graphs
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)

# get connected component indexes
grouped_indexes = [*nx.connected_components(G)]

# pick one index at random from each group of similar tweets
filtered_indexes = [random.choice(list(group)) for group in grouped_indexes]

df.loc[filtered_indexes]

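Output

A filtered subset of the original tweet DataFrame. Because one index is picked at random from each group, the exact rows can differ between runs.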

    tweet
0   When will this year end? There is only misery ...
2   I am a tweet with no obvious duplicates
5   Tweet tweet!
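
The caveat about connected components (strings at opposite ends of a component need not be similar) can be made concrete with a small sketch. It is not the code from this answer: it assumes the standard-library difflib.SequenceMatcher as a stand-in for Levenshtein.ratio (the two can return different values), and connected_groups, the sample strings, and the 0.85 threshold are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a, b):
    # Similarity in [0, 1]; a stand-in for Levenshtein.ratio (values can differ).
    return SequenceMatcher(None, a, b).ratio()


def connected_groups(strings, threshold):
    # Union-find over indexes: link every pair whose ratio meets the threshold.
    parent = list(range(len(strings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # shorten the path while walking up
            i = parent[i]
        return i

    for i, j in combinations(range(len(strings)), 2):
        if ratio(strings[i], strings[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(strings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Each string differs from its neighbour by a single character.
chain = ["abcdefghij", "abcdefghiX", "abcdefghYX", "abcdefgZYX"]
groups = connected_groups(chain, threshold=0.85)
end_to_end = ratio(chain[0], chain[-1])

print(groups)      # all four strings end up in one group
print(end_to_end)  # yet the two ends score below the threshold
```

If this chaining is a problem for your data, one option is to compare every candidate against a single representative of each group instead of linking neighbours.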

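Method 2

We can use an unsupervised learning algorithm to cluster the strings together. k-means, for example, is the bread and butter of unsupervised algorithms. Its drawback is that you have to know the optimal number of clusters in advance or, more commonly, work it out by testing. But it has a huge advantage: if you add more tweets to your dataset, you can quickly apply your clustering model and find similar tweets.

There are a million and one tutorials on how to do this, but the basic approach is to (1) clean the text, (2) convert the text to TF-IDF, (3) compute a similarity metric (cosine similarity is a common one) between each pair of documents, and (4) then train your k-means (or similar) model.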
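
Steps (1) to (3) can be sketched with the standard library alone. This is a minimal illustration, not a tuned implementation: the helper names (tokenize, tfidf_vectors, cosine), the regex-based cleaning, and the smoothed IDF formula are assumptions made for this example. Step (4), the clustering itself, is left to a library such as scikit-learn and is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Step 1 (cleaning): lowercase and keep alphanumeric word tokens only.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    # Step 2: term frequency times a smoothed inverse document frequency.
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    # Step 3: cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


tweets = [
    "When will this year end? There is only misery and bad stuff! I hate 2020!",
    "When will this year end? There are miseries and bad stuff only! I hate 2020!",
    "I am a tweet with no obvious duplicates",
]
vectors = tfidf_vectors(tweets)
sim_duplicates = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])

print(sim_duplicates)  # the near-duplicate pair
print(sim_unrelated)   # the unrelated pair, which scores lower
```

Because this metric works on words rather than characters, it ignores word order, so a threshold that suits a Levenshtein-style ratio will not necessarily suit it.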

If you are interested in this approach, here are some random tutorials I found after a quick Google search.

Hope this helps!

【Discussion】:

One thing to keep in mind is that the first approach is GPL-licensed.
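
If the license is a concern, one possible alternative is difflib.SequenceMatcher from Python's standard library, whose ratio() also returns a value between 0 and 1. The sketch below is an assumption-laden illustration rather than a drop-in replacement: drop_near_duplicates is a hypothetical helper, it uses a simple greedy pass instead of the graph approach, and difflib's ratio is not the same metric as Levenshtein.ratio, so the threshold needs tuning.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(texts, threshold=0.9):
    # Greedy filter: keep a text only if it scores below the threshold
    # against every text kept so far. This costs O(n^2) comparisons.
    kept = []
    for text in texts:
        if all(SequenceMatcher(None, text, k).ratio() < threshold for k in kept):
            kept.append(text)
    return kept


tweets = [
    "When will this year end? There is only misery and bad stuff! I hate 2020!",
    "When will this year end? There are miseries and bad stuff only! I hate 2020!",
    "I am a tweet with no obvious duplicates",
]
# 0.7 mirrors the threshold used in the answer above; adjust it for your data.
print(drop_near_duplicates(tweets, threshold=0.7))
```

With a pandas DataFrame, the result could be fed back with something like df[df["tweet"].isin(kept)], assuming the column is named "tweet" as in the answer.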

【Solution 2】:

You can use cosine similarity between two tweets:

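from nltk.tokenize import word_tokenize  # needs the NLTK "punkt" tokenizer data

# the two tweets to compare
tweet1 = "When will this year end? There is only misery and bad stuff! I hate 2020!"
tweet2 = "When will this year end? There are miseries and bad stuff only! I hate 2020!"

X = tweet1
Y = tweet2

# tokenize, then convert to sets so that union() and membership tests work
X_set = set(word_tokenize(X))
Y_set = set(word_tokenize(Y))

l1 = []
l2 = []

# form a set containing the keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    # build a binary presence vector for each tweet
    l1.append(1 if w in X_set else 0)
    l2.append(1 if w in Y_set else 0)

# cosine formula: dot product divided by the product of the vector norms
c = 0
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity: ", cosine)
if cosine >= 0.90:
    print("Similar")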

【Discussion】: