The process of removing duplicates is taking too long (answers)

【Question title】: The process of removing duplicates taking too long
【Posted】: 2018-10-09 17:05:32
【Question】:

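I have a very large CSV file with roughly 70,000 tweets, and it contains duplicates that I have to remove. The file has three columns (ID, Creation_Date, Text).

An example of the CSV file is given below: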

       ID                          Date                                  Text
"745828866334269441"     "Thu Jun 23 04:05:33 +0000 2017"              "Any TEXT"
"745828863334269434"     "Thu Jun 23 04:06:33 +0000 2017"              "Any TEXT"
"745828343334269425"     "Thu Jun 23 04:07:33 +0000 2017"              "Any TEXT"  
      ................ and so on

I am using SequenceMatcher from difflib in Python. The script works correctly. Here it is:

import csv
from difflib import SequenceMatcher

csvInputFile=open('inputFileWithDups.csv', 'r', encoding="utf-8", newline='') # Input file name with duplicates
csvOutputFile=open('outputFileWithoutDups.csv', 'w', encoding="utf-8", newline='') # Output file name without duplicates

csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile, delimiter=',',quotechar='"', quoting=csv.QUOTE_ALL)
cleanData = set() # an empty set that will be used to compare and then store the clean tweets without duplicates

for row in csvReader: # reading the inputfile 
   add=True 
   a=row[2] # our third csv column with tweets text that we have to compare for duplicates
   for cleantweet in cleanData:# reading the cleanData set to compare tweet texts.
        f= SequenceMatcher(None,cleantweet,a).ratio() #cleantweet vs row[2] which is text  
        if f > 0.73:
            print(f)
            add=False

   if add: # This will add all the tweets that have a similarity lower than 0.73 (here 1.0 means a 100 percent similarity)
       cleanData.add(row[2])
       csvWriter.writerow(row) # adding all the tweets without duplicates into the new csv file.
csvOutputFile.close()
csvInputFile.close()

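However, on a PC with only 4 GB of RAM it takes far too long. For example, a file with only 5,000 tweets takes almost 7 hours to process. The next file I have to compare contains 50,000 tweets, which probably means 3 days of work :(
I would be very grateful if someone could help me speed this up.
Thanks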
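
A rough scaling estimate for the loop in the question, offered as a sketch rather than a measurement. It assumes the worst case, where almost no tweet is rejected, so every new tweet is compared against all previously kept ones, and it assumes the cost per comparison stays constant. The helper name `worst_case_comparisons` is made up for this illustration. Under those assumptions the work grows quadratically, so ten times more tweets means roughly a hundred times more comparisons.

```python
def worst_case_comparisons(n):
    """Number of SequenceMatcher calls if no tweet is ever rejected (assumed worst case)."""
    return n * (n - 1) // 2


small = worst_case_comparisons(5_000)    # file size reported in the question
large = worst_case_comparisons(50_000)   # next file mentioned in the question
scale = large / small

# 7 hours is the figure the question reports for 5,000 tweets.
estimated_hours = 7 * scale
print(f"comparisons: {small:,} -> {large:,} (x{scale:.1f})")
print(f"estimated run time: {estimated_hours:.0f} hours, about {estimated_hours / 24:.0f} days")
```

If many tweets are rejected as duplicates, the set of kept tweets grows more slowly and the real run time would be lower than this estimate.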

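【Comments】:

Why not pull them into a database and let it deal with the problem?
I don't know what SequenceMatcher does. What is your "cleaning criterion"? Some kind of unique tweet ID? Or do you compare the tweet texts and discard those that match by more than 73%? What does your CSV file look like? Post a few lines in your question (edit it and format them as code). If you are doing semantic analysis on the tweet text, that will take time; it is a hard topic with a lot of computation and data storage behind it...
Yes, I am comparing the tweet texts (column index 2) against a threshold (0.73, where 1 means 100% similarity). If a tweet's similarity is above 0.73 it is a duplicate and must be removed; the other tweets are written to the clean data set.
@IgnacioVazquez-Abrams: Yes, that is a good suggestion, one my supervisor also made, and I will do it later, because there are too many files to load into a database and that takes time. For now, though, I have to find a solution for this one.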

Tags: python python-3.x csv twitter nlp

【Solution 1】:

On a Linux system, you can remove duplicate lines from a text file with the following command:

awk '!seen[$0]++' duplicate.csv > uniqlines.csv

On a file with 3,700,000 lines it took 49 seconds. My computer has 16 GB of RAM, and memory usage never went above 4.3 GB, starting from 4.1 GB.
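
For readers who want the same behaviour from Python, here is a minimal sketch of what the awk one-liner does: keep a line the first time it is seen and skip it afterwards. The function name `unique_lines` and the sample data are made up for this illustration. Like the awk command, it only removes lines that are identical in full, so two rows with the same tweet text but different IDs are both kept.

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving the original order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


# Small in-memory sample standing in for the lines of a file.
sample = ["a,1\n", "b,2\n", "a,1\n", "c,3\n", "b,2\n"]
deduped = list(unique_lines(sample))
print(deduped)

# With real files (names taken from the awk command above), one could write:
# with open("duplicate.csv", encoding="utf-8") as src, \
#         open("uniqlines.csv", "w", encoding="utf-8") as dst:
#     dst.writelines(unique_lines(src))
```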

【Comments】:

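Actually, I have to remove duplicates based on a certain percentage of similarity (as in the code I provided), so I have to stick with the code mentioned above. For example:
The following code works similarly to your example: for row in csvReader: #print(row[2]) if row[2] in cleanData or re.sub('^RT @.* : ', '', row[2]) in cleanData: continue cleanData.append(row[2]) csvWriter.writerow(row) This code removes the duplicates in just a few minutes. I also have to remove partial duplicates.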
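
One way to combine the two ideas in this thread, as a sketch and not a tested solution for the asker's data: do the cheap exact and retweet-prefix checks first, and only then run the fuzzy comparison. The fuzzy step below uses three things difflib documents: `set_seq2()` caches information about the sequence that is reused, while `real_quick_ratio()` and `quick_ratio()` are cheap upper bounds on `ratio()`, so most non-matching pairs can be rejected without the expensive call. The function names, the `text_index` parameter, and the sample rows are made up for this illustration; the 0.73 threshold and the regular expression come from the posts above.

```python
import re
from difflib import SequenceMatcher

RETWEET_PREFIX = re.compile(r'^RT @.* : ')  # pattern taken from the comment above


def is_near_duplicate(text, kept_texts, threshold=0.73):
    """Return True if text is more than `threshold` similar to any kept text."""
    matcher = SequenceMatcher(None)
    matcher.set_seq2(text)  # SequenceMatcher caches details about the second sequence
    for kept in kept_texts:
        matcher.set_seq1(kept)
        # The two quick ratios are upper bounds on ratio(), so a pair that
        # fails either of them cannot pass the full (slow) comparison.
        if (matcher.real_quick_ratio() > threshold
                and matcher.quick_ratio() > threshold
                and matcher.ratio() > threshold):
            return True  # stop at the first match instead of scanning everything
    return False


def dedupe_rows(rows, text_index=2, threshold=0.73):
    """Keep the first row of each group of exact or near-duplicate texts."""
    seen_exact = set()   # fast path for identical texts
    kept_texts = []      # texts that survive, used for fuzzy comparison
    result = []
    for row in rows:
        text = row[text_index]
        stripped = RETWEET_PREFIX.sub('', text)
        if text in seen_exact or stripped in seen_exact:
            continue
        if is_near_duplicate(text, kept_texts, threshold):
            continue
        seen_exact.add(text)
        kept_texts.append(text)
        result.append(row)
    return result


# Hypothetical sample rows in the (ID, Date, Text) layout from the question.
rows = [
    ["1", "d", "Python makes CSV processing easy"],
    ["2", "d", "Python makes CSV processing easy"],
    ["3", "d", "RT @someone : Python makes CSV processing easy"],
    ["4", "d", "Python makes CSV processing easy!!"],
    ["5", "d", "A completely different message about awk"],
]
kept = dedupe_rows(rows)
print([row[0] for row in kept])
```

This is still quadratic in the worst case; it only lowers the cost of each comparison and stops early once a match is found. The rows returned by `dedupe_rows` can be passed to `csvWriter.writerows()` as in the original script.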