【Question title】: Almost the same duplicates but only different in length
【Posted】: 2018-10-31 01:58:51
【Question description】:

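I want to remove near-identical duplicates and keep only the longest one. My plan is to first compare the first word, or the first few words, to narrow down the candidates for comparison, and then compare the lengths of the remaining candidates. If a line is the longest, I write it to a new text file. Here is the test file: https://drive.google.com/file/d/1tdewlNtIqBMaldgrUr02kbCKDyndXbSQ/view?usp=sharing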

Input

I am Harry.
I am Harry. I like 
I am Harry. I like to eat apple.
I am Garry.
I am Garry. I am Hap
I am Garry. I am Happy.

Output

I am Harry. I like to eat apple.
I am Garry. I am Happy.

I am doing this in Python, but it just doesn't work.

Code

f1 = open('a.txt','r') # Read from file
ListofLine = f1.readlines() # Read the line into list
f2 = open('n.txt','w') # Open new file to write

# Iterate all the sentences to compare
for x in len(ListofLine):
    # Comparing first word of the sentences
    if(ListofLine[x].split()[0] = ListofLine[x+1].split()[0]):
        # Comparing the length and keep the longest length sentences
        if(len(ListofLine[x])>len(ListofLine[x+1])):
            f2.write(ListofLine[x])

f1.close()   
f2.close()

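【Question discussion】:

Tags: python text duplicates


【Solution 1】:

You need to define a criterion for identifying what you call the common part. It could be the first sentence, for example "I am Harry.".

To parse a sentence, you can use a regular expression, for example: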

import re


# match a sentence finishing by a dot
re_sentence = r'((?:(?!\.|$).)+\.?)\s*'
find_all_sentences = re.compile(re_sentence, flags=re.DOTALL).findall

Here find_all_sentences is a function: the bound findall method of the pattern returned by re.compile. It is a helper that finds all the sentences in a line.
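As a quick illustration of what this helper returns, here is a minimal sketch that applies it to one of the sample lines from the question. The variable names `sentences` and `no_sentences` are only illustrative. Note that an empty line yields an empty list, so indexing `[0]` on it would raise an IndexError.

```python
import re

# Same pattern as above: a run of non-dot characters, optionally closed by a dot.
re_sentence = r'((?:(?!\.|$).)+\.?)\s*'
find_all_sentences = re.compile(re_sentence, flags=re.DOTALL).findall

sentences = find_all_sentences("I am Harry. I like to eat apple.")
print(sentences)     # ['I am Harry.', 'I like to eat apple.']
print(sentences[0])  # I am Harry.

# An empty line contains no sentence at all.
no_sentences = find_all_sentences("")
print(no_sentences)  # []
```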

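Once this function is defined, you can use it to parse each line and extract the first sentence, which is treated as the common part to check.

Each time you match a sentence, you can store its line in a dict keyed by that first sentence (here I use an OrderedDict to preserve the order of the lines). If you later find a longer line, you replace the existing one with it: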

import collections

lines = [
    "I am Harry. I like to eat apple",
    "I am Harry.",
    "I am Garry.",
    "I am Garry. I am Happy."]

longuest = collections.OrderedDict()
for line in lines:
    sentences = find_all_sentences(line)
    first = sentences[0]
    if first in longuest:
        longuest[first] = max([longuest[first], line], key=lambda l: len(l))
    else:
        longuest[first] = line

Finally, you can serialize the result to a file, or simply print it:

for line in longuest.values():
    print(line)

To write to a file, use a with statement:

import io


out_path = 'path/to/sentences.txt'

with io.open(out_path, mode='w', encoding='utf-8') as f:
    for line in longuest.values():
        print(line, file=f)
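The pieces above can be combined into one function. The following is only a sketch under a few assumptions that are not part of the answer: the function name `longest_by_first_sentence` is hypothetical, surrounding whitespace (including the trailing newline that readlines() leaves on each line) is stripped, and lines in which no sentence is found are skipped.

```python
import re
from collections import OrderedDict

find_all_sentences = re.compile(r'((?:(?!\.|$).)+\.?)\s*', flags=re.DOTALL).findall


def longest_by_first_sentence(lines):
    """Return the longest line for each distinct first sentence (hypothetical helper)."""
    longest = OrderedDict()
    for raw in lines:
        line = raw.strip()  # assumption: surrounding whitespace is not significant
        sentences = find_all_sentences(line)
        if not sentences:
            continue  # assumption: lines without any sentence are ignored
        first = sentences[0]
        # Keep the line only if it is the first or the longest seen for this key.
        if first not in longest or len(line) > len(longest[first]):
            longest[first] = line
    return list(longest.values())


sample = [
    "I am Harry.\n",
    "I am Harry. I like \n",
    "I am Harry. I like to eat apple.\n",
    "I am Garry.\n",
    "I am Garry. I am Hap\n",
    "I am Garry. I am Happy.\n",
    "\n",
]
print(longest_by_first_sentence(sample))
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
```

To apply it to the files from the question, the open file object for a.txt can be passed as `lines`, and each returned line written to n.txt.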

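【Discussion】:

    【Solution 2】:

    With minimal effort:

    The trick is to avoid computing the full length of each new string (or line) and instead use startswith() to test whether an already-kept line is a prefix of it. As soon as a new line extends a kept line, even by one character, the kept line is replaced and the search can stop.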

    ListofLine = ["I am Harry.",
                  "I am Harry. I like to eat apple.",
                  "I am Garry.",
                  "I am Garry. I am Happy."]
    longest = []  # holds the longest line seen so far for each prefix

    for line in ListofLine:  # ListofLine are basically the input lines
        found = False
        for k in longest:
            if line.startswith(k):
                longest.remove(k)     # drop the shorter line
                longest.append(line)  # keep the longer one instead
                found = True
                break  # stop right after mutating the list being iterated
        if not found:
            longest.append(line)
    for item in longest:
        print(item)
    

    In the end, the list holds the longest lines.

    https://www.jdoodle.com/embed/v0/vIG

    Prints:

    I am Harry. I like to eat apple.
    I am Garry. I am Happy.
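One limitation of the loop above is that it relies on each shorter line appearing before its longer version. The following sketch is a variation, not the answerer's code, that also handles a longer line arriving first. The function name `drop_prefixes` is hypothetical.

```python
def drop_prefixes(lines):
    """Keep only lines that are not a prefix of another line (hypothetical helper)."""
    kept = []
    for line in lines:
        # Skip the line if a kept line already starts with it (or equals it).
        if any(k.startswith(line) for k in kept):
            continue
        # Otherwise drop every kept line that the new line extends.
        kept = [k for k in kept if not line.startswith(k)]
        kept.append(line)
    return kept


shuffled = [
    "I am Harry. I like to eat apple.",
    "I am Harry.",
    "I am Garry.",
    "I am Garry. I am Happy.",
]
print(drop_prefixes(shuffled))
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
```

This compares every new line against every kept line, so it is quadratic in the worst case. For large files, sorting the lines first so that each prefix sits directly before its extensions may be a better fit.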
    

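    【Discussion】:

      【Solution 3】:

      If you can define a function that maps each line to a distinct class, you can use itertools.groupby.

      For example, suppose two strings count as similar when they share the same first 10 characters.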

      data = """I am Harry.
      I am Harry. I like
      I am Harry. I like to eat apple.
      I am Garry.
      I am Garry. I am Hap
      I am Garry. I am Happy.""".split('\n')
      
      from itertools import groupby
      criterion = lambda s: s[:10]
      
      result = [max(g[1], key=len) for g in groupby(data, criterion)]
      # ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
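Keep in mind that itertools.groupby only groups consecutive elements with the same key, so the approach above assumes that similar lines are adjacent in the input. When they are not, a plain dict keyed by the same criterion is one alternative. This is a sketch, not part of the answer: the names `longest_per_prefix` and `prefix_len` are hypothetical, and the output order relies on dicts preserving insertion order (Python 3.7+).

```python
def longest_per_prefix(lines, prefix_len=10):
    """Return the longest line per prefix, in order of first appearance (hypothetical helper)."""
    longest = {}
    for line in lines:
        key = line[:prefix_len]  # same criterion as above: the first 10 characters
        if key not in longest or len(line) > len(longest[key]):
            longest[key] = line
    return list(longest.values())


interleaved = [
    "I am Harry.",
    "I am Garry.",
    "I am Harry. I like to eat apple.",
    "I am Garry. I am Happy.",
    "I am Harry. I like",
]
print(longest_per_prefix(interleaved))
# ['I am Harry. I like to eat apple.', 'I am Garry. I am Happy.']
```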
      

      【Discussion】:
