【Question title】: Removing elements alike in a list of strings
【Posted】: 2017-04-20 20:39:32
【Question】:

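This is my first time asking a question here and I'm new to this, so I'll do my best. I have a list of phrases and I want to eliminate all the phrases that are similar to one another, for example: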

array = ["A very long string saying some things", 
         "Another long string saying some things", 
         "extremely large string saying some things", 
         "something different", 
         "this is a test"]

I want this result:

array2 = ["A very long string saying some things", 
          "something different", 
          "this is a test"]

This is what I have:

for i in range(len(array)):
    swich=True
    for j in range(len(array2)):
        if (fuzz.ratio(array[i],array2[j]) >= 80) and (swich == True):
            swich=False
            pass
        if (fuzz.ratio(array[i],array2[j]) >= 80) and (swich == False):
            array2.pop(j)

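But it gives me a list IndexError...

fuzz.ratio compares two strings and returns a value between 0 and 100; the higher the value, the more similar the strings.

What I'm trying to do is compare the lists element by element. The first time two similar strings are found, I just flip the switch and move on; from then on, for every further similar string found, I pop that element from array2. I'm completely open to any suggestions.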

【Comments】:

  • Please give the exact error traceback... which list has the index error?

Tags: python arrays string list fuzzing


【Solution 1】:

How about using a different library to condense the code and reduce the number of loops?

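import difflib

def remove_similar_words(word_list):
    # Note: entries are removed from word_list in place.
    for elem in word_list:
        # Close matches come back best first; elem always matches itself.
        first_pass = difflib.get_close_matches(elem, word_list)
        if len(first_pass) > 1:
            # Drop the weakest close match, then restart on the shorter list
            # instead of continuing to iterate over a list that just changed.
            word_list.remove(first_pass[-1])
            return remove_similar_words(word_list)
    return word_list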


l = ["A very long string saying some things", "Another long string saying some things", "extremely large string saying some things", "something different", "this is a test"]

remove_similar_words(l)

['A very long string saying some things',
 'something different',
 'this is a test']
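The similarity threshold in this answer is implicit: `difflib.get_close_matches` defaults to `n=3` (at most three matches) and `cutoff=0.6` (a 0 to 1 score, not the 0 to 100 scale of `fuzz.ratio`). A small sketch of how those two parameters change the result, using an illustrative word list rather than the asker's data:

```python
import difflib

words = ["ape", "apple", "peach", "puppy"]

# Defaults: up to 3 matches scoring at least 0.6, best match first.
default_matches = difflib.get_close_matches("appel", words)

# A stricter cutoff keeps only near-identical strings; here nothing qualifies.
strict_matches = difflib.get_close_matches("appel", words, n=1, cutoff=0.9)

print(default_matches)  # ['apple', 'ape']
print(strict_matches)   # []
```

Passing an explicit `cutoff` to the call inside `remove_similar_words` would be one way to tune how aggressively similar phrases are removed.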

【Comments】:

【Solution 2】:

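The error you get is caused by modifying the list you are iterating over. (Never add, remove, or replace elements of the iterable you are currently iterating over!) range(len(array2)) is evaluated once, while the list has N elements, but after array2.pop(j) the list has only N-1. When the loop later reaches the last index of the original range, that index no longer exists, so you get an IndexError.

A quick guess at another approach: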
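A minimal reproduction of that failure, independent of any fuzzy-matching library. The function name `pop_while_indexing` and the sample values are made up for illustration:

```python
def pop_while_indexing(items):
    """Pop matching items while looping over a range computed up front."""
    items = list(items)  # work on a copy so the caller's list is untouched
    try:
        for j in range(len(items)):  # the range is fixed at the original length
            if items[j] == "dup":
                items.pop(j)         # the list shrinks, the range does not
    except IndexError as exc:
        return f"IndexError: {exc}"
    return items

print(pop_while_indexing(["a", "dup", "b"]))
```

With three elements the range still yields index 2 after the pop, but the list then has only two elements, so the lookup fails.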

from fuzzywuzzy import fuzz  # assumption: the asker's fuzz.ratio comes from the fuzzywuzzy package

original = ["A very long string saying some things", "Another long string saying some things", "extremely large string saying some things", "something different", "this is a test"]

filtered = []

for original_string in original:
    # Keep a string only if it is not too similar to anything already kept.
    include = True
    for filtered_string in filtered:
        if fuzz.ratio(original_string, filtered_string) >= 80:
            include = False
            break
    if include:
        filtered.append(original_string)
    

Note the "for string in array" style of loop: it is more "pythonic" and needs no integer index variable or range.

【Comments】:
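The same filtering idea can be sketched with only the standard library. This assumes `difflib.SequenceMatcher` as a stand-in scorer scaled to 0-100; it is not guaranteed to give the same scores as `fuzz.ratio`, and the names `similarity` and `drop_similar` are hypothetical:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in for fuzz.ratio: a 0-100 score based on difflib (an assumption)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def drop_similar(strings, threshold=80):
    """Return a new list, keeping each string only if nothing already kept is too similar."""
    kept = []
    for candidate in strings:
        if all(similarity(candidate, k) < threshold for k in kept):
            kept.append(candidate)
    return kept

phrases = ["abcdefghij", "abcdefghik", "zzzz"]
print(drop_similar(phrases))                # ['abcdefghij', 'zzzz']
print(drop_similar(phrases, threshold=95))  # ['abcdefghij', 'abcdefghik', 'zzzz']
```

Because a new list is built instead of popping from the one being looped over, the IndexError cannot occur, and the input list is left unchanged.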
