How to get all the different parts of two arrays? (Answers)

【Question title】: How to get all different part in two array?
【Posted】: 2018-05-26 03:57:04
【Question】:

Suppose I have a sentence:

When Grazia Deledda submitted a short story to a fashion magazine at the age of 13

This sentence is split into two lists:

# list 1
[When] [Grazia Deledda] [submitted a short story] [to] [a] [fashion magazine] [at] [the age of] [13]
# list 2
[When] [Grazia Deledda] [submitted] [a short story] [to] [a fashion] [magazine at] [the age of] [13]

Now I want to get the parts where these two arrays differ. For this example the result should be:

[
    ([[submitted a short story]],[[submitted] [a short story]]),
    ([[a] [fashion magazine] [at]], [[a fashion] [magazine at]])
]

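So the result should satisfy these requirements:

Each pair should have the same content, e.g. [[submitted a short story]] joins to 'submitted a short story', and [[submitted] [a short story]] also joins to 'submitted a short story'.
Each pair should have the same start and end positions, e.g. [[submitted a short story]] starts at 3 and ends at 6, and so does [[submitted] [a short story]].
Most importantly, each pair should be as short as possible, e.g. [[submitted a short story] [to]] and [[submitted] [a short story] [to]] also satisfy the first two requirements but are not the shortest.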
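
To make the position requirement concrete, here is a minimal sketch that assigns each chunk its inclusive word span. The helper name `chunk_spans` is hypothetical, and the chunks are assumed to be non-empty, whitespace-separated strings.

```python
def chunk_spans(chunks):
    """Return the inclusive (start, end) word positions of each chunk."""
    spans = []
    pos = 0  # index of the next unconsumed word in the sentence
    for chunk in chunks:
        n = len(chunk.split())
        spans.append((pos, pos + n - 1))
        pos += n
    return spans


list1 = ["When", "Grazia Deledda", "submitted a short story", "to", "a",
         "fashion magazine", "at", "the age of", "13"]
list2 = ["When", "Grazia Deledda", "submitted", "a short story", "to",
         "a fashion", "magazine at", "the age of", "13"]

print(chunk_spans(list1)[2])    # (3, 6)
print(chunk_spans(list2)[2:4])  # [(3, 3), (4, 6)]
```

Both sides of the first pair cover words 3 through 6, which is what the second requirement asks for.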

Is there any way to avoid O(n^2) complexity?

【Comments】:

Tags: arrays string list compare

【Solution 1】:

I may have overcomplicated this at first. The problem can be quite simple, and I came up with the following idea:

#!/usr/bin/env python
# encoding: utf-8

# list 1
llist = [["When"], ["Grazia", "Deledda"], ["submitted", "a", "short", "story"], ["to"], ["a"], ["fashion", "magazine"], ["at"], ["the", "age", "of",], ["13"],]
# list 2
rlist = [["When"], ["Grazia", "Deledda"], ["submitted"], ["a", "short", "story"], ["to"], ["a", "fashion"], ["magazine", "at"], ["the", "age", "of",], ["13"],]

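# Number of words consumed so far on each side. Both must start at 0
# so that the two lists are aligned before the first chunk.
loffset = 0
roffset = 0
rindex = 0
lstart = -1  # start of the current differing span in llist, -1 if none
rstart = -1
for lindex, litem in enumerate(llist):
    # At a shared boundary, differing chunks open a new span.
    if loffset == roffset and litem != rlist[rindex]:
        lstart = lindex
        rstart = rindex
    loffset += len(litem)
    # Advance the right side until it catches up with the left.
    while roffset < loffset:
        roffset += len(rlist[rindex])
        rindex += 1
    # Back at a shared boundary: emit the shortest differing span.
    if loffset == roffset and lstart >= 0:
        print(llist[lstart:lindex+1], rlist[rstart:rindex])
        lstart = -1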
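
The same single-pass idea can be wrapped in a function that returns the pairs instead of printing them. This is only a sketch: the name `diff_chunks` is hypothetical, both offsets are assumed to start at 0, and both lists are assumed to flatten to the same word sequence.

```python
def diff_chunks(left, right):
    """Return (left_chunks, right_chunks) pairs that cover the same words
    but are split differently. Each chunk is a list of words."""
    pairs = []
    loffset = roffset = 0    # words consumed so far on each side
    rindex = 0               # next unread chunk in `right`
    lstart = rstart = None   # start of the currently open differing span
    for lindex, litem in enumerate(left):
        # Aligned on a shared boundary, but the next chunks differ.
        if loffset == roffset and litem != right[rindex]:
            lstart, rstart = lindex, rindex
        loffset += len(litem)
        # Let the right side catch up with the left side.
        while roffset < loffset:
            roffset += len(right[rindex])
            rindex += 1
        # Aligned again: close the open span, if there is one.
        if loffset == roffset and lstart is not None:
            pairs.append((left[lstart:lindex + 1], right[rstart:rindex]))
            lstart = rstart = None
    return pairs


llist = [["When"], ["Grazia", "Deledda"], ["submitted", "a", "short", "story"],
         ["to"], ["a"], ["fashion", "magazine"], ["at"],
         ["the", "age", "of"], ["13"]]
rlist = [["When"], ["Grazia", "Deledda"], ["submitted"], ["a", "short", "story"],
         ["to"], ["a", "fashion"], ["magazine", "at"],
         ["the", "age", "of"], ["13"]]

for left_part, right_part in diff_chunks(llist, rlist):
    print(left_part, right_part)
```

Each chunk of `left` is visited once and `rindex` only moves forward, so the number of loop steps grows with len(left) + len(right) rather than with their product.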

【Comments】:

【Solution 2】:

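I tokenized all the words and stored them as sequences in a list of lists. I then compare the first list against the second by building up string buffers, and record a match when the strings are equal but the number of chunks differs. At the end I remove the index values that are duplicated between out1 and out2.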

from keras.preprocessing.text import Tokenizer 
tokenizer=Tokenizer()
# list 1
list1 = [["When"], ["Grazia Deledda"], ["submitted a short story"], ["to"], 
["a"], ["fashion magazine"], ["at"], ["the age of"], ["13"],["EOS"]]
# list 2
list2 = [["When"], ["Grazia Deledda"], ["submitted"], ["a short story"], ["to"], 
["a fashion"], ["magazine at"], ["the age of"], ["13"],["EOS"]]

tokenizer.fit_on_texts([" ".join(item) for item in list1])
tokenizer.fit_on_texts([" ".join(item) for item in list2])

seq1=[]
seq2=[]
for item1,item2 in zip(list1,list2):
     seq1.append(tokenizer.texts_to_sequences(item1))
     seq2.append(tokenizer.texts_to_sequences(item2))

out1=[]
out2=[]
out1_buffer=[]
out2_buffer=[]
current_index=0
string1=""
for seq1_index in range(len(seq1)-1):
    string1=""
    index=0
    out1_buffer=[]
    found=False
    # Accumulate seq1 chunks until a match is found or the end of the queue is reached (token id 16 maps to "EOS" in this vocabulary)

    while seq1[seq1_index+index][0] != [16] and found==False:
        out1_buffer.append(seq1_index+index)
        seq_string=" ".join([str(token) for token in seq1[seq1_index+index][0]])
        if string1=="":
             string1=seq_string
        else:
             string1+=" "+seq_string
        string2=""
        out2_buffer=[]
        for seq2_index in range(current_index,len(seq2)-1):
            seq_string=" ".join([str(token) for token in seq2[seq2_index][0]])
            if string2=="":
                 string2=seq_string
            else:
                 string2+=" "+seq_string
            out2_buffer.append(seq2_index)
            count_seq1=len(out1_buffer)
            count_seq2=len(out2_buffer)
            if string1==string2 and count_seq1!=count_seq2:  
                 print("string_a", [list1[int(index)] for index in out1_buffer])
                 print("string_b",[list2[int(index)] for index in out2_buffer])
                 current_index=seq2_index+1
                 print("match",count_seq1,count_seq2)
                 for index1 in out1_buffer:
                     out1.append(index1)
                 for index2 in out2_buffer:
                     out2.append(index2)
                 out1_buffer=[]                    
                 out2_buffer=[]
                 found=True
                 break
        index+=1

tuple1=[]
tuple2=[]


result1=[]
for item1 in out1:
    found=False

    for item2 in out2:
          if list1[item1]==list2[item2]:
              found=True
              break
    

    if found==True:
         out2 = list(filter(lambda item2: list1[item1]!=list2[item2],out2))
        
    if found==False:
        result1.append(item1)

for item1 in result1:            
     tuple1.append(list1[item1])

for item2 in out2:
     tuple2.append(list2[item2])


tuple1=tuple(tuple1)
tuple2=tuple(tuple2)

print("{}\n{}\n".format(tuple1,tuple2))

Output

(['submitted a short story'], ['a'], ['fashion magazine'], ['at'])
(['submitted'], ['a short story'], ['a fashion'], ['magazine at'])
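
For comparison, the grouped pairs asked for in the question can also be sketched without a tokenizer dependency. This is not the method used in this answer but a boundary-intersection alternative. The helper names `boundaries` and `differing_segments` are hypothetical, and the chunks are assumed to be non-empty, whitespace-separated strings that join to the same sentence.

```python
from itertools import accumulate


def boundaries(chunks):
    """Cumulative word counts at which each chunk ends."""
    return list(accumulate(len(chunk.split()) for chunk in chunks))


def differing_segments(a, b):
    """Pair up the shortest runs of chunks that cover the same words
    but are split differently in `a` and `b`."""
    ends_a, ends_b = boundaries(a), boundaries(b)
    # Map each boundary to the number of chunks consumed when it is reached.
    consumed_a = {end: i + 1 for i, end in enumerate(ends_a)}
    consumed_b = {end: i + 1 for i, end in enumerate(ends_b)}
    # Cut points present in both lists, kept in sentence order.
    shared = [end for end in ends_a if end in consumed_b]
    result = []
    prev_a = prev_b = 0
    for cut in shared:
        seg_a = a[prev_a:consumed_a[cut]]
        seg_b = b[prev_b:consumed_b[cut]]
        if seg_a != seg_b:
            result.append((seg_a, seg_b))
        prev_a, prev_b = consumed_a[cut], consumed_b[cut]
    return result


list1 = ["When", "Grazia Deledda", "submitted a short story", "to", "a",
         "fashion magazine", "at", "the age of", "13"]
list2 = ["When", "Grazia Deledda", "submitted", "a short story", "to",
         "a fashion", "magazine at", "the age of", "13"]

for seg1, seg2 in differing_segments(list1, list2):
    print(seg1, seg2)
```

Because segments are cut at every boundary the two lists share, each reported pair is as short as it can be, and no "EOS" sentinel is needed.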

【Comments】: