【Question title】: How to get all the differing parts of two arrays?
【Posted】: 2018-05-26 03:57:04
【Question】:

Suppose I have a sentence:

When Grazia Deledda submitted a short story to a fashion magazine at the age of 13

The sentence is segmented in two different ways, giving two lists:

# list 1
[When] [Grazia Deledda] [submitted a short story] [to] [a] [fashion magazine] [at] [the age of] [13]
# list 2
[When] [Grazia Deledda] [submitted] [a short story] [to] [a fashion] [magazine at] [the age of] [13]

Now I want to get the parts where the two lists differ. For this example the result should be:

[
    ([[submitted a short story]],[[submitted] [a short story]]),
    ([[a] [fashion magazine] [at]], [[a fashion] [magazine at]])
]

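So the result should satisfy these requirements:

  • Both halves of each pair must have the same content. For example, [[submitted a short story]] joins to 'submitted a short story', and [[submitted] [a short story]] also joins to 'submitted a short story'.

  • Both halves of each pair must have the same start and end positions. For example, [[submitted a short story]] starts at word 3 and ends at word 6, and so does [[submitted] [a short story]].

  • Most importantly, each pair must be as short as possible. For example, [[submitted a short story] [to]] and [[submitted] [a short story] [to]] also satisfy the first two requirements, but they are not the shortest.

Is there any way to avoid O(n^2) complexity?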
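
The second requirement can be made concrete by converting each segmentation into word offsets. The following is only a sketch: it assumes each segment is a space-separated string, counts words from 0 with an inclusive end, and the helper name `segment_spans` is hypothetical.

```python
def segment_spans(segments):
    """Return an inclusive (start, end) word offset for every segment."""
    spans = []
    offset = 0
    for seg in segments:
        n = len(seg.split())          # number of words in this segment
        spans.append((offset, offset + n - 1))
        offset += n
    return spans


list1 = ["When", "Grazia Deledda", "submitted a short story", "to", "a",
         "fashion magazine", "at", "the age of", "13"]
list2 = ["When", "Grazia Deledda", "submitted", "a short story", "to",
         "a fashion", "magazine at", "the age of", "13"]

spans1 = segment_spans(list1)
spans2 = segment_spans(list2)
print(spans1[2])      # (3, 6)
print(spans2[2:4])    # [(3, 3), (4, 6)]
```

Both sides of the first expected pair cover words 3 through 6, which is what "same start and end positions" means here.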

【Comments】:

    Tags: arrays string list compare


    【Solution 1】:

    I may have overcomplicated this at first. The problem can actually be quite simple, and I came up with a good idea:

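    #!/usr/bin/env python
    # encoding: utf-8
    
    # list 1
    llist = [["When"], ["Grazia", "Deledda"], ["submitted", "a", "short", "story"], ["to"], ["a"], ["fashion", "magazine"], ["at"], ["the", "age", "of"], ["13"]]
    # list 2
    rlist = [["When"], ["Grazia", "Deledda"], ["submitted"], ["a", "short", "story"], ["to"], ["a", "fashion"], ["magazine", "at"], ["the", "age", "of"], ["13"]]
    
    loffset = 0   # number of words consumed from llist so far
    roffset = 0   # number of words consumed from rlist so far
    rindex = 0    # index of the next unread segment in rlist
    lstart = -1   # first llist segment of the open differing run (-1: none open)
    rstart = -1   # first rlist segment of the open differing run
    for lindex, litem in enumerate(llist):
        # Both lists are aligned at this word offset; a differing segment opens a run.
        if loffset == roffset and litem != rlist[rindex]:
            lstart = lindex
            rstart = rindex
        loffset += len(litem)
        # Advance rlist until it has consumed at least as many words as llist.
        while roffset < loffset:
            roffset += len(rlist[rindex])
            rindex += 1
        # The lists are aligned again, so the open run is complete.
        if loffset == roffset and lstart >= 0:
            print(llist[lstart:lindex+1], rlist[rstart:rindex])
            lstart = -1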
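
The same two-pointer loop can be wrapped in a function that returns the pairs instead of printing them. This is a sketch rather than part of the original answer: the name `diff_segments` is hypothetical, the offsets both start at 0, and both lists are assumed to contain exactly the same words in the same order.

```python
def diff_segments(llist, rlist):
    """Return (left_run, right_run) pairs where the two segmentations differ."""
    pairs = []
    loffset = roffset = 0      # words consumed from each list
    rindex = 0                 # next unread segment in rlist
    lstart = rstart = -1       # start of the open differing run, -1 if none
    for lindex, litem in enumerate(llist):
        if loffset == roffset and litem != rlist[rindex]:
            lstart, rstart = lindex, rindex
        loffset += len(litem)
        while roffset < loffset:
            roffset += len(rlist[rindex])
            rindex += 1
        if loffset == roffset and lstart >= 0:
            pairs.append((llist[lstart:lindex + 1], rlist[rstart:rindex]))
            lstart = -1
    return pairs


llist = [["When"], ["Grazia", "Deledda"], ["submitted", "a", "short", "story"],
         ["to"], ["a"], ["fashion", "magazine"], ["at"],
         ["the", "age", "of"], ["13"]]
rlist = [["When"], ["Grazia", "Deledda"], ["submitted"], ["a", "short", "story"],
         ["to"], ["a", "fashion"], ["magazine", "at"],
         ["the", "age", "of"], ["13"]]

result = diff_segments(llist, rlist)
for left, right in result:
    print(left, right)
```

Each index only moves forward, so the number of segment steps is proportional to `len(llist) + len(rlist)`, which avoids the O(n^2) behaviour the question asks about.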
    

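    【Comments】:

      【Solution 2】:

      I tokenized all the words and stored them as sequences in a list of lists. I then compare the first list against the second by building up string buffers, and record a match when the accumulated strings are equal but the number of segments differs. Finally, I remove the index values in out1 and out2 that refer to identical segments.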

      from keras.preprocessing.text import Tokenizer 
      tokenizer=Tokenizer()
      # list 1
      list1 = [["When"], ["Grazia Deledda"], ["submitted a short story"], ["to"], 
      ["a"], ["fashion magazine"], ["at"], ["the age of"], ["13"],["EOS"]]
      # list 2
      list2 = [["When"], ["Grazia Deledda"], ["submitted"], ["a short story"], ["to"], 
      ["a fashion"], ["magazine at"], ["the age of"], ["13"],["EOS"]]
      
      tokenizer.fit_on_texts([" ".join(item) for item in list1])
      tokenizer.fit_on_texts([" ".join(item) for item in list2])
      
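      # Convert every segment to a sequence of token ids, one list per input.
      # Building each list separately avoids zip() silently truncating to the shorter input.
      seq1=[tokenizer.texts_to_sequences(item) for item in list1]
      seq2=[tokenizer.texts_to_sequences(item) for item in list2]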
      
      out1=[]
      out2=[]
      out1_buffer=[]
      out2_buffer=[]
      current_index=0
      string1=""
      for seq1_index in range(len(seq1)-1):
          string1=""
          index=0
          out1_buffer=[]
          found=False
          # Accumulate seq1 segments until a match is found or the end-of-sequence marker is reached (token id 16 maps to "eos" for this input)
      
          while seq1[seq1_index+index][0] != [16] and found==False:
              out1_buffer.append(seq1_index+index)
              seq_string=" ".join([str(token) for token in seq1[seq1_index+index][0]])
              if string1=="":
                   string1=seq_string
              else:
                   string1+=" "+seq_string
              string2=""
              out2_buffer=[]
              for seq2_index in range(current_index,len(seq2)-1):
                  seq_string=" ".join([str(token) for token in seq2[seq2_index][0]])
                  if string2=="":
                       string2=seq_string
                  else:
                       string2+=" "+seq_string
                  out2_buffer.append(seq2_index)
                  count_seq1=len(out1_buffer)
                  count_seq2=len(out2_buffer)
                  if string1==string2 and count_seq1!=count_seq2:  
                       print("string_a", [list1[int(index)] for index in out1_buffer])
                       print("string_b",[list2[int(index)] for index in out2_buffer])
                       current_index=seq2_index+1
                       print("match",count_seq1,count_seq2)
                       for index1 in out1_buffer:
                           out1.append(index1)
                       for index2 in out2_buffer:
                           out2.append(index2)
                       out1_buffer=[]                    
                       out2_buffer=[]
                       found=True
                       break
              index+=1
      
      tuple1=[]
      tuple2=[]
      
      
      result1=[]
      for item1 in out1:
          found=False
      
          for item2 in out2:
                if list1[item1]==list2[item2]:
                    found=True
                    break
          
      
          if found==True:
               out2 = list(filter(lambda item2: list1[item1]!=list2[item2],out2))
              
          if found==False:
              result1.append(item1)
      
      for item1 in result1:            
           tuple1.append(list1[item1])
      
      for item2 in out2:
           tuple2.append(list2[item2])
      
      
      tuple1=tuple(tuple1)
      tuple2=tuple(tuple2)
      
      print("{}\n{}\n".format(tuple1,tuple2))
      

      Output:

       (['submitted a short story'], ['a'], ['fashion magazine'], ['at'])
       (['submitted'], ['a short story'], ['a fashion'], ['magazine at'])
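
The output above flattens both differing regions into one tuple per list. If the paired grouping from the question is wanted, one alternative is to compare the word offsets at which segments end and cut both lists at the offsets they share. This is a sketch of a different technique, not the answer author's method: `boundary_diff` is a hypothetical name, segments are assumed to be non-empty space-separated strings, and both lists are assumed to contain the same words in the same order.

```python
from itertools import accumulate


def boundary_diff(list1, list2):
    """Pair the shortest runs of segments that cover the same words but are split differently."""
    # Cumulative word counts: the word offset at which each segment ends.
    ends1 = list(accumulate(len(s.split()) for s in list1))
    ends2 = list(accumulate(len(s.split()) for s in list2))
    # Offsets where both segmentations place a boundary.
    common = sorted(set(ends1) & set(ends2))

    pairs = []
    i = j = 0
    for end in common:
        # Search forward only, so each list is scanned once overall.
        i_next = ends1.index(end, i) + 1
        j_next = ends2.index(end, j) + 1
        left, right = list1[i:i_next], list2[j:j_next]
        if left != right:
            pairs.append((left, right))
        i, j = i_next, j_next
    return pairs


list1 = ["When", "Grazia Deledda", "submitted a short story", "to", "a",
         "fashion magazine", "at", "the age of", "13"]
list2 = ["When", "Grazia Deledda", "submitted", "a short story", "to",
         "a fashion", "magazine at", "the age of", "13"]

pairs = boundary_diff(list1, list2)
for left, right in pairs:
    print(left, right)
```

Because consecutive shared boundaries are the closest points at which both lists line up, every reported run is as short as it can be, which matches the third requirement in the question.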
      

      【Comments】:
