Removing extra words in a file using Python: answers

[Question title]: Removing extra words in a file using python
[Posted]: 2014-05-04 06:11:24
[Question]:

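Hello, I am learning Python and, out of curiosity, I wrote a program to remove the extra words in a file. I compare the text in the files 'text1.txt' and 'text2.txt' and, based on the text in text1, I remove the extra words from text2.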

#!/usr/bin/python
text1 = open('text1.txt','r')
text2 = open('text2.txt','r')

t_l1 = text1.readlines()
t_l2 = text2.readlines()

# printing to check if the file contents were read properly.
print ' Printing the file 1 contents:'
w_t1 = []
for i in range(len(t_l1)):
    # extend, not assign, so the words of every line are kept (as done for file 2)
    w_t1.extend(t_l1[i].split(' '))
for j in range(len(w_t1)):
    print w_t1[j]
#printing to see if the contents were read properly. 
print'File 2 contents:'
w_t2 = []
for i in range(len(t_l2)):
    w_t2.extend(t_l2[i].split(' '))
for j in range(len(w_t2)):
    print w_t2[j]


print 'comparing and deleting the excess variables.'

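# I put all words of file1 in list w_t1 and file2 in list w_t2. Now I am checking if
# each word in w_t1 is the same as the word in the same place of w_t2; if not, I delete
# that word from w_t2 and continue the while loop.
w = []  # collects the extra words found in file 2
i = 1
while (i<=len(w_t1) and i<=len(w_t2)):
    if(w_t1[i-1] == w_t2[i-1]):
        print w_t1[i-1]
        i += 1
    else:
        # pop by index: remove() would delete the first equal word, not this one
        w.append(str(w_t2.pop(i-1)))
# any words left in w_t2 past the end of w_t1 are extra as well
w.extend(w_t2[len(w_t1):])
del w_t2[len(w_t1):]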
print 'The extra words are: '+str(w) +'\n'
print w 
print 'The original words are: '+ str(w_t2) +'\n'
print 'The extra values are: '
for item in w:
    print item
# opening the file out.txt to write the output. 
out = open('out.txt', 'w')
out.write(str(w))

# I am closing the files
text1.close()
text2.close()
out.close()

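Say the text1.txt file contains the words "Happy birthday dear friend" and text2.txt contains "claps Happy birthday to you my dear Best friend".

The program should report the extra words in text2.txt: "claps, to, you, my, Best".

The program above works, but what if I have to do this for files containing millions of words or millions of lines? Checking every single word does not seem like a good idea. Is there a predefined Python function for this?

P.S.: Please bear with me if this is a bad question; I am learning Python. Soon I will stop asking questions like this.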

[Question discussion]:

Tags: python python-2.7

[Solution 1]:

This looks like a "set" problem. First, add your words to a set structure:

textSet1 = set()
with open('text1.txt','r') as text1:
   for line in text1:
      # split() without arguments also strips the trailing newline
      for word in line.split():
         textSet1.add(word)

textSet2 = set()
with open('text2.txt','r') as text2:
   for line in text2:
      for word in line.split():
         textSet2.add(word)
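
Since both loops do the same work, they could be folded into one helper. A minimal sketch, assuming a hypothetical function name `load_words` (it is not part of the original answer) and words separated by whitespace:

```python
def load_words(path):
    """Return the set of whitespace-separated words found in the file at `path`."""
    words = set()
    with open(path, 'r') as handle:
        for line in handle:
            # split() with no argument drops the trailing newline and
            # collapses repeated spaces, unlike split(' ').
            words.update(line.split())
    return words
```

With it, each set would be built by a single call such as `textSet1 = load_words('text1.txt')`.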

Then simply apply the set difference method:

textSet2.difference(textSet1)

which gives you this result (the order of the elements in a set is arbitrary):

set(['claps', 'to', 'you', 'my', 'Best'])

You can then get a list from that structure:

list(textSet2.difference(textSet1))

['claps', 'to', 'you', 'my', 'Best']
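
To check the result without creating any files, the same difference can be tried on the two example sentences directly. This is only a sketch: the variable names are illustrative, the English wording of the sentences is assumed from the question, and `sorted` is used because a set has no defined order, so a plain `list(...)` may come out in a different order than shown above.

```python
text1_words = set("Happy birthday dear friend".split())
text2_words = set("claps Happy birthday to you my dear Best friend".split())

# Words that occur in text2 but not in text1.
# The - operator is equivalent to text2_words.difference(text1_words).
extra = text2_words - text1_words

# Sorting gives a reproducible order (uppercase letters sort before lowercase).
extra_sorted = sorted(extra)
print(extra_sorted)  # ['Best', 'claps', 'my', 'to', 'you']
```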

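As you can read here, you should not worry about the size of large files, because with the given loader:

When the next line is read, the previous one will be garbage collected unless you have stored a reference to it somewhere else.

More about lazy file loading here.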
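
The lazy behaviour described here can be made explicit with a generator. A minimal sketch, assuming a hypothetical helper name `iter_words`; it relies only on the fact that iterating over a file object yields one line at a time:

```python
def iter_words(path):
    """Yield the words of a file one by one, reading a single line at a time."""
    with open(path, 'r') as handle:
        for line in handle:            # the file object yields lines lazily
            for word in line.split():  # split on any whitespace
                yield word
```

A set could then be built with `set(iter_words('text1.txt'))`. Only the current line and the set itself are held in memory, not the whole file.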

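Finally, in a real problem I would expect the first set (the bad words) to be relatively small and the second to hold a huge amount of data. If that is the case, you can avoid building the second set: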

diff = []
with open('text2.txt','r') as text2:
   for line in text2:
      for word in line.split():
         # keep only the words that are NOT in the first set
         if word not in textSet1:
             diff.append(word)
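
If the order of first appearance matters, or each extra word should be reported only once, the streaming loop can be extended slightly. The following is a sketch under those assumptions, not the answerer's code; the function name `extra_words` and the `seen` set are illustrative.

```python
def extra_words(base_path, candidate_path):
    """List the words of candidate_path that are absent from base_path,
    in order of first appearance and without repeats."""
    with open(base_path, 'r') as base:
        base_words = set(word for line in base for word in line.split())

    extras = []
    seen = set()  # extra words already reported
    with open(candidate_path, 'r') as candidate:
        for line in candidate:
            for word in line.split():
                if word not in base_words and word not in seen:
                    seen.add(word)
                    extras.append(word)
    return extras
```

Here only the words of the first file and the distinct extra words are kept in memory; the second file is read one line at a time.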

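[Discussion]:

Conceptually this is the right idea, but for some inputs you could run out of memory. In this case, though, I don't think that will happen.
Thanks @Salvatore Avanzo :) So problems like this should be solved with sets. Do I need to import any library for that?
@user3543477: Yes. The listed code does not need any imports.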