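【Question title】: How to skip lines that already exist in a file?
【Posted】: 2019-07-10 23:34:43
【Question description】:

I know this looks like a simple question, but please read it through.

I want to extract HTML class names that match the following pattern: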

regex = re.compile(r'([\w-]+)-([#\w\d,%()\.]+)')

and write them to a separate file as CSS rules.

For that I have a dictionary of the properties and values we will use:

keyword = {
'c':'color',
'bg':'background',
'red':'#ed1a1a',
'blue':'#60a8ff'
#etc
}

Example:

HTML file: <div class="c-red bg-blue"> content </div>

Output in the CSS file:

.c-red{
color: red;
}
.bg-blue{
background: blue;
}
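
As a quick check of what the pattern captures, the sketch below runs it on the sample line and builds the rule text from the keyword dictionary. The helper name rule_for is hypothetical and not part of the original script. Because the values are looked up in keyword, the generated rules carry the mapped hex codes (#ed1a1a, #60a8ff) rather than the literal colour names.

```python
import re

regex = re.compile(r'([\w-]+)-([#\w\d,%()\.]+)')
keyword = {'c': 'color', 'bg': 'background', 'red': '#ed1a1a', 'blue': '#60a8ff'}

def rule_for(prop, value):
    """Hypothetical helper: return the CSS rule text, or None if either part is unknown."""
    if prop in keyword and value in keyword:
        return '.%s-%s{\n%s: %s;\n}' % (prop, value, keyword[prop], keyword[value])
    return None

line = '<div class="c-red bg-blue"> content </div>'
pairs = regex.findall(line)   # list of (prop, value) tuples, one per class name
print(pairs)
for prop, value in pairs:
    print(rule_for(prop, value))
```

Each class name is split at its last hyphen, because the first group is greedy and also accepts hyphens.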

Here is my script, which basically does that:

regex = re.compile(r'([\w-]+)-([#\w\d,%()\.]+)')
with open('index.html', 'r') as file:
  with open('style.css', 'a+') as newfile:
    lines = file.readlines()
    for line in lines:
        if 'class="' in line:
          to_replace = regex.findall(line)
          for key in to_replace:         
              prop=key[0]  
              value=key[1] 
              name='.'+prop+'-'+value
              if prop and value in keyword:
                var1 =('\n'+name+'{'+
                  '\n'+keyword[prop]+': '+
                  keyword[value]+';'+
                  '\n'+'}')
                newfile.write(var1)

But if I have several similar HTML lines, for example:

<div class="c-red bg-blue"> content </div>
<div class="c-red bg-blue"> content2 </div>
<div class="c-red bg-blue"> content2 </div>

the script writes as many CSS rules as there are matching lines in the HTML file.

How can I prevent this duplication?

I tried:

var1=''

and

if var1 in newfile:
  break
else:
  newfile.write(var1)

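But neither of these works.

【Comments】:

Do you know BeautifulSoup?
The problem is in if var1 in newfile: newfile is the file object, not the file's contents. If you want the contents, you have to read the file.
@Matej Yes, I tried it in a+ mode.
Just store the var1s in a set or something, and check whether they are already there before writing.
@valeria Yes, that's right, but you have to read the file, e.g. if var1 in newfile.read(), though that is not very efficient.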
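
The point made in these comments can be reproduced with a small sketch. It assumes a throwaway file in a temporary directory instead of the real style.css. Applying in to a file object iterates over its lines, and a+ mode opens the file positioned at the end, so a multi-line rule is never found that way. Rewinding and reading the text works.

```python
import os
import tempfile

rule = '\n.c-red{\ncolor: #ed1a1a;\n}'

# Throwaway file for the demo, pre-filled with one rule.
path = os.path.join(tempfile.mkdtemp(), 'style.css')
with open(path, 'w') as f:
    f.write(rule)

with open(path, 'a+') as newfile:
    # `in` on a file object compares against individual lines, and 'a+'
    # starts at the end of the file, so the multi-line rule is not found.
    found_in_file_object = rule in newfile

    newfile.seek(0)                         # rewind before reading
    found_in_text = rule in newfile.read()  # substring search on the contents

print(found_in_file_object, found_in_text)  # False True
```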

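Tags: python regex web-scraping

【Solution 1】:

Add each rule to a set when you write it, then simply check the set before writing. Note that this does not check for rules written to the file by a previous run.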

import re

written = set()  # rules already written during this run

regex = re.compile(r'([\w-]+)-([#\w\d,%()\.]+)')
with open('index.html', 'r') as file:
    with open('style.css', 'a+') as newfile:
        for line in file:
            if 'class="' in line:
                for prop, value in regex.findall(line):
                    name = '.' + prop + '-' + value
                    # both parts must be keys of keyword (the dict from the question)
                    if prop in keyword and value in keyword:
                        var1 = ('\n' + name + '{' +
                                '\n' + keyword[prop] + ': ' +
                                keyword[value] + ';' +
                                '\n' + '}')
                        if var1 not in written:  # check whether you already wrote it
                            newfile.write(var1)  # if not, write it
                            written.add(var1)    # remember it so later duplicates are skipped
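
To also skip rules left in style.css by an earlier run, one option is to seed the set from the existing file before the loop. This is a minimal sketch, assuming every rule in the file was written by this script in the .name{ ... } form. The helper load_written is hypothetical, the set is keyed on the selector rather than the full rule text, and the demo uses a throwaway file.

```python
import os
import re
import tempfile

# Matches a selector such as ".c-red" at the start of a line, up to the "{".
selector_re = re.compile(r'^(\.[^\s{]+)\{', re.MULTILINE)

def load_written(path):
    """Hypothetical helper: return the set of selectors already in the CSS file."""
    if not os.path.exists(path):
        return set()
    with open(path, 'r') as f:
        return set(selector_re.findall(f.read()))

# Demo on a throwaway file that already holds one rule from a previous run.
path = os.path.join(tempfile.mkdtemp(), 'style.css')
with open(path, 'w') as f:
    f.write('\n.c-red{\ncolor: #ed1a1a;\n}')

written = load_written(path)
candidates = [('.c-red', 'color: #ed1a1a;'), ('.bg-blue', 'background: #60a8ff;')]
with open(path, 'a') as newfile:
    for name, body in candidates:
        if name not in written:      # skip selectors already in the file
            newfile.write('\n' + name + '{\n' + body + '\n}')
            written.add(name)

with open(path) as f:
    result = f.read()
print(result)
```

Keying on the selector means a rule is skipped even if its body has changed since the earlier run. Whether that is desirable depends on the use case.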

【Comments】:

Oh, this one seems to work better in my case! Thanks
@Matej True, I'll add that as a note.

【Solution 2】:

I edited your code:

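import re

keyword = {
    'c': 'color',
    'bg': 'background',
    'red': '#ed1a1a',
    'blue': '#60a8ff'
    # etc
}

regex = re.compile(r'([\w-]+)-([#\w\d,%()\.]+)')
with open('index.html', 'r') as file:
    with open('style.css', 'a+') as newfile:
        newfile.seek(0)           # 'a+' opens at the end of the file; rewind before reading
        content = newfile.read()  # rules already in the stylesheet

        for line in file:
            if 'class="' in line:
                for prop, value in regex.findall(line):
                    name = '.' + prop + '-' + value
                    # both parts must be known keys
                    if prop in keyword and value in keyword:
                        var1 = ('\n' + name + '{' + '\n' + keyword[prop] + ': ' + keyword[value] + ';' + '\n' + '}')

                        if var1 not in content:
                            newfile.write(var1)  # append mode always writes at the end
                            content += var1

newfile.seek(0) moves to the start of the file, because a+ mode opens it positioned at the end and read() would otherwise return an empty string. content = newfile.read() then reads the existing stylesheet into a variable. For each new var1, the script looks for it in content; if var1 is not there, it writes it to the file and appends it to content.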

Output:

.c-red{
color: #ed1a1a;
}
.bg-blue{
background: #60a8ff;
}
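
Separately from the duplication issue, the question's script tests if prop and value in keyword:. Python parses this as prop and (value in keyword), so only value is looked up in the dictionary. A class such as x-red would pass the test and then fail on keyword[prop]. A small sketch of the difference, using the hypothetical prefix x:

```python
keyword = {'c': 'color', 'bg': 'background', 'red': '#ed1a1a', 'blue': '#60a8ff'}

prop, value = 'x', 'red'   # 'x' is not a key in keyword

loose = bool(prop and value in keyword)        # parsed as: prop and (value in keyword)
strict = prop in keyword and value in keyword  # checks both keys

print(loose, strict)  # True False

try:
    keyword[prop]          # what the script would do after the loose test passes
except KeyError as exc:
    print('KeyError:', exc)
```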

【Comments】: