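【Posted】:2018-05-10 13:49:43
【Problem description】:
I have a script that reads an HTML file and extracts the relevant lines from it. However, I have a problem printing the file names. The files are named source1.html, source2.html and source3.html, but the script prints source2.html, source3.html and source4.html instead.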
from bs4 import BeautifulSoup
import re
import os.path

n = 1
filename = "source"+str(n)+".html"
savefile = open('OUTPUT.csv', 'w')
while os.path.isfile(filename):
    n = n+1
    strjpgs = "Extracted Layers: \n \n"
    file = open(filename, "r")
    filename = "source"+str(n)+".html"
    soup = BeautifulSoup (file, "html.parser")
    thedata = soup.find("div", class_="cplayer")
    strdata = str(thedata)
    DoRegEx = re.compile('/([^/]+)\.jpg')
    jpgs = DoRegEx.findall(strdata)
    strjpgs = strjpgs + "\n".join(jpgs) + "\n \n"
    savefile.write(filename + '\n')
    savefile.write(strjpgs)
    print(filename)
    print(strjpgs)
savefile.close()
print "done"
【Discussion】:
-
Increment n (and update filename) at the end of the loop rather than at the beginning.
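A minimal sketch of that suggestion, with the parsing and writing steps left out so that only the loop ordering is visible. The function name process_sources and its directory parameter are hypothetical, added only to make the example self-contained. In the original script, filename is reassigned right after the file is opened, so every later write and print already shows the next file's name.

```python
import os.path


def process_sources(directory="."):
    """Return the names source1.html, source2.html, ... in the order they are processed.

    `process_sources` and `directory` are illustrative names, not part of the
    original script.
    """
    processed = []
    n = 1
    filename = "source" + str(n) + ".html"
    while os.path.isfile(os.path.join(directory, filename)):
        # Parse the file and write/print the results here, while `filename`
        # still refers to the file that is actually being processed.
        processed.append(filename)

        # Advance to the next file only after all work for this one is done.
        n = n + 1
        filename = "source" + str(n) + ".html"
    return processed
```

With the increment moved to the bottom of the loop body, the name that is written and printed is the name of the file that was just read.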
Tags: python python-2.7 loops