【Posted】: 2013-07-27 20:51:14
【Question】:
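I am modifying this script to scrape pages like this for book page images. Used unmodified from Stack Overflow, the script downloads every image correctly except the one I actually want. That image is saved as an empty file whose name looks like this: img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png.
In my modified version below, I pull only the book page image.
Here is my script: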
from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
import sys

out_folder = '/Users/Craig/Desktop/img'

def main(url, out_folder):
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))
    for image in soup.findAll('img', id='page_image'):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlparse.urlunparse(parsed), outpath)

def _usage():
    print "usage: python dumpimages.py http://example.com [outpath]"

if __name__ == "__main__":
    url = sys.argv[-1]
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1)
    main(url, out_folder)
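To make the URL handling easier to inspect, here is a minimal sketch of what the script does with a src of this shape, next to what the standard library's urljoin would produce. It is written for Python 3, where the urlparse module lives in urllib.parse, and it makes no network requests. The page URL is hypothetical, and it assumes the src attribute is a relative URL with a query string, as the saved file name suggests. The helper names are illustrative only. This may or may not be what causes the empty file.

```python
from urllib.parse import parse_qs, urljoin, urlparse, urlunparse


def url_as_script_builds_it(page_url, src):
    """Mimic the script: overwrite the page URL's path component with the raw src."""
    parsed = list(urlparse(page_url))
    parsed[2] = src  # the src's query string lands inside the path component
    return urlunparse(parsed)


def url_resolved_with_urljoin(page_url, src):
    """Resolve src against the page URL the way a browser would."""
    return urljoin(page_url, src)


def filename_from_query(image_url, default="page.png"):
    """Take the output file name from the 'file' query parameter, if present."""
    query = parse_qs(urlparse(image_url).query)
    return query.get("file", [default])[0]


if __name__ == "__main__":
    # Hypothetical page URL; the src matches the file name from the question.
    page_url = "http://example.com/books/view.php?id=42"
    src = "img.php?dir=39d761947ad84e71e51e3c300f7af8ff&file=1.png"

    print("script builds: ", url_as_script_builds_it(page_url, src))
    resolved = url_resolved_with_urljoin(page_url, src)
    print("urljoin gives: ", resolved)
    print("file name:     ", filename_from_query(resolved))
```

Under these assumptions the two URLs differ: the script's version drops the page's directory and appends the page's own query string after the image's, while urljoin keeps the directory and only the image's query. Printing both for the real page is a quick way to check whether the script is requesting the URL it is meant to.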
Any ideas?
【Comments】:
Tags: python parsing scripting web-scraping