【Posted】:2016-09-20 11:06:20
【Question】:
I am trying to scrape images from an aspx page. Even after reading several threads I cannot figure out how to do it. Here is the original code:
from bs4 import BeautifulSoup as bs
import urlparse
import urllib2
from urllib import urlretrieve
import os
import sys
import subprocess
import re

def thefunc(url, out_folder):
    c = False
I have defined the request headers for the aspx page, and an if statement that distinguishes a normal page from an aspx page:
    select = raw_input('Is this a .net aspx page ? y/n : ')
    if select.lower().startswith('y'):
        usin = raw_input('Specify origin of .net page : ')
        usaspx = raw_input('Specify aspx page url : ')
The headers for the aspx page:
        headdic = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Origin': usin,
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Referer': usaspx,
            # note: urllib2 does not decompress gzip/deflate responses on its own
            'Accept-Encoding': 'gzip,deflate,sdch',
            'Accept-Language': 'en-US,en;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
        }
        c = True
    if c:
        req = urllib2.Request(url, headers=headdic)
    else:
        req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
    resp = urllib2.urlopen(req)
    soup = bs(resp, 'lxml')
    parsed = list(urlparse.urlparse(url))
    print '\n', len(soup.findAll('img')), 'images are about to be downloaded'
    for image in soup.findAll("img"):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        # replace the path component of the page URL with the image src
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        try:
            if image["src"].lower().startswith("http"):
                urlretrieve(image["src"], outpath)
            else:
                urlretrieve(urlparse.urlunparse(parsed), outpath)
        except Exception:
            print 'OOPS missed one for some reason !!'

put = raw_input('Please enter the page url : ')
# re.match returns None on failure instead of raising, so test the result
if not re.match(r'^https?://', put, re.IGNORECASE):
    print('Type the url carefully !!')
    sys.exit()
fol = raw_input('Enter the foldername to save the images : ')
if not os.path.isdir(fol):
    # the command and its argument must be passed as one list
    subprocess.call(['mkdir', fol])
thefunc(put, fol)
I made some changes to detect aspx pages and to build the headers for them, but I am stuck on what to change next.
Here is the aspx page link: http://www.foxrun.com.au/Products/Cylinders_with_Gadgets.aspx
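Sorry if I am not being clear; as you can see, I am new to programming. My question is: how do I get the images that the aspx page shows after I click the next-page button in the browser? Right now I can only scrape one page, because the URL does not change when I move to the next page. As far as I can tell, I somehow have to send an HTTP POST that tells the page to show the next page with the new pictures, since the URL stays the same. I hope that is clear.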
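To make the POST idea concrete, here is a minimal sketch of what I think such a request would have to contain. It assumes the page is a standard ASP.NET WebForms page whose next-page button triggers a postback, which I have not verified for this site. In that case the hidden form fields (`__VIEWSTATE`, `__EVENTVALIDATION`, and so on) have to be sent back together with `__EVENTTARGET` set to the name of the paging control. The names `HiddenFieldParser` and `build_postback_data`, the sample HTML, and the control name `ctl00$NextPage` are all hypothetical; the real control name would have to be read from the page source.

```python
# Sketch only: standard library, runs on Python 2.7 and Python 3.
try:
    from html.parser import HTMLParser      # Python 3
    from urllib.parse import urlencode
except ImportError:
    from HTMLParser import HTMLParser       # Python 2.7
    from urllib import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects the name/value pair of every <input type="hidden"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        if (attrs.get('type') or '').lower() == 'hidden' and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''


def build_postback_data(html, event_target, event_argument=''):
    """Return the form fields for an ASP.NET postback as a dict."""
    parser = HiddenFieldParser()
    parser.feed(html)
    data = dict(parser.fields)               # __VIEWSTATE, __EVENTVALIDATION, ...
    data['__EVENTTARGET'] = event_target     # control that "was clicked"
    data['__EVENTARGUMENT'] = event_argument
    return data


# Hypothetical page fragment, used only to show the shape of the result.
sample_html = '''
<form method="post" action="Cylinders_with_Gadgets.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
'''
form = build_postback_data(sample_html, 'ctl00$NextPage')  # hypothetical control name
body = urlencode(form)  # this string would be the POST body
```

With the script above, the encoded `body` would presumably be passed as the `data` argument of `urllib2.Request` along with `headdic`, which turns the request into a POST. The hidden fields would need to be re-read from every response, because `__VIEWSTATE` normally changes from page to page.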
【Discussion】:
Tags: python html asp.net python-2.7 web-scraping