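【Posted】: 2018-12-06 10:52:52
【Question】:
For my coursework I have to build a web scraper that crawls a website for images, Word documents, and PDFs and downloads them to a folder. The image download works, but when I change the code to download docs or PDFs, it finds none at all. I am using BeautifulSoup to scrape the site, and I know the site contains docs and PDFs; I just cannot get them to download.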
from bs4 import BeautifulSoup
import urllib.request
import shutil
import requests
from urllib.parse import urljoin
import sys
import time
import os
import url
import hashlib
import re
url = 'http://www.soc.napier.ac.uk/~40009856/CW/'
path=('c:\\temp\\')
def ensure_dir(path):
    directory = os.path.dirname(path)
    if not os.path.exists(path):
        os.makedirs(directory)
    return path
os.chdir(ensure_dir(path))
def webget(url):
    response = requests.get(url)
    html = response.content
    return html
def get_docs(url):
    soup = make_soup(url)
    docutments = [docs for docs in soup.findAll('doc')]
    print (str(len(docutments)) + " docutments found.")
    print('Downloading docutments to current working directory.')
    docutments_links = [each.get('src') for each in docutments]
    for each in docutments_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print ('Getting: ' + filename)
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print(' An error occured. Continuing.')
    print ('Done.')
if __name__ == '__main__':
    get_docs(url)
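A likely reason the search comes back empty, offered as a guess rather than a confirmed diagnosis: `soup.findAll('doc')` looks for elements named `<doc>`, and HTML has no such tag. Images are embedded with `<img src=...>`, but Word documents and PDFs are normally just linked with `<a href=...>`, so they have to be found by the extension of the link target. The sketch below shows only that filtering step, using the standard library. The function name `document_urls`, the `DOC_EXTENSIONS` tuple, and the sample hrefs are assumptions made for illustration; they are not part of the original code.

```python
from urllib.parse import urljoin, urlparse

# Assumed set of extensions; adjust to whatever the coursework requires.
DOC_EXTENSIONS = ('.doc', '.docx', '.pdf')


def document_urls(hrefs, base_url, extensions=DOC_EXTENSIONS):
    """Return absolute URLs for the hrefs whose path ends in a document extension."""
    found = []
    for href in hrefs:
        if not href:
            continue  # skip anchors with a missing or empty href
        absolute = urljoin(base_url, href.strip())
        # Compare only the path so query strings and fragments are ignored.
        path = urlparse(absolute).path.lower()
        if path.endswith(tuple(extensions)):
            found.append(absolute)
    return found


# Hypothetical hrefs, as they might come from the <a> tags on the page.
sample = ['files/report.pdf', 'notes.DOCX?v=2', 'images/logo.png', None]
print(document_urls(sample, 'http://www.soc.napier.ac.uk/~40009856/CW/'))
```

With the question's `soup` object, the hrefs could be collected with `[a['href'] for a in soup.find_all('a', href=True)]` and passed to this function. Each returned URL can then go through the same `requests.get(src, stream=True)` download loop that already works for images.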
【Comments】:
Tags: python web-scraping download