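【Posted】: 2020-06-19 17:13:17
【Question】:
I'm going to this state website and trying to download the PDF files there that contain layoff information. When I run my code, I don't get any errors. However, the .pdf files are always corrupted: Adobe can't open them.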
from bs4 import BeautifulSoup
from requests import Session
import re
import urllib.request
import requests
import time
session = Session()
session.headers.update({
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
})
init_session = session.get(url="https://mn.gov/deed/programs-services/dislocated-worker/reports/")
soup = BeautifulSoup(init_session.content, "html.parser")
MN_1 = soup.find_all('a', {'href': re.compile(r'/deed/assets/mass-layoff.*')})
MN_1 = [str(a) for a in MN_1]
MN_1 = [a for a in MN_1 if "2020" in a]
MN_1 = [re.search("/deed.*pdf", a).group(0) for a in MN_1]
url_head = 'https://mn.gov'
# looping through list of urls to get all 2020 Minnesota WARN reports
# There's a problem here; all of the returned .pdfs are corrupted; I added the time.sleep() thinking
# maybe python just needed more time to render them or something; still get bad .pdfs
for url in range(len(MN_1)):
    time.sleep(5)
    url_u = url_head+MN_1[url]
    filename = 'Minnessota_WARN'+str(url)+'.pdf'
    stuff = requests.get(url_u)
    with open(filename, 'wb') as f:
        f.write(stuff.content)
【Discussion】:
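One way to narrow this down is to check what the server actually returned before writing it to disk. The code above downloads with a bare requests.get, which does not carry the custom user-agent set on session, so the response may be something other than a PDF (for example an HTML page). That cause is only an assumption here, not something the post confirms. Below is a minimal diagnostic sketch; the helper names looks_like_pdf and save_pdf are hypothetical, and session is assumed to be a requests.Session (any object with a compatible .get() works).

```python
from pathlib import Path

# A well-formed PDF file begins with this header.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header bytes."""
    return data.startswith(PDF_MAGIC)


def save_pdf(session, url: str, path: str) -> bool:
    """Fetch `url` with `session` and write it to `path` only if it is a PDF.

    `session` is assumed to be a requests.Session, so the request reuses
    the headers (such as the user-agent) already configured on it.
    Returns True when the file was written, False otherwise.
    """
    response = session.get(url)
    # Raise on 4xx/5xx instead of silently saving an error page.
    response.raise_for_status()
    if not looks_like_pdf(response.content):
        # Print what came back to help diagnose the problem,
        # e.g. a text/html Content-Type and the start of an HTML page.
        print(url, "->", response.headers.get("Content-Type"), response.content[:60])
        return False
    Path(path).write_bytes(response.content)
    return True
```

In the loop from the question, this would mean replacing the requests.get and open calls with save_pdf(session, url_u, filename). If it returns False, the printed Content-Type and leading bytes show what was saved in place of the PDF.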
Tags: python-3.x beautifulsoup python-requests