How to get the contents of frames automatically if the browser does not support frames and the frame can't be accessed directly

【Question title】: How to get contents of frames automatically if browser does not support frames + can't access frame directly
【Posted】: 2014-09-10 13:45:31
【Question description】:

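I am trying to automatically download PDFs from URLs like this one, in order to build a library of UN resolutions.

If I open that URL with Beautiful Soup or mechanize, I get "Your browser does not support frames". I get the same result if I use the "Copy as cURL" feature in Chrome DevTools.

The standard advice for "Your browser does not support frames" when using mechanize or Beautiful Soup is to follow the source of each individual frame and load that frame instead. But if I do that, I get an error message saying that the page is not authorized.

How should I proceed? I suppose I could try this in zombie or phantom, but I would rather not use those tools because I am not very familiar with them.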

【Question discussion】:

Tags: python web-scraping html-parsing beautifulsoup mechanize

【Solution 1】:

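OK, this is an interesting task for requests and BeautifulSoup.

A set of underlying calls to un.org and daccess-ods.un.org is important because those calls set the relevant cookies. This is why you need to maintain a requests.Session() and visit several URLs before accessing the PDF.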
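
To make the role of the cookies concrete, here is a minimal offline sketch of the same idea. It is an illustration under stated assumptions, not the UN site's actual behavior: the toy local server, its `/setup` and `/doc` paths, and the `visited` cookie are all hypothetical, and it uses the standard library's `urllib` with a `CookieJar` in place of `requests.Session()` so that it runs without third-party packages or network access.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class GateHandler(BaseHTTPRequestHandler):
    """Toy server (hypothetical): /setup sets a cookie, other paths require it."""

    def do_GET(self):
        if self.path == '/setup':
            self.send_response(200)
            self.send_header('Set-Cookie', 'visited=1; Path=/')
            self.end_headers()
            self.wfile.write(b'cookie set')
        elif 'visited=1' in self.headers.get('Cookie', ''):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'document body')
        else:
            self.send_response(403)
            self.end_headers()
            self.wfile.write(b'not authorized')

    def log_message(self, *args):
        pass  # keep the demo quiet


def make_opener(jar=None):
    """Build an opener that ignores system proxies; share cookies if a jar is given."""
    handlers = [ProxyHandler({})]
    if jar is not None:
        handlers.append(HTTPCookieProcessor(jar))
    return build_opener(*handlers)


def fetch_status(opener, url):
    """Return the HTTP status code for url, treating 4xx/5xx as data, not errors."""
    try:
        with opener.open(url) as response:
            return response.status
    except HTTPError as error:
        return error.code


def run_demo():
    server = HTTPServer(('127.0.0.1', 0), GateHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = 'http://127.0.0.1:%d' % server.server_port
    try:
        # Without a cookie jar, going straight to the document is refused.
        direct = fetch_status(make_opener(), base + '/doc')
        # With a shared cookie jar, visiting /setup first makes /doc succeed.
        opener = make_opener(CookieJar())
        setup = fetch_status(opener, base + '/setup')
        after_setup = fetch_status(opener, base + '/doc')
    finally:
        server.shutdown()
        server.server_close()
    return direct, setup, after_setup
```

Calling `run_demo()` returns the three status codes in order: the direct request, the setup request, and the document request made after setup. A `requests.Session()` plays the same role as the shared cookie jar here: it stores cookies from earlier responses and sends them with later requests.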

Here is the complete code:

import re
from urllib.parse import urljoin  # Python 3; on Python 2 use `from urlparse import urljoin`

from bs4 import BeautifulSoup
import requests


BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'

# start a session so that cookies set by earlier responses are sent with later requests
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})

# get frame links (the unpacking assumes the page has exactly two frames)
soup = BeautifulSoup(response.text, 'html.parser')
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]

# get header (the response body is not used; the request is made for its cookies)
session.get(header_link, headers={'Referer': URL})

# get document html url from the meta refresh tag of the document frame
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text, 'html.parser')

content = soup.find('meta', content=re.compile(r'URL='))['content']
document_html_link = re.search(r'URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)

# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text, 'html.parser')

# get the real document link
content = soup.find('meta', content=re.compile(r'URL='))['content']
document_link = re.search(r'URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print(document_link)

# follow the frame link with login and password first - would set the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)

# download file in 1024-byte chunks instead of loading it into memory at once
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)

    for block in response.iter_content(1024):
        if not block:
            break

        handle.write(block)

You should probably extract the separate blocks of code into functions to make it more readable and reusable.
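
As one possible way to do that extraction, the two parsing steps (collecting frame sources and reading a meta refresh target) could be pulled into small helpers. This is a sketch, not the answer's own code: the names `frame_sources` and `meta_refresh_target` are hypothetical, and it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages. The same functions could equally be written on top of `soup.find_all('frame')` and `soup.find('meta', ...)`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _TagCollector(HTMLParser):
    """Collect the attributes of every <frame> and <meta> start tag."""

    def __init__(self):
        super().__init__()
        self.frames = []
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == 'frame':
            self.frames.append(dict(attrs))
        elif tag == 'meta':
            self.metas.append(dict(attrs))


def _collect(html):
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector


def frame_sources(html, base_url):
    """Return the absolute URL of each <frame src=...>, in document order."""
    return [urljoin(base_url, frame['src'])
            for frame in _collect(html).frames if frame.get('src')]


def meta_refresh_target(html, base_url):
    """Return the absolute URL of the first meta refresh redirect, or None."""
    for meta in _collect(html).metas:
        match = re.search(r'URL=(.*)', meta.get('content') or '', re.IGNORECASE)
        if match:
            return urljoin(base_url, match.group(1).strip())
    return None
```

With helpers like these, the main script reduces to a sequence of `session.get(...)` calls whose response text is passed to `frame_sources` or `meta_refresh_target`, which also makes each parsing step testable against a saved HTML snippet without touching the network.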

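FYI, all of this could be done more easily with a real browser, with the help of selenium or Ghost.py.

Hope that helps.

【Discussion】:

Thanks for your help. When you mention Ghost.py at the end, do you mean GhostDriver? I googled and did not find much about how to combine selenium and Ghost.py. Could you provide more information?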