Get visible content of a page using selenium and BeautifulSoup: answers

【Question title】: Get visible content of a page using selenium and BeautifulSoup
【Posted】: 2017-02-12 10:54:03
【Question】:

I want to retrieve all the visible content of a web page, for example this web page. I am using a headless Firefox browser remotely with selenium.

The script I am using looks like this

driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
dom = BeautifulSoup(driver.page_source, parser)

f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))

with open('out.html', 'w') as fe:
    fe.write(dom.encode('utf-8'))

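This should load the page, parse the DOM, and then replace the iframe with id dsq-app1 with its visible content. If I execute these commands one by one in my Python command line, it works as expected: I can see the paragraphs with all the visible content. But when I execute all of them at once, either by running the script or by pasting the whole snippet into my interpreter, it behaves differently. The paragraphs are missing; the content is still there in JSON format, but that is not what I want.

Any idea why this might happen? Could it be related to replace_with?

【Comments】:

Tags: python html selenium beautifulsoup

【Solution 1】:

It sounds like the DOM elements have not loaded yet when your code tries to access them.

Try to wait for the elements to load fully, and only then do the replacement.

It works when you run it command by command because that gives the driver time to load all the elements before you execute the next command.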
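
One way to act on this advice is to poll for a concrete sign that the content has rendered before reading the page source. The following is a minimal sketch, not the answerer's own code: the helper names `wait_until` and `paragraph_count`, the `p` selector, and the timeout values are assumptions for illustration (the question only says that paragraphs appear once the frame has loaded). The only WebDriver call used is `execute_script`.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.2):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    (Hypothetical helper, not part of Selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll_interval)


def paragraph_count(driver):
    """Number of <p> elements in the document the driver is focused on."""
    return driver.execute_script(
        "return document.querySelectorAll('p').length")
```

After switching into the iframe, `wait_until(lambda: paragraph_count(driver))` would block until at least one paragraph exists, and `driver.page_source` is read only after that.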

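【Discussion】:

【Solution 2】:

To complement Or Duan's answer, here is what I ended up doing. Finding out whether a page, or some part of a page, has loaded completely is a complicated problem. I tried both implicit and explicit waits, but I still ended up with half-loaded frames. My workaround is to check the readyState of the original document and the readyState of the iframe.

Here is a sample function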

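from time import monotonic, sleep

def _check_if_load_complete(driver, timeout=10):
    """Poll document.readyState until it is 'complete' or `timeout` seconds pass."""
    deadline = monotonic() + timeout
    while monotonic() < deadline:
        if driver.execute_script('return document.readyState') == 'complete':
            return True
        sleep(0.1)  # short pause between polls
    return False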

I then call that function right after switching the driver's focus to the iframe

# switch_to.frame replaces the deprecated switch_to_frame
driver.switch_to.frame('dsq-app1')
_check_if_load_complete(driver, timeout=10)
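
Putting the pieces of this answer together, the whole sequence might look like the sketch below. It is an illustration under stated assumptions, not the answerer's exact script: `wait_for_ready_state` and `frame_source_when_ready` are hypothetical names, the timeout is measured in seconds, and focus is handed back to the top-level document at the end. It relies only on `execute_script`, `switch_to.frame`, `switch_to.default_content` and `page_source` from the WebDriver API.

```python
import time


def wait_for_ready_state(driver, timeout=10.0, poll_interval=0.1):
    """Return True once document.readyState is 'complete', False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script('return document.readyState') == 'complete':
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def frame_source_when_ready(driver, frame_id, timeout=10.0):
    """Return the iframe's HTML after both documents report 'complete'.

    Returns None if either document is still loading when the timeout expires.
    """
    # First check the top-level document.
    if not wait_for_ready_state(driver, timeout):
        return None
    driver.switch_to.frame(frame_id)
    try:
        # Then check the document inside the iframe.
        if not wait_for_ready_state(driver, timeout):
            return None
        return driver.page_source
    finally:
        # Always hand focus back to the top-level document.
        driver.switch_to.default_content()
```

The returned string can then be parsed with BeautifulSoup and passed to `replace_with`, as in the question. A `None` result signals that the frame did not finish loading in time, so a half-loaded frame is not silently written out.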

【Discussion】:

【Solution 3】:

Try fetching the page source only after the required ID, CSS_SELECTOR, CLASS or LINK has been detected.

You can always use Selenium WebDriver's explicit wait.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
# Wait up to 10 seconds for the element with the given id to appear in the DOM;
# idName is the id to wait for
f = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, idName)))
dom = BeautifulSoup(driver.page_source, parser)

f = dom.find('iframe', id='dsq-app1')
driver.switch_to.frame('dsq-app1')  # replaces the deprecated switch_to_frame
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))

# dom.encode() returns bytes, so the file is opened in binary mode
with open('out.html', 'wb') as fe:
    fe.write(dom.encode('utf-8'))

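Correct me if this doesn't work.

【Discussion】:

This was my first attempt, but it did not work properly, because the element can appear before it has actually finished loading. I think the whole topic of waiting for a page to load is a complex one, but I solved it by checking the iframe's readyState.
I have posted my solution as a separate answer; please take a look.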