Selenium Web Scraping with Beautiful Soup on Dynamic Content and Hidden Data Table: Answers

【Question title】: Selenium Web Scraping With Beautiful Soup on Dynamic Content and Hidden Data Table
【Posted】: 2018-02-13 19:09:27
【Question description】:

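I really need this community's help!

I am using Selenium and Beautiful Soup in Python to scrape dynamic content. The problem is that the pricing data table does not show up in the parsed HTML, even with the following code: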

html=browser.execute_script('return document.body.innerHTML')
sel_soup=BeautifulSoup(html, 'html.parser')

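However, I later found that if I click the "View All Prices" button on the page before running the code above, the data table does get parsed into Python.

My question is: how can I parse and access those hidden, dynamically loaded td tags in Python without using Selenium to click every "View All Prices" button? There are a great many of them.

The page I am scraping is https://www.cruisecritic.com/cruiseto/cruiseitineraries.cfm?port=122, and the attached image shows the HTML of the dynamic data table I need.

Many thanks to this community for the help!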

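【Comments on the question】:

Tags: python selenium dynamic web-scraping beautifulsoup

【Solution 1】:

You should locate the element once it has loaded and fetch its HTML through arguments[0], rather than pulling the whole page through document: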

html_of_interest=driver.execute_script('return arguments[0].innerHTML',element)
sel_soup=BeautifulSoup(html_of_interest, 'html.parser')
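
Once the fragment is in hand, the cells can be pulled out with Beautiful Soup alone. The following is a minimal sketch, not the answerer's code: the helper name table_rows and the sample markup are hypothetical, and the real page's structure may differ. It also illustrates that Beautiful Soup ignores CSS visibility, so a row hidden with display:none is still parsed as long as it is present in the HTML that Selenium returns.

```python
from bs4 import BeautifulSoup


def table_rows(html_fragment):
    """Return each <tr> in the fragment as a list of stripped cell texts."""
    soup = BeautifulSoup(html_fragment, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:  # skip rows that contain no cells at all
            rows.append(cells)
    return rows


# Hypothetical markup standing in for the innerHTML returned by Selenium.
sample = """
<table>
  <tr><th>Cabin</th><th>Price</th></tr>
  <tr><td>Inside</td><td>$499</td></tr>
  <tr style="display:none"><td>Suite</td><td>$1,299</td></tr>
</table>
"""

for row in table_rows(sample):
    print(row)
```

If the hidden rows are missing from the result, they are most likely not in the DOM yet (for example, because they are fetched only after a click), rather than merely hidden by CSS.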

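In practice, there are two cases to handle when locating the element:

1

The element has not been loaded into the DOM yet, and you need to wait for it: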

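from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Chrome()
delay = 10  # maximum seconds to wait for the element; tune for the page
browser.get("url")
sleep(2)  # experimental pause: get() usually returns only after the page has loaded, but some JavaScript may still run after the load event

try:
    element = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'your_id_of_interest')))
    print("element is ready, do the thing!")
    html_of_interest = browser.execute_script('return arguments[0].innerHTML', element)
    sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
except TimeoutException:
    print("Something's wrong!")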

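2

The element is inside a shadow root, and you need to expand the shadow root first. This is probably not your case, but I mention it here for future reference. For example: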

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()


def expand_shadow_element(element):
    # Return the shadow root attached to the given host element
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root

driver.get("chrome://settings")
root1 = driver.find_element(By.TAG_NAME, 'settings-ui')

html_of_interest = driver.execute_script('return arguments[0].innerHTML', root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # empty: the shadow root has not been expanded yet

shadow_root1 = expand_shadow_element(root1)

html_of_interest = driver.execute_script('return arguments[0].innerHTML', shadow_root1)
sel_soup = BeautifulSoup(html_of_interest, 'html.parser')
sel_soup  # now holds the markup from inside the shadow root

【Comments】:

Thank you very much for the detailed explanation! I will try it as soon as I can. Thanks again!