How can I parse the table content from a website using Selenium?

【Question title】: How can I parse the table content from the website using Selenium?
【Posted】: 2018-07-17 06:28:11
【Question】:

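I am trying to parse a table on a sports website into a list of dictionaries so that I can render it in a template. This is my first time using Selenium. I read the Selenium documentation and wrote this program: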

from bs4 import BeautifulSoup
import time
from selenium import webdriver

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()

browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")

print(len(soup.find_all("table")))
print(soup.find("table", {"class": "ratingstable"}))

browser.close()
browser.quit()

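The output I get is 0 and None. How should I modify this to get all the values of the table and store them in a list of dictionaries? If you have any further questions, feel free to ask.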

【Comments】:

Tags: python python-3.x selenium parsing beautifulsoup

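【Solution 1】:

First, avoid time.sleep(). A fixed sleep either wastes time when the page loads quickly or fails when it loads slowly. Use an explicit wait (WebDriverWait) instead.

If you inspect the table, you can see that it sits inside an <iframe> tag with name="testbat". The page source of the top-level document does not include the iframe's contents, which is why your search finds no tables. You have to switch to that frame to get the table's contents. You can do it like this: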

browser.switch_to.default_content()
browser.switch_to.frame('testbat')

After switching frames, use the explicit wait mentioned above.

Complete code:

from bs4 import BeautifulSoup
from selenium import webdriver

# Add the following imports to your program
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)

browser.switch_to.default_content()
browser.switch_to.frame('testbat')

try:
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'ratingstable')))
except TimeoutException:
    pass  # Handle the time out exception

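# find_element_by_class_name() was removed in later Selenium 4 releases;
# find_element(By.CLASS_NAME, ...) works in both Selenium 3 and 4.
html = browser.find_element(By.CLASS_NAME, 'ratingstable').get_attribute('innerHTML')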
soup = BeautifulSoup(html, "lxml")

You can check whether you got the table:

>>> print('S.P.D. Smith' in html)
True
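
To get from the table's HTML to the list of dictionaries the question asks for, something along these lines may work. It is only a sketch: the function name table_to_dicts is made up for illustration, the sample rows and column names are invented placeholders rather than data from the site, and it assumes the first non-empty row of the table holds the column headers.

```python
from bs4 import BeautifulSoup


def table_to_dicts(table_html):
    """Turn the rows of an HTML table into a list of dicts keyed by header text."""
    # "html.parser" ships with Python; "lxml" should work as well if installed.
    soup = BeautifulSoup(table_html, "html.parser")
    headers = []
    records = []
    for row in soup.find_all("tr"):
        texts = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if not texts:
            continue
        if not headers:
            # Assumption: the first non-empty row contains the column names.
            headers = texts
            continue
        if len(texts) != len(headers):
            # Skip rows that do not line up with the header (e.g. spanning rows).
            continue
        records.append(dict(zip(headers, texts)))
    return records


# Invented sample markup, used only to show the shape of the result.
sample = """
<table class="ratingstable">
  <tr><th>Rank</th><th>Name</th><th>Country</th><th>Rating</th></tr>
  <tr><td>1</td><td>Player A</td><td>AUS</td><td>900</td></tr>
  <tr><td>2</td><td>Player B</td><td>IND</td><td>890</td></tr>
</table>
"""

if __name__ == "__main__":
    for record in table_to_dicts(sample):
        print(record)
```

With the Selenium code above, you would pass the html string to table_to_dicts(html) instead of the sample. If the real table has a title row or merged cells above the headers, the header detection would need adjusting.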

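【Comments】:

How do I handle the other tables on the page? They are all inside frames with the same name. I tried selecting the frame by a different attribute, and I get an error saying there is no such frame.
I think you should ask that as a separate question: how to select different frames that have the same or similar names.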