【Question title】: BeautifulSoup doesn't find tables on webpage
【Posted】: 2021-01-24 18:15:24
【Question】:

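I'm trying to get the data from the first table on a website. I've looked at similar questions here and tried a number of the suggested solutions, but I can't seem to find the table, let alone the data inside it.

I tried: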

from bs4 import BeautifulSoup  
from selenium import webdriver  
driver = webdriver.Chrome('C:\\folder\\chromedriver.exe')  
url = 'https://docs.microsoft.com/en-us/windows/release-information/'  
driver.get(url)  

tbla = driver.find_element_by_name('table') #attempt using by element name  
tblb = driver.find_element_by_class_name('cells-centered') #attempt using by class name  
tblc = driver.find_element_by_xpath('//*[@id="winrelinfo_container"]/table[1]') #attempt by using xpath  

and also tried using BeautifulSoup:

html = driver.page_source
soup = BeautifulSoup(html,'html.parser')
table = soup.find("table", {"class": "cells-centered"})
print(len(table))

Any help is much appreciated.

【Comments】:

    Tags: python selenium iframe beautifulsoup webdriverwait


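    【Solution 1】:

    The table is inside an iframe, so you need to switch to the iframe before you can access the table.

    Use WebDriverWait() with frame_to_be_available_and_switch_to_it() and the locator below to wait for the iframe and switch into it.

    Then use WebDriverWait() with visibility_of_element_located() and the locator below to wait for the table.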

    driver.get("https://docs.microsoft.com/en-us/windows/release-information/")
    WebDriverWait(driver,10).until(EC.frame_to_be_available_and_switch_to_it((By.ID,"winrelinfo_iframe")))
    table=WebDriverWait(driver,10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"table.cells-centered")))
    

    You need the following imports.

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

    Alternatively, use the code below with an XPath locator:

    driver.get("https://docs.microsoft.com/en-us/windows/release-information/")
    WebDriverWait(driver,10).until(EC.frame_to_be_available_and_switch_to_it((By.ID,"winrelinfo_iframe")))
    table=WebDriverWait(driver,10).until(EC.presence_of_element_located((By.XPATH,'//*[@id="winrelinfo_container"]/table[1]')))
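
To reason about what the XPath `//*[@id="winrelinfo_container"]/table[1]` selects, here is a small offline sketch using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The sample markup is invented for illustration and is not the real page. ElementTree also needs well-formed XML and a leading `.` on the path, so this is only a way to check the expression, not a replacement for the Selenium call.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the iframe's document.
SAMPLE = """
<html>
  <body>
    <div id="winrelinfo_container">
      <table class="first"><tr><td>2004</td></tr></table>
      <table class="second"><tr><td>1909</td></tr></table>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# Any element with id="winrelinfo_container", then its first <table> child.
first_table = root.find(".//*[@id='winrelinfo_container']/table[1]")

first_class = first_table.get("class")
first_cell = first_table.find("./tr/td").text
print(first_class, first_cell)
```

The `[1]` index is 1-based and counts `table` children of the container, so only the first table is returned even when several follow it.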
    

    You can then load the table into a pandas DataFrame and export it to a CSV file. This requires importing pandas.

    from io import StringIO  # recent pandas versions expect a file-like object, not a raw HTML string

    driver.get("https://docs.microsoft.com/en-us/windows/release-information/")
    WebDriverWait(driver,10).until(EC.frame_to_be_available_and_switch_to_it((By.ID,"winrelinfo_iframe")))
    # outerHTML returns the <table> markup as a string
    table=WebDriverWait(driver,10).until(EC.presence_of_element_located((By.XPATH,'//*[@id="winrelinfo_container"]/table[1]'))).get_attribute('outerHTML')
    df=pd.read_html(StringIO(table))[0]
    print(df)
    df.to_csv("path/to/csv")
    

    Install pandas: pip install pandas

    Then add the import below

    import pandas as pd
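
If pandas is not available, the same table-to-CSV step can be approximated with the standard library alone. The sketch below is an assumption-laden alternative, not the answer's method: `TableRows`, `table_html_to_csv`, and the sample markup are hypothetical names and data made up for this example, and the parser ignores `rowspan`/`colspan` and nested tables.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_html_to_csv(table_html):
    """Return the table's rows as CSV text."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Made-up stand-in for the outerHTML string returned by get_attribute().
SAMPLE_TABLE = (
    "<table><tr><th>Version</th><th>OS build</th></tr>"
    "<tr><td>2004</td><td>19041.546</td></tr></table>"
)

csv_text = table_html_to_csv(SAMPLE_TABLE)
print(csv_text)
```

For anything beyond a simple grid, `pd.read_html` is likely the more robust choice, since it handles spans and header detection.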
    

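    【Comments】:

    • Ah, iframes... still a lot to learn :-). This is exactly what I needed. My next step is to create a df with the data to import into a SQL database, so this is perfect. Thanks a lot, @KunduK, much appreciated!
    【Solution 2】:

    The table is inside an <iframe>, so BeautifulSoup does not see it in the original page: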

    import requests 
    from bs4 import BeautifulSoup
    
    
    url = 'https://docs.microsoft.com/en-us/windows/release-information/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    soup = BeautifulSoup(requests.get(soup.select_one('iframe')['src']).content, 'html.parser')
    
    for row in soup.select('table tr'):
        print(row.get_text(strip=True, separator='\t'))
    

    Prints:

    Version Servicing option    Availability date   OS build    Latest revision date    End of service: Home, Pro, Pro Education, Pro for Workstations and IoT Core End of service: Enterprise, Education and IoT Enterprise
    2004    Semi-Annual Channel 2020-05-27  19041.546   2020-10-01  2021-12-14  2021-12-14  Microsoft recommends
    1909    Semi-Annual Channel 2019-11-12  18363.1110  2020-09-16  2021-05-11  2022-05-10
    1903    Semi-Annual Channel 2019-05-21  18362.1110  2020-09-16  2020-12-08  2020-12-08
    1809    Semi-Annual Channel 2019-03-28  17763.1490  2020-09-16  2020-11-10  2021-05-11
    1809    Semi-Annual Channel (Targeted)  2018-11-13  17763.1490  2020-09-16  2020-11-10  2021-05-11
    1803    Semi-Annual Channel 2018-07-10  17134.1726  2020-09-08  End of service  2021-05-11
    
    ...and so on.
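
One detail worth checking in the approach above is whether the iframe's `src` is an absolute URL. If it is relative, passing it straight to `requests.get` fails, and it has to be resolved against the page URL first. The following is a minimal, dependency-free sketch of that step. `FirstIframeSrc`, `iframe_url`, and the iframe path in the sample HTML are hypothetical and invented for illustration; the real page's `src` may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstIframeSrc(HTMLParser):
    """Remember the src attribute of the first <iframe> encountered."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == "iframe" and self.src is None:
            self.src = dict(attrs).get("src")


def iframe_url(page_url, page_html):
    """Return the absolute URL of the first iframe, or None if there is none."""
    parser = FirstIframeSrc()
    parser.feed(page_html)
    if parser.src is None:
        return None
    return urljoin(page_url, parser.src)


PAGE_URL = "https://docs.microsoft.com/en-us/windows/release-information/"
# Hypothetical outer page: the iframe path is invented for this example.
OUTER_HTML = (
    '<html><body><iframe id="winrelinfo_iframe" '
    'src="/en-us/example/frame.html"></iframe></body></html>'
)

resolved = iframe_url(PAGE_URL, OUTER_HTML)
print(resolved)
```

Note that the outer HTML contains an `<iframe>` tag but no `<table>`, which is why searching the original page for the table comes back empty. With BeautifulSoup the equivalent would be `urljoin(url, soup.select_one('iframe')['src'])`.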
    

    【Comments】:

    • This also solved the problem. Thank you for the help, much appreciated.