【Posted】: 2021-02-04 23:56:13
【Question】:
I'm trying to get an Excel table from this website so I can load it as a DataFrame. Here is what I have:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://tedb.ornl.gov/data/'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# Select all 'a' elements with href attributes containing URLs starting with https://
for link in soup.select('a[href^="https://"]'):
    href = link.get('href')
    print(href)
I want Table 4.01. When I inspect the page, its link sits in this HTML element:
<a href="https://tedb.ornl.gov/wp-content/uploads/2020/06/Table4_01_06242020.xlsx">xlsx</a>
However, when I run my code, the only links I get are:
https://www.ornl.gov
https://tedb.ornl.gov/
https://tedb.ornl.gov/data/
https://tedb.ornl.gov/archive/
https://tedb.ornl.gov/citation/
https://tedb.ornl.gov/contact/
https://tedb.ornl.gov/wp-content/uploads/2020/02/TEDB_Ed_38.pdf
https://tedb.ornl.gov/wp-content/uploads/2020/08/TEDB_38.2_Spreadsheets_08312020.zip
https://tedb.ornl.gov/wp-content/uploads/2020/08/Updates_08312020.pdf
https://www.ornl.gov/ornl/contact-us/Security--Privacy-Notice
https://www.ornl.gov/content/accessibility
https://www.ornl.gov/content/notice-nondiscrimination-and-accessibility-requirements
Does anyone know why the Excel link I'm looking for doesn't show up?
【Comments】:
-
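The table isn't in the page source; it is loaded with JavaScript after the page is fetched, so urlopen never sees it. You need a headless browser to get it.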
-
You can find the XHR that serves all the links for the page at
https://tedb.ornl.gov/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=3374&target_action=get-all-data&default_sorting=manual_sort
It returns JSON.
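A rough sketch of how that JSON endpoint could be used instead of a headless browser. The layout of the response isn't shown in this thread, so the code below assumes nothing about its keys: it walks every string in the decoded JSON and pulls out anything that looks like an .xlsx URL. The names AJAX_URL, XLSX_RE, iter_strings, find_xlsx_links and fetch_xlsx_links are made up for this illustration.

```python
import json
import re
from urllib.request import urlopen

# Endpoint quoted in the comment above. The shape of its JSON response is an
# assumption, so we search all strings rather than rely on specific keys.
AJAX_URL = (
    "https://tedb.ornl.gov/wp-admin/admin-ajax.php"
    "?action=wp_ajax_ninja_tables_public_action&table_id=3374"
    "&target_action=get-all-data&default_sorting=manual_sort"
)

# Matches an http(s) URL ending in .xlsx, whether bare or inside an <a href="...">.
XLSX_RE = re.compile(r'https?://[^\s"\'<>\\]+\.xlsx')


def iter_strings(node):
    """Yield every string found anywhere in a decoded JSON value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_xlsx_links(payload):
    """Return the unique .xlsx URLs in the payload, in first-seen order."""
    links = []
    for text in iter_strings(payload):
        for url in XLSX_RE.findall(text):
            if url not in links:
                links.append(url)
    return links


def fetch_xlsx_links(url=AJAX_URL):
    """Download the JSON from the endpoint and extract the .xlsx URLs."""
    with urlopen(url) as response:
        payload = json.load(response)
    return find_xlsx_links(payload)
```

If this works against the live site, you could pick the link containing "Table4_01" from fetch_xlsx_links() and pass it to pd.read_excel(link), which accepts a URL (reading .xlsx needs openpyxl installed). The server may also refuse requests without a browser-like User-Agent header; I haven't checked that.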
Tags: python html excel beautifulsoup urllib