I need help extracting an embedded .xlsx link from a webpage using Python/BeautifulSoup

【Question title】: I need help extracting an embedded .xlsx link from a webpage using Python/BeautifulSoup
【Posted】: 2021-02-04 23:56:13
【Question description】:

I am trying to access an Excel sheet from this website so that I can load it as a DataFrame. Here is what I have:

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://tedb.ornl.gov/data/'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# Select all 'a' elements with href attributes containing URLs starting with https://
for link in soup.select('a[href^="https://"]'):
    href = link.get('href')
    print(href)

I want to get Table 4.01. When I inspect the page, its link is contained in this HTML element:

<a href="https://tedb.ornl.gov/wp-content/uploads/2020/06/Table4_01_06242020.xlsx">xlsx</a>

However, when I run my code, the only links I get are the following:

https://www.ornl.gov
https://tedb.ornl.gov/
https://tedb.ornl.gov/data/
https://tedb.ornl.gov/archive/
https://tedb.ornl.gov/citation/
https://tedb.ornl.gov/contact/
https://tedb.ornl.gov/wp-content/uploads/2020/02/TEDB_Ed_38.pdf
https://tedb.ornl.gov/wp-content/uploads/2020/08/TEDB_38.2_Spreadsheets_08312020.zip
https://tedb.ornl.gov/wp-content/uploads/2020/08/Updates_08312020.pdf
https://www.ornl.gov/ornl/contact-us/Security--Privacy-Notice
https://www.ornl.gov/content/accessibility
https://www.ornl.gov/content/notice-nondiscrimination-and-accessibility-requirements

Does anyone know why the Excel link I am looking for does not show up?

【Comments】:

The table is not in the page source; it is loaded with JavaScript. You need a headless browser to get it.
You can find the XHR that serves the data with all the links at https://tedb.ornl.gov/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=3374&target_action=get-all-data&default_sorting=manual_sort - it returns JSON.
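
To illustrate the first comment, here is a minimal sketch of why a static parse misses links that JavaScript adds later. The HTML string, the `table-container` id, and the `loadTable` function are hypothetical stand-ins, not the real page's markup. BeautifulSoup only sees what is in the downloaded source, so the selector from the question finds the static link and nothing else.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the static HTML holds one ordinary link and an
# empty container. In a real browser the script would fill the container
# with table rows (and their .xlsx links) after the page loads.
static_html = """
<html><body>
  <a href="https://tedb.ornl.gov/archive/">Archive</a>
  <div id="table-container"></div>
  <script>
    // Hypothetical loader; BeautifulSoup never executes this.
    loadTable("table-container");
  </script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")

# Same selector as in the question: anchors whose href starts with https://
hrefs = [a["href"] for a in soup.select('a[href^="https://"]')]
xlsx_links = [h for h in hrefs if h.endswith(".xlsx")]

print(hrefs)       # only the link present in the static source
print(xlsx_links)  # empty, because the script was never run
```

This is why querying the JSON endpoint directly, or rendering the page in a headless browser, is needed to reach the spreadsheet links.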

Tags: python html excel beautifulsoup urllib

【Solution 1】:

The table is generated dynamically, but there is a backend URL you can query.

Here's how:

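import requests
from bs4 import BeautifulSoup

# Backend endpoint that returns the table rows as JSON
url = "https://tedb.ornl.gov/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=3374&target_action=get-all-data&default_sorting=manual_sort"

response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
rows = response.json()

# Each row's "excel" field is an HTML snippet; pull the href out of its <a> tag
for item in rows:
    print(BeautifulSoup(item["value"]["excel"], "html.parser").find("a")["href"])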

Output:

https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_01_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_02_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_03_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_04_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_01_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_02_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_03_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Table1_05_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_06_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_07_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_08_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_04_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_09_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_10_04302020.xlsx
and many more...
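
Since the question asks for Table 4.01 as a DataFrame, the following is a rough sketch of the remaining step: pick one link out of a list of hrefs like the one above and hand it to pandas. The helper names `find_table_link` and `load_table` are made up for illustration, and the file-name pattern (`Table<chapter>_<two-digit index>_<date>.xlsx`) is only inferred from the links shown in this thread. The sheet layout is unknown, so `pd.read_excel` may need extra arguments such as `skiprows`, and reading .xlsx files typically requires the openpyxl package.

```python
import re


def find_table_link(hrefs, table_number):
    """Return the first .xlsx link for a table number such as "4.01".

    Assumes file names look like Table4_01_<date>.xlsx, as in the
    links shown in this thread. Returns None if nothing matches.
    """
    chapter, _, index = table_number.partition(".")
    pattern = re.compile(
        rf"/Table{int(chapter)}_{int(index):02d}_[^/]*\.xlsx$"
    )
    for href in hrefs:
        if pattern.search(href):
            return href
    return None


def load_table(href):
    """Download the spreadsheet at href and return it as a DataFrame."""
    import pandas as pd  # imported here so the link helper works without pandas

    return pd.read_excel(href)


# Sample links taken from this thread
hrefs = [
    "https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_01_04302020.xlsx",
    "https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_01_08312020.xlsx",
    "https://tedb.ornl.gov/wp-content/uploads/2020/06/Table4_01_06242020.xlsx",
]

link = find_table_link(hrefs, "4.01")
print(link)
# df = load_table(link)  # needs network access, so left commented out
```

In practice, `hrefs` would be the list collected from the JSON response rather than the hard-coded sample.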

【Discussion】:

That worked, thanks! I'm curious: how did you identify the backend URL? Is it located somewhere in the existing HTML?