【Question title】: I need help extracting an embedded .xlsx link from a webpage using Python/BeautifulSoup
【Posted】: 2021-02-04 23:56:13
【Question description】:

I'm trying to access an Excel sheet from this website so that I can load it as a DataFrame. Here is what I have so far:

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://tedb.ornl.gov/data/'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')

# Select all 'a' elements with href attributes containing URLs starting with https://
for link in soup.select('a[href^="https://"]'):
    href = link.get('href')
    print(href)

I want to get Table 4.01. When I inspect the page, its link is contained in this HTML element:

<a href="https://tedb.ornl.gov/wp-content/uploads/2020/06/Table4_01_06242020.xlsx">xlsx</a>

However, when I run my code, all I get are the following links:

https://www.ornl.gov
https://tedb.ornl.gov/
https://tedb.ornl.gov/data/
https://tedb.ornl.gov/archive/
https://tedb.ornl.gov/citation/
https://tedb.ornl.gov/contact/
https://tedb.ornl.gov/wp-content/uploads/2020/02/TEDB_Ed_38.pdf
https://tedb.ornl.gov/wp-content/uploads/2020/08/TEDB_38.2_Spreadsheets_08312020.zip
https://tedb.ornl.gov/wp-content/uploads/2020/08/Updates_08312020.pdf
https://www.ornl.gov/ornl/contact-us/Security--Privacy-Notice
https://www.ornl.gov/content/accessibility
https://www.ornl.gov/content/notice-nondiscrimination-and-accessibility-requirements

Does anyone know why the Excel link I'm looking for doesn't show up?

【Comments】:

  • The table isn't in the page source; it is loaded with JavaScript. You need a headless browser to get it.
  • You can find the XHR that serves all of the links on the page at https://tedb.ornl.gov/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=3374&target_action=get-all-data&default_sorting=manual_sort (it returns JSON).
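
To make the structure of that XHR address easier to see, here is a minimal sketch that rebuilds it from its query parameters using only the standard library. The parameter names and values are copied from the URL in the comment above; the variable names `ENDPOINT`, `params`, and `ajax_url` are just illustrative.

```python
from urllib.parse import urlencode

# Base address of the AJAX handler, taken from the URL in the comment above.
ENDPOINT = "https://tedb.ornl.gov/wp-admin/admin-ajax.php"

# Query parameters exactly as they appear in that URL.
params = {
    "action": "wp_ajax_ninja_tables_public_action",
    "table_id": "3374",
    "target_action": "get-all-data",
    "default_sorting": "manual_sort",
}

# urlencode joins the pairs with a plain "&" and escapes values where needed.
ajax_url = f"{ENDPOINT}?{urlencode(params)}"
print(ajax_url)
```

Keeping the parameters in a dict makes it easy to try a different `table_id`, although whether other IDs exist on this site is not something the thread confirms.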

Tags: python html excel beautifulsoup urllib


【Solution 1】:

The table is generated dynamically, but there is a backend URL you can query.

Here's how:

import requests
from bs4 import BeautifulSoup

url = "https://tedb.ornl.gov/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=3374&target_action=get-all-data&default_sorting=manual_sort"

response = requests.get(url).json()

for item in response:
    print(BeautifulSoup(item["value"]["excel"], "html.parser").find("a")["href"])

Output:

https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_01_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_02_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_03_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_04_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_01_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_02_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_03_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Table1_05_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_06_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_07_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_08_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/08/Figure1_04_08312020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_09_04302020.xlsx
https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_10_04302020.xlsx
and many more...
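
Since the original goal was to load Table 4.01 as a DataFrame, a possible follow-up is sketched below. It assumes you have already collected the links into a list (for example with the loop above). The helpers `find_table_link` and `load_table` are hypothetical names, not part of any library. `load_table` needs pandas plus an Excel engine such as openpyxl, and the sheet may need extra options like `skiprows`, depending on its layout.

```python
from typing import Iterable, Optional


def find_table_link(links: Iterable[str], table_name: str) -> Optional[str]:
    """Return the first link whose file name starts with '<table_name>_', else None."""
    for link in links:
        filename = link.rsplit("/", 1)[-1]  # part after the last slash
        if filename.startswith(table_name + "_"):
            return link
    return None


def load_table(xlsx_url: str):
    """Download an .xlsx file and return it as a DataFrame (makes a network request)."""
    import pandas as pd  # imported here so find_table_link works without pandas
    return pd.read_excel(xlsx_url)


# Two links taken from this thread, standing in for the full list.
sample_links = [
    "https://tedb.ornl.gov/wp-content/uploads/2020/04/Table1_01_04302020.xlsx",
    "https://tedb.ornl.gov/wp-content/uploads/2020/06/Table4_01_06242020.xlsx",
]

table_4_01_url = find_table_link(sample_links, "Table4_01")
print(table_4_01_url)

# Uncomment to actually download the spreadsheet:
# df = load_table(table_4_01_url)
```

Matching on the file name prefix rather than the whole URL avoids depending on the upload date in the path, which differs between tables.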

【Discussion】:

  • That worked, thanks! I'm curious: how did you identify the backend URL? Is it located somewhere in the existing HTML?