Scraping onclick tables in Python

【Question title】: Scraping onclick tables in Python
【Posted】: 2021-04-26 08:44:57
【Question】:

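I am trying to scrape Steam's Hardware & Software Survey for December 2020 (the table at the bottom of the page). Clicking one of the parent entries (for example "OS Version") expands the table. My goal is to access the tables inside these parent entries.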

https://store.steampowered.com/hwsurvey#main_stats

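So far I have tried to retrieve this information with requests and BeautifulSoup (using different parsers), but BeautifulSoup always raises TypeError: 'NoneType' object is not callable. After searching for an API without success, I tried combining Selenium with pd.read_html(). With this approach I can at least access the y-axis labels of the chart above the table, but not the table I actually want below it: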

import pandas as pd
from selenium import webdriver

url = "https://store.steampowered.com/hwsurvey#main_stats"
opt = webdriver.FirefoxOptions()
opt.add_argument('-headless')
driver = webdriver.Firefox(options=opt)
driver.get(url)

pd.read_html(driver.page_source)

I would appreciate any suggestions that help me get past this problem.

【Comments】:

Tags: python pandas selenium-webdriver beautifulsoup

【Solution 1】:

You can try again, starting from this code:

import requests
from bs4 import BeautifulSoup

url = "https://store.steampowered.com/hwsurvey#main_stats"
# Send a browser-like User-Agent with the request
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}

# Download the survey page and parse the returned HTML
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, "html.parser")

# Cells of the left column (the row names)
names = soup.find_all("div", {"class": "stats_col_left"})
# Elements with the id osversion_val_1_on (not used further below)
os_version = soup.find_all("span", {"id": "osversion_val_1_on"})
# val = soup.find_all("div", {"class": "stats_col_mid"})

# Strip spaces and non-breaking spaces, then keep only non-empty names
list_names = []
for name in names:
    text = name.text.strip("\xa0 ")
    if text:
        list_names.append(text)
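
To get from the list of names to something table-like, the names can be paired with the values from the commented-out stats_col_mid query. The following is only a rough sketch: clean and pair_columns are hypothetical helpers, the sample strings are invented placeholders rather than survey data, and it assumes that both columns yield one entry per row in the same order, which has not been checked against the live page.

```python
def clean(text):
    """Strip ordinary whitespace and non-breaking spaces from an element's text."""
    return text.strip("\xa0 \n\t")


def pair_columns(left_texts, mid_texts):
    """Pair each non-empty name with the value found at the same position."""
    rows = []
    for label, value in zip(left_texts, mid_texts):
        label, value = clean(label), clean(value)
        if label:  # skip rows whose name is empty after cleaning
            rows.append((label, value))
    return rows


# Invented placeholder strings standing in for the .text of the scraped elements
sample_left = ["\xa0Example row A", "\xa0", " Example row B "]
sample_mid = ["10.00%", "", "20.00%"]

rows = pair_columns(sample_left, sample_mid)
print(rows)
```

With the code above, the inputs would be [n.text for n in names] and the .text of the elements returned by the stats_col_mid query. If the two lists turn out to have different lengths, zip silently stops at the shorter one, so it is worth comparing the lengths first.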

【Comments】:

With your help I was able to solve my problem. Thank you very much!