Web-scraped data only works sometimes

【Question title】: Web-scraped data only works sometimes
【Posted】: 2021-08-11 03:15:52
【Question】:

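I'm trying to scrape stock earnings data from a website. Outside market hours, the code works. During market hours, it fails most of the time with "list index out of range". I realize this happens because the site's HTML above the data I want changes, or is dropped to load other content. Is there anything I can do about it, or am I just at the mercy of whatever the website does?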

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
stocks = ['AAPL']
for stock in stocks:
    url = f'https://www.marketwatch.com/investing/stock/{stock}/analystestimates?mod=mw_quote_tab'
    res = requests.get(url, headers = headers)
    soup = BeautifulSoup(res.text, 'lxml')
    thisyear = soup.findAll('th', class_ = "table__cell")[8].text
    print(thisyear)

Thanks in advance.

【Question comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

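You are mostly at the mercy of the website. If possible, it is better to find an API that offers the same or similar data.

Without seeing the traceback, the IndexError most likely comes from [8], or more specifically from soup.findAll('th', class_ = "table__cell") returning a list with fewer than 9 items.

Before reading that value, you can assign items = soup.findAll(..) and check if len(items) >= 9, and/or call a different scraping method when the check fails. You can also wrap the call in a try/except block: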

def main():
    for stock in stocks:
        try2scrape(stock)

def try2scrape(stock):
    # scrape_data and scrape_data_another_way are placeholders for your own functions
    try:
        return scrape_data(stock)
    except IndexError:
        return scrape_data_another_way(stock)  # or just report the error

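Both suggestions above (the length check and the fallback) can be sketched without touching the network. This is only an illustration under a few assumptions: the helper names cell_text and first_result are made up for this example, the index is assumed to be non-negative, and each "scraper" is assumed to be a function that takes a ticker symbol and returns a string or None.

```python
from typing import Callable, Optional, Sequence


def cell_text(cells: Sequence, index: int) -> Optional[str]:
    """Return the stripped text of cells[index], or None if the list is too short.

    `cells` is expected to be what soup.findAll(...) returns: a list of
    objects exposing a `.text` attribute. `index` is assumed to be >= 0.
    """
    if len(cells) <= index:
        return None  # the page did not contain enough matching cells
    return cells[index].text.strip()


def first_result(
    stock: str,
    scrapers: Sequence[Callable[[str], Optional[str]]],
) -> Optional[str]:
    """Try each scraper in order and return the first non-None result.

    A scraper may signal failure by returning None or by raising
    IndexError, which is the error reported in the question.
    """
    for scrape in scrapers:
        try:
            value = scrape(stock)
        except IndexError:
            continue  # layout differed from what this scraper expects
        if value is not None:
            return value
    return None  # every strategy failed; the caller decides what to do
```

With the question's code, the first helper would be called as cell_text(soup.findAll('th', class_ = "table__cell"), 8), and a None result means the page did not have the expected layout on that request. Returning None instead of raising keeps the loop over stocks running, though you still need to decide whether to retry later, log the miss, or skip that ticker.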
【Comments】: