Why do I sometimes get old data and sometimes current data when parsing a website?

【Question title】: Why do I sometimes get old data and sometimes current data when parsing a website?
【Posted】: 2020-09-21 20:21:03
【Question description】:

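I am scraping a website and retrieving a table and a line containing a date. Everything works, but when I run my script I sometimes get the current data from the site and sometimes get yesterday's values.

When I open the site in a browser, the data is always up to date.

Here is part of my code; the full code is at: http://pythonfiddle.com/lme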

import requests
from bs4 import BeautifulSoup

url = 'https://www.lme.com/en-gb/metals/non-ferrous/#tabIndex=0'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
dateFromSite = soup.find('div', class_='delayed-date').text.strip()

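【Comments】:

I always get "Data valid for 2 June 2020". Is that correct?
It may be caching a version of the page that contains old data and fetching the updated numbers via AJAX. But that's anyone's guess...
Yes, but sometimes I get data valid for 1 June 2020.
I have added the full code to the description.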

Tags: python python-3.x web web-scraping beautifulsoup

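【Solution 1】:

Looking at the HTTP headers returned with the page, the site uses Cloudflare to cache requests. As a result, you sometimes get an "old" version of the page.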
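To see whether a particular response was served from the cache, you can inspect its headers. The following is a minimal sketch, not a definitive check: the helper name `describe_cache` is hypothetical, and the set of `CF-Cache-Status` values treated as "served from cache" is an assumption based on Cloudflare's commonly documented statuses.

```python
# Assumption: these CF-Cache-Status values mean the body came from a cached copy.
CACHED_STATUSES = {"HIT", "STALE", "UPDATING", "REVALIDATED"}


def describe_cache(headers):
    """Summarize the caching-related headers of an HTTP response.

    `headers` is any mapping of header names to values, for example
    `requests.get(url).headers`.
    """
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    status = normalized.get("cf-cache-status")
    return {
        "cf_cache_status": status,
        "age": normalized.get("age"),
        "served_from_cache": status is not None
        and status.upper() in CACHED_STATUSES,
    }
```

With requests, this could be called as `describe_cache(requests.get(url).headers)`; a result with `served_from_cache` set to `True` would suggest the page may be stale.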

You can try to work around this by sending the HTTP header 'Cache-Control: no-cache, must-revalidate' and/or by appending a random parameter to the URL.

For example:

import time

import requests
from bs4 import BeautifulSoup

# A unique query parameter makes each URL look new to the cache.
url = 'https://www.lme.com/en-gb/metals/non-ferrous/?_random_number={rn}#tabIndex=0'
# Ask caches along the way to revalidate instead of serving a stored copy.
headers = {'Cache-Control': 'no-cache, must-revalidate'}

r = requests.get(url.format(rn=time.time()), headers=headers)
# print(r.headers)  # the headers should include 'CF-Cache-Status': 'MISS'
soup = BeautifulSoup(r.text, 'html.parser')
dateFromSite = soup.find('div', class_='delayed-date').text.strip()

print(dateFromSite)
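If the URL already carries a query string or a fragment, a format placeholder can be awkward to maintain. As a sketch only, the random parameter can instead be appended with `urllib.parse`, which keeps any existing query parameters and the `#tabIndex=0` fragment intact. The helper name `add_cache_buster` and its `value` argument are hypothetical; the parameter name `_random_number` is taken from the example above.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url, param="_random_number", value=None):
    """Return `url` with an extra query parameter that changes on every call."""
    parts = urlsplit(url)
    # Keep any query parameters the URL already has.
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Default to the current time in nanoseconds; `value` exists mainly for testing.
    query.append((param, str(time.time_ns() if value is None else value)))
    # Rebuild the URL; the fragment stays after the query string.
    return urlunsplit(parts._replace(query=urlencode(query)))
```

It could then be used as `requests.get(add_cache_buster(url), headers=headers)`. Whether a changed query string actually bypasses the cache depends on how the site's cache is configured, so it is worth checking the response headers as well.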
