Scraping subreddit top posts of all time using requests returns the wrong result

【Question title】: Scraping subreddit top posts of all time using requests is returning the wrong result
【Posted】: 2020-09-25 14:51:45
【Question description】:

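I want to scrape a subreddit for its top posts of all time. I know there is a PRAW module that would probably work better, but for now I would prefer to scrape using requests.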

import requests

url = "https://www.reddit.com/r/shittysuperpowers/top/?t=all.html"
headers = {"User-agent": "bot_0.1"}
res = requests.get(url, headers=headers)

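res.status_code returns 200, so the request itself succeeds. On closer inspection of res.text, however, the scraped HTML does not come from the desired page. The content is actually today's top posts rather than the all-time ones, as if it had been fetched from https://www.reddit.com/r/shittysuperpowers/top/?t=day.html. Is there a reason I cannot scrape the all-time top posts? I have tried other subreddits as well, and they all show the same problem.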

【Comments】:

Tags: python web-scraping python-requests reddit

【Solution 1】:

Put the .json modifier in front of the query string to get the data in JSON format.

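import requests

# Appending .json to the listing path requests JSON instead of HTML;
# t=all selects the all-time window.
url = 'https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all'
resp = requests.get(url, headers={'User-agent': 'bot_0.1'})
if resp.ok:
    data = resp.json()  # parsed response as a Python dict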
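
Once the JSON has been parsed, the posts still have to be pulled out of it. The following is a minimal sketch, assuming the response follows Reddit's usual Listing layout, where posts sit under data["data"]["children"] and each child carries its fields in its own "data" dict. The helper name extract_posts and the sample dict are hypothetical and only serve as an illustration; the sample is not real Reddit output.

```python
def extract_posts(listing):
    """Return (title, score) pairs from a Reddit-style Listing dict.

    Assumes the layout {"data": {"children": [{"data": {...}}, ...]}}.
    Missing keys at the top level yield an empty list instead of an error.
    """
    children = listing.get("data", {}).get("children", [])
    return [(child["data"]["title"], child["data"]["score"]) for child in children]


# Hypothetical sample shaped like a Listing, used instead of a live request.
sample = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post", "score": 120}},
            {"kind": "t3", "data": {"title": "Second post", "score": 95}},
        ]
    },
}

for title, score in extract_posts(sample):
    print(score, title)
```

With the snippet above, extract_posts(data) could be called right after data = resp.json().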

The .json modifier also works in the browser.
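
One likely reason the URL in the question returned the wrong page is that the trailing .html became part of the query value, so the server received t=all.html rather than t=all. That Reddit then falls back to its default time window for an unrecognized value is an assumption, although it matches the behavior described in the question. The sketch below only inspects the two URLs with the standard library and sends no request; query_params is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def query_params(url):
    """Return the query string of a URL as a dict of value lists."""
    return parse_qs(urlsplit(url).query)


original = "https://www.reddit.com/r/shittysuperpowers/top/?t=all.html"
fixed = "https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all"

print(query_params(original))  # {'t': ['all.html']}
print(query_params(fixed))     # {'sort': ['top'], 't': ['all']}
```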

【Discussion】: