Scraping a redirected site using requests and BeautifulSoup: answers

【Question title】: Scraping a redirected site using requests and BeautifulSoup
【Posted】: 2017-03-26 14:11:06
【Question description】:

I am using requests and BeautifulSoup4 to scrape the NBA website.

from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.nba.com/games/20111225/BOSNYK/boxscore.html')
soup = BeautifulSoup(r.text, 'html.parser')

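When this URL is entered into a browser, it actually leads to "http://www.nba.com/games/20111225/BOSNYK/gameinfo.html#nbaGIboxscore", and I thought using requests was the right way to simulate this.

The problem is that I don't know the keyword for this kind of behavior, and I couldn't find a solution online.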

【Question discussion】:

Tags: python-2.7 web-scraping beautifulsoup python-requests

【Solution 1】:

You can use regex or bs4 to find the address the page redirects to, and then use requests to scrape that address.

For example:

import bs4
import requests

original_url = 'http://www.nba.com/games/20111225/BOSNYK/'
old_suffix = 'boxscore.html'
r = requests.get(original_url + old_suffix)
site_content = bs4.BeautifulSoup(r.text, 'lxml')
# The first <meta> tag is assumed to be the one that carries the redirect.
meta = site_content.find_all('meta')[0]
meta_content = meta.attrs.get('content')
# Skip the first 6 characters of the content attribute (the part before
# the target file name) to get the new suffix.
new_suffix = meta_content[6:]
new_url_to_scrape = original_url + new_suffix

Then scrape new_url_to_scrape. Enjoy!
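
The answer mentions regex as an alternative to bs4 but only shows the bs4 route. Below is a minimal sketch of the regex route. It assumes the redirect comes from a meta refresh tag whose content attribute looks like "0;url=gameinfo.html#nbaGIboxscore". The name meta_refresh_target and the sample_html string are hypothetical and made up for illustration, and the real NBA markup may differ. It also assumes Python 3 for the urljoin import.

```python
import re
from urllib.parse import urljoin  # Python 3; on Python 2.7 use `from urlparse import urljoin`

# Matches <meta http-equiv="refresh" content="<delay>;url=<target>">.
# Assumption: http-equiv comes before content and the content value is quoted.
META_REFRESH_RE = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']\s*\d+\s*;\s*url=([^"\'>]+)["\']',
    re.IGNORECASE,
)


def meta_refresh_target(html, base_url):
    """Return the absolute URL named by a meta refresh tag, or None if there is none."""
    match = META_REFRESH_RE.search(html)
    if match is None:
        return None
    # urljoin resolves a relative target such as 'gameinfo.html' against the page URL.
    return urljoin(base_url, match.group(1).strip())


# Hypothetical page body, invented for illustration only.
sample_html = (
    '<html><head>'
    '<meta http-equiv="refresh" content="0;url=gameinfo.html#nbaGIboxscore">'
    '</head></html>'
)
page_url = 'http://www.nba.com/games/20111225/BOSNYK/boxscore.html'
target = meta_refresh_target(sample_html, page_url)
```

With a live page you would pass r.text and r.url from requests.get(...) instead of the sample values, then call requests.get on the returned URL if it is not None. Compared with slicing off a fixed 6 characters, this avoids depending on the exact length of the prefix, but a regex over HTML is still fragile, so treat it as a starting point.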

【Discussion】:

Great, I get it now!