【Posted】:2015-02-27 17:45:10
【Question】:
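To scrape a particular website up to 180 days into the future, I first have to obtain an authentication token, which is required to fetch the JSON data I want to scrape. Partway through the scrape the token expires and the HTTP responses come back with status code 401 "Unauthorized". How do I get a new token into the scraper and continue scraping? Any help is appreciated.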
def start_requests(self):
    return [Request(url=AUTHORIZATION_URL, callback=self.request_ride_times)]

def request_ride_times(self, response):
    # parse json data
    data = json.loads(response.body)
    # get auth token
    auth = '{}'.format(data['access_token'])
    # set auth token in headers
    headers = {'Authorization': 'BEARER {}'.format(auth)}
    # note: this probably isn't really necessary but it doesn't hurt (all the sites times we are scraping are in EST)
    now = get_current_time_for_timezone("US/Eastern")
    # get ending timeframe for scraping dates - 190 days out
    until = now + SCRAPE_TIMEFRAME
    for filter_type in FILTER_TYPES:
        filter_url_query_attr = '&filters={}'.format(filter_type)
        scrape_date = now
        while scrape_date <= until:
            url = urljoin(SCRAPE_BASE_URL, '{}{}&date={}'.format(SCRAPE_BASE_URL_QUERY_STRING, filter_url_query_attr, scrape_date.strftime("%Y-%m-%d")))
            yield Request(url, headers=headers, callback=self.parse_ride_times, errback=self.error_handler)
            scrape_date += timedelta(days=1)

def parse_ride_times(self, response):
    # parse json data
    data = json.loads(response.body)
    for index, ride_details in enumerate(data['results']):
        if 'schedule' not in ride_details:
            continue
        ride_schedule = ride_details['schedule']
        # create item...
        yield item
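To make the desired behavior concrete, here is a minimal, framework-independent sketch of "refresh the token on 401 and retry the same URL". It is not Scrapy code and not a confirmed solution to the question: `scrape_with_token_refresh`, `get_token`, `fetch`, and `max_refreshes_per_url` are hypothetical names introduced only for illustration. In the spider, the same idea would presumably live in the errback (`error_handler`), which would request `AUTHORIZATION_URL` again and re-issue the failed request with the new header; that wiring is not shown here.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical fetch signature: fetch(url, headers) -> (status_code, body)
Fetch = Callable[[str, Dict[str, str]], Tuple[int, str]]


def scrape_with_token_refresh(
    urls: Iterable[str],
    get_token: Callable[[], str],
    fetch: Fetch,
    max_refreshes_per_url: int = 1,
) -> Iterator[Tuple[str, int, str]]:
    """Yield (url, status, body); obtain a new token whenever a 401 comes back."""
    token = get_token()
    for url in urls:
        refreshes = 0
        while True:
            # Same header format as in the question's spider
            headers = {'Authorization': 'BEARER {}'.format(token)}
            status, body = fetch(url, headers)
            if status != 401:
                # Any non-401 response is handed to the caller unchanged
                yield url, status, body
                break
            if refreshes >= max_refreshes_per_url:
                # Give up instead of looping forever on a persistent 401
                raise RuntimeError('still unauthorized after refresh: {}'.format(url))
            # Token presumably expired: fetch a new one and retry this URL
            token = get_token()
            refreshes += 1
```

The retry cap is a design choice in this sketch: without it, a 401 caused by something other than an expired token (for example, revoked credentials) would trigger an endless refresh loop.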
【Comments】:
-
I laughed when I saw "180 days into the future".
-
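The website publishes a schedule covering today through 180 days from today. I want to fetch the schedule data for each day. Does that make sense?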
-
I understand, I just found it funny. How do you authenticate initially?
-
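@PadraicCunningham - I authenticate initially in the start_requests function, which returns a Request to the website that issues the token. The callback on that request then handles the response. Example: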
def start_requests(self): return [Request(url=AUTHORIZATION_URL, callback=self.request_ride_times)]
Tags: python authentication scrapy scrapy-spider