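【Question title】: Scrapy - scraped website authentication token expires while scraping
【Posted】: 2015-02-27 17:45:10
【Question description】:

To scrape a particular website's data up to 180 days into the future, an authentication token must be obtained before the JSON data can be fetched. While scraping, the token expires and the HTTP response returns status code 401 "Unauthorized". How do I get a new token into the scraper and continue scraping? Any help is appreciated.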

def start_requests(self):
    return [Request(url=AUTHORIZATION_URL, callback=self.request_ride_times)]

def request_ride_times(self, response):
    # parse json data
    data = json.loads(response.body)

    # get auth token
    auth = '{}'.format(data['access_token'])

    # set auth token in headers
    headers = {'Authorization': 'BEARER {}'.format(auth)}

    # note: this probably isn't really necessary but it doesn't hurt (all the sites times we are scraping are in EST)
    now = get_current_time_for_timezone("US/Eastern")

    # get ending timeframe for scraping dates - 190 days out
    until = now + SCRAPE_TIMEFRAME

    for filter_type in FILTER_TYPES:
        filter_url_query_attr = '&filters={}'.format(filter_type)

        scrape_date = now

        while scrape_date <= until:
            url = urljoin(SCRAPE_BASE_URL, '{}{}&date={}'.format(SCRAPE_BASE_URL_QUERY_STRING, filter_url_query_attr, scrape_date.strftime("%Y-%m-%d")))
            yield Request(url, headers=headers, callback=self.parse_ride_times, errback=self.error_handler)

            scrape_date += timedelta(days=1)

def parse_ride_times(self, response):
    # parse json data
    data = json.loads(response.body)

    for index, ride_details in enumerate(data['results']):

        if 'schedule' not in ride_details:
            continue

        ride_schedule = ride_details['schedule']

        # create item...

        yield item

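【Comments】:

  • I laughed when I saw "180 days out".
  • The website has a schedule running from today to 180 days from today. I want to get the schedule data for each day. Does that make sense?
  • I get it, I just found it funny. How do you authenticate initially?
  • @PadraicCunningham - I authenticate initially in the start_requests function, which returns a Request to the website that issues the token. The callback on that request then processes the response. Example: def start_requests(self): return [Request(url=AUTHORIZATION_URL, callback=self.request_ride_times)]

Tags: python authentication scrapy scrapy-spider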


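【Solution 1】:

I was able to figure this out. I had to override the Request object so that the new authorization token is set in the header when the token expires. I made the token a global variable.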

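from scrapy import Request
from scrapy.http.headers import Headers

# override Request object in order to set new authorization token into the header when the token expires
authorization_token = None

class AuthTokenRequest(Request):
    @property
    def headers(self):
        # read the global token on every access, so a refreshed token is picked up
        global authorization_token
        return Headers({'Authorization': 'BEARER {}'.format(authorization_token)}, encoding=self.encoding)

    @headers.setter
    def headers(self, value):
        # ignore assignments so the computed header is always the one used
        pass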
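
Why the override helps can be shown without Scrapy: because headers is a property, the token is looked up each time the headers are read instead of once when the request is built. The sketch below is a simplified stand-in, not Scrapy's Request class; LazyAuthRequest, current_token and the example URL are hypothetical names used only for illustration.

```python
# Simplified stand-in for the AuthTokenRequest idea (not Scrapy's Request class).
current_token = None  # hypothetical module-level token, like authorization_token


class LazyAuthRequest(object):
    def __init__(self, url):
        self.url = url

    @property
    def headers(self):
        # Built on every access, so a refreshed token is picked up automatically.
        return {'Authorization': 'BEARER {}'.format(current_token)}

    @headers.setter
    def headers(self, value):
        # Ignore assignments so the computed header cannot be overwritten.
        pass


current_token = 'first-token'
request = LazyAuthRequest('http://example.com/schedule')
before = request.headers['Authorization']

current_token = 'second-token'  # simulate a token refresh
after = request.headers['Authorization']
```

The same request object reports the first token before the refresh and the second one after it, which is what lets a failed request be re-submitted unchanged.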

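The overridden request is then used for the requests in the while loop, together with an errback function, error_handler, which is called when a request fails. error_handler gets a new token, resets the global token variable, and then resubmits the request with the new token. On that same request the dont_filter parameter is set to True, so that the failed request is not dropped as a duplicate and can be processed again.

Two more functions were created. handle_auth sets the token in the global variable for the first time. The other, start_first_run, calls handle_auth and returns the result of request_ride_times. It is the callback of the request made in start_requests.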

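def error_handler(self, failure):
    global authorization_token
    # only HTTP errors carry a response; other failures are ignored here
    response = getattr(failure.value, 'response', None)
    if response is not None and response.status == 401:
        # token expired: fetch a new one (blocking call through the requests library)
        form_data = {'grant_type': 'assertion', 'assertion_type': 'public', 'client_id': 'WDPRO-MOBILE.CLIENT-PROD'}
        auth_site_request = requests.post(url=AUTHORIZATION_URL, data=form_data)
        auth_site_response = json.loads(auth_site_request.text)
        # store the new token in the global that AuthTokenRequest reads
        authorization_token = '{}'.format(auth_site_response['access_token'])

        # re-submit the failed request; its headers now carry the new token
        yield failure.request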

def start_requests(self):
    form_data = {'grant_type': 'assertion', 'assertion_type': 'public', 'client_id': 'WDPRO-MOBILE.CLIENT-PROD'}
    return [FormRequest(url=AUTHORIZATION_URL, formdata=form_data,
                        callback=self.start_first_run)]

def start_first_run(self, response):
    self.handle_auth(response)
    return self.request_ride_times()

def handle_auth(self, response):
    global authorization_token

    data = json.loads(response.body)

    # get auth token
    authorization_token = '{}'.format(data['access_token'])

def request_ride_times(self):
    # note: this probably isn't really necessary but it doesn't hurt (all the sites we are scraping are in EST)
    now = get_current_time_for_timezone("US/Eastern")

    # get ending timeframe for scraping dates - 190 days out
    until = now + SCRAPE_TIMEFRAME

    for filter_type in FILTER_TYPES:
        filter_url_query_attr = '&filters={}'.format(filter_type)

        scrape_date = now

        while scrape_date <= until:
            url = urljoin(SCRAPE_BASE_URL,
                          '{}{}&date={}'.format(SCRAPE_BASE_URL_QUERY_STRING,
                                                filter_url_query_attr, scrape_date.strftime("%Y-%m-%d")))
            yield AuthTokenRequest(url, callback=self.parse_ride_times, errback=self.error_handler, dont_filter=True,
                                meta={"scrape_date": scrape_date})

            scrape_date += timedelta(days=1)

def parse_ride_times(self, response):
    # parse json data
    data = json.loads(response.body)
    # process data...
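
The overall flow (authenticate once, scrape, refresh on 401, re-submit the failed request) can be tried out without Scrapy or a network. The following is only a simulation under stated assumptions: FakeAuthServer, crawl and the day-N URLs are hypothetical, and the token here expires after a fixed number of uses instead of after a period of time.

```python
# Scrapy-free simulation of the refresh-on-401 flow; every name here is hypothetical.
class FakeAuthServer(object):
    """Issues tokens that stay valid for a fixed number of data requests."""

    def __init__(self, uses_per_token):
        self.uses_per_token = uses_per_token
        self.issued = 0
        self.valid_token = None
        self.uses_left = 0

    def issue_token(self):
        # Stand-in for the POST to the authorization URL.
        self.issued += 1
        self.valid_token = 'token-{}'.format(self.issued)
        self.uses_left = self.uses_per_token
        return self.valid_token

    def get(self, url, token):
        # 401 when the token is unknown or used up, 200 otherwise.
        if token != self.valid_token or self.uses_left == 0:
            return 401
        self.uses_left -= 1
        return 200


def crawl(server, urls):
    token = server.issue_token()          # initial authentication
    pending = list(urls)
    scraped = []
    while pending:
        url = pending.pop(0)
        if server.get(url, token) == 401:
            token = server.issue_token()  # refresh, as error_handler does
            pending.append(url)           # re-submit the failed request
        else:
            scraped.append(url)
    return scraped


server = FakeAuthServer(uses_per_token=2)
urls = ['day-{}'.format(n) for n in range(5)]
scraped = crawl(server, urls)
```

Every URL ends up scraped even though the token expires twice along the way; the re-submitted URL is simply processed later than the others, which mirrors why dont_filter=True is needed for the retry in Scrapy.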

【Comments】:
