Dealing with timeout exceptions and time.sleep in Urllib2 + pool.map: answers

【Question title】: Dealing with timeout exceptions and time.sleep in Urllib2 + pool.map
【Posted】: 2023-03-10 20:22:01
【Question description】:

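I am new to Python and have written some code to download data from a Web API. While using the API, I have to respect a few limits:

1 request per second per API key
If a timeout occurs, wait 30 seconds before trying again
At most 100k requests per day per API key

The code of the method that makes the request to the Web API is: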

def getMatchDetails(self,match_id):
    '''Calls the WEB Api and requests the data for the match with
    a specific id (in match_id). Then returns the data already decoded 
    from json.'''
    import urllib2
    import json
    import time
    url = self.__makeUrl__(api_key= self.api_key, parameters = ['match_id='+str(match_id)])
    # Sometimes a time out occurs, we keep trying
    while True:
        try:
            start = time.time()
            json_obj = urllib2.urlopen(url)
            end = time.time()
            if end - start < 1:
                time.sleep(1 - (end - start))
        except:
            print('Timed Out, Trying again in 30 seconds')
            time.sleep(30)
            continue
        else:
            break
    detailed_data = json.load(json_obj)
    return detailed_data

The __makeUrl__ method simply concatenates some strings and returns the result. To switch API keys on every call to the method above, I use:

def getMatchDetailsForMap(self,match_id):
    self.counter += 1
    self.api_key = self.api_keys[self.counter%len(self.api_keys)]
    return self.getMatchDetails(match_id)

where self.api_keys is a list containing all my API keys. I then use the getMatchDetailsForMap method with the map function in the following code:

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(14)
ids_to_get = self.__idsToGetChunks__(14)
for chunk in ids_to_get:
    results = pool.map(self.getMatchDetailsForMap, chunk)

The __idsToGetChunks__ method returns lists (chunks) of arguments (match_id values) that are fed to the getMatchDetailsForMap method.

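Questions:

From experimenting with the code, I realized that the limit of 1 request per second per key is not being respected; why is that?
When a timeout occurs, it really slows down the whole download; is there a better way to handle this kind of exception when using map? (Hints welcome.)

Thanks for reading and for your help! Sorry for the long post.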

【Question discussion】:

Tags: python-2.7 multiprocessing urllib2

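【Solution 1】:

To comply with all three requirements, I suggest writing a simple for loop that performs one request per iteration. Normally, wait one second after each request. If a timeout occurs, wait 30 seconds instead. Don't loop more than 100k times. (I'm assuming the script runs once a day and takes less than 24 hours ;) )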

The main program starts one Process per API key.
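A likely reason the per-key limit did not hold in the question is that 14 threads rotate through one shared list of keys (and one shared self.api_key attribute), so two threads can use the same key within the same second while each thread only paces itself. Giving every key its own worker avoids that. As a minimal sketch, assuming the keys are unique and the work can be divided up front, the ids could be split round-robin so that each per-key worker gets its own share. split_ids_by_key is a hypothetical helper, not part of the original answer:

```python
def split_ids_by_key(match_ids, api_keys):
    """Assign every match id to exactly one API key, round-robin.

    Assumes api_keys contains no duplicates (they become dict keys).
    """
    n = len(api_keys)
    return {key: match_ids[i::n] for i, key in enumerate(api_keys)}


if __name__ == "__main__":
    # Made-up keys and ids, purely for illustration.
    shares = split_ids_by_key(list(range(10)), ["key-a", "key-b", "key-c"])
    for key, ids in sorted(shares.items()):
        print(key, ids)
```

Each (key, ids) pair could then be handed to one worker, so no two workers ever send requests with the same key.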

Simple!

Source

# 1 request per second per API key
# If a timeout occurs, wait 30 seconds before trying again
# Limit of 100k requests per day per API key

import logging, time, urllib2
import multiprocessing as mp

def do_fetch(key, timeout):
    return urllib2.urlopen(
        'http://example.com', timeout=timeout
    ).read()

def get_data(api_key):
    logger = mp.get_logger()
    data = None
    # Limit of 100k requests per day per API key
    for num in range(100*1000):
        # timeout=0 on the second request forces a failure, to exercise the except branch
        t = 1 if num != 1 else 0
        try:
            data = do_fetch(api_key, timeout=t)
            logger.info('%d bytes', len(data))
        except urllib2.URLError as exc:
            logger.error('exc: %s', repr(exc))
            # If a timeout occurs, wait 30 seconds before trying again
            time.sleep(30)
        else:
            # "1 request per second per API key"
            time.sleep(1)


if __name__ == '__main__':
    mp.log_to_stderr(level=logging.INFO)
    keys = [123, 234]
    # one worker process per API key
    pool = mp.Pool(len(keys))
    pool.map(get_data, keys)

Output

[INFO/PoolWorker-1] child process calling self.run()
[INFO/PoolWorker-2] child process calling self.run()
[INFO/PoolWorker-2] 1270 bytes
[INFO/PoolWorker-1] 1270 bytes
[ERROR/PoolWorker-2] exc: URLError(error(115, 'Operation now in progress'),)
[ERROR/PoolWorker-1] exc: URLError(error(115, 'Operation now in progress'),)
[INFO/PoolWorker-2] 1270 bytes
[INFO/PoolWorker-1] 1270 bytes
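
The program above fetches a fixed URL and does not retry any particular request. As a rough sketch of how the same policy (pause after success, longer wait after a timeout, daily cap) might be applied to a list of match ids, the loop below retries the same id after a timeout. The names fetch_all and FetchTimeout, and the injected fetch and sleep callables, are assumptions made for illustration; a real fetch function would wrap urllib2.urlopen and translate its timeout errors into FetchTimeout. The sketch is written to run on Python 3 as well.

```python
import time


class FetchTimeout(Exception):
    """Hypothetical stand-in for the timeout error a real fetch would raise."""


def fetch_all(match_ids, fetch, sleep=time.sleep,
              pause=1.0, backoff=30.0, daily_limit=100 * 1000):
    """Fetch each id in order, retrying an id after a timeout.

    Returns (results, requests_made). Stops early once daily_limit
    requests have been sent, whether they succeeded or not.
    """
    results = {}
    requests_made = 0
    i = 0
    while i < len(match_ids) and requests_made < daily_limit:
        match_id = match_ids[i]
        requests_made += 1
        try:
            results[match_id] = fetch(match_id)
        except FetchTimeout:
            sleep(backoff)  # timed out: wait, then retry the same id
        else:
            i += 1          # success: move on to the next id
            sleep(pause)    # keep to roughly one request per second
    return results, requests_made
```

Passing sleep as a parameter keeps the waiting policy testable without real delays; one such loop per API key, each in its own process as in the answer, keeps the keys from interfering with one another.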

【Discussion】: