Locks needed for multithreaded Python scraping? Answers

【Question title】: Locks needed for multithreaded Python scraping?
【Posted】: 2017-04-08 05:50:42
【Question description】:

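I have a list of zip codes for which I want to pull business listings using the Yelp Fusion API. Each zip code requires at least one API call (often more), so I want to keep track of my API usage, since the daily limit is 25000. I have defined each zip code as an instance of a user-defined Locale class. This Locale class has a class variable, Locale.pulls, which acts as a global counter for the number of pulls.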

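I want to multithread this using the multiprocessing module, but I am not sure whether I need to use locks and, if so, how I would do that. The concern is race conditions: I need to ensure that each thread sees the current number of pulls, defined as the Locale.pulls class variable in the pseudocode below.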

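import multiprocessing.dummy as mt  # thread-based Pool with the multiprocessing API


class Locale:
    pulls = 0  # class-level counter shared by all instances (and threads)
    MAX_PULLS = 20000

    def __init__(self, x, y):
        # initialize the instance with arguments needed to complete the API call
        self.x = x
        self.y = y

    def pull(self):
        # the check and the increment below are separate steps (not atomic)
        if Locale.pulls >= Locale.MAX_PULLS:
            return None
        # make the request, store the returned data and increment the counter
        self.data = self.call_yelp()  # call_yelp() is not shown in the question
        Locale.pulls += 1


def main():
    # zipcodes below is a list of arguments needed to initialize each zipcode as a Locale class object
    # (assumed here to be (x, y) tuples, hence starmap)
    pool = mt.Pool(max(1, len(zipcodes) // 100))  # let each thread work on 100 zipcodes
    data = pool.starmap(Locale, zipcodes)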

【Question discussion】:

Tags: multithreading python-3.x

【Solution 1】:

A simple solution is to check len(zipcodes) < MAX_PULLS before running map().
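
Since the question notes that a zip code can need more than one call, the length check alone may not bound the total. As a complement, here is a minimal sketch of a lock-guarded counter using threading.Lock from the standard library. PullCounter, fake_call, pull and run are hypothetical names introduced for illustration, and fake_call merely stands in for the real Yelp request, which is not shown in the question.

```python
import threading
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing.Pool API

MAX_PULLS = 20000  # budget taken from the question


class PullCounter:
    """Thread-safe counter that refuses to go past a fixed limit (hypothetical helper)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # The check and the increment happen under one lock, so two threads
        # can never both claim the last remaining pull.
        with self._lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True


def fake_call(zipcode):
    # Stand-in for the real API request (assumption: not the Yelp client).
    return {"zip": zipcode}


def pull(args):
    counter, zipcode = args
    if not counter.try_acquire():
        return None  # budget exhausted, skip the request
    return fake_call(zipcode)


def run(zipcodes, limit=MAX_PULLS, workers=4):
    counter = PullCounter(limit)
    with Pool(workers) as pool:
        results = pool.map(pull, [(counter, z) for z in zipcodes])
    return results, counter.count
```

In this sketch a pull is reserved before the request is made, so a request that fails would still use up one unit of the budget; whether that matches how the API counts calls is an assumption to verify.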

【Discussion】: