【Question title】: Invoke a Scrapy spider with a REST API
【Posted】: 2017-10-29 20:05:26
【Question】:

I need some help.

I wrote a spider whose only job is to yield URLs and descriptions from search engine result pages.

My spider code is:

def __init__(self, keyword, se='bing', pages=50, *args, **kwargs):
    super(KeywordSpider, self).__init__(*args, **kwargs)
    self.keyword = keyword.lower()
    self.searchEngine = se.lower()
    # XPath selectors for the chosen search engine
    self.selector = SearchEngineResultSelectorsURL[self.searchEngine]
    self.image_selector = SearchEngineResultSelectorsIMAGES[self.searchEngine]
    # One start URL per result page
    pageUrls = searResultPages(keyword, se, int(pages))
    for url in pageUrls:
        print(url)
        self.start_urls.append(url)

def parse(self, response):
    # Yield one item per search result
    for url in response.xpath(self.selector):
        yield {'url': ''.join(url.xpath('h2/a/@href').extract()).strip(),
               'title': ''.join(url.xpath('h2/a//text()').extract()).strip(),
               }

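What I need to do now is:

  1. Invoke this spider through a REST interface that accepts a keyword and a search engine
  2. Return the response in JSON format

Example: I need to run a server that exposes a REST API, and then request:

localhost:5000/search?keyword={0}&search_engine={1}

The server needs to invoke the spider and crawl with it. When the results are in, it needs to send them back in JSON format.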
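To make the intended contract concrete, here is a minimal standard-library sketch of the request URL and the JSON body. The helper names `build_search_url` and `encode_results` are hypothetical and only illustrate the shape of the exchange; the item keys `url` and `title` are the ones the spider above yields.

```python
import json
from urllib.parse import urlencode


def build_search_url(keyword, search_engine,
                     base="http://localhost:5000/search"):
    """Build the GET URL for the search endpoint (hypothetical helper)."""
    # urlencode escapes spaces and other special characters in the values.
    query = urlencode({"keyword": keyword, "search_engine": search_engine})
    return f"{base}?{query}"


def encode_results(items):
    """Serialize scraped items as a UTF-8 JSON array (hypothetical helper)."""
    return json.dumps(list(items)).encode("utf-8")


if __name__ == "__main__":
    print(build_search_url("scrapy rest", "bing"))
    print(encode_results([{"url": "http://example.com", "title": "Example"}]))
```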

What I did:

import json
import math
import os

from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor
from twisted.web import resource, server


class Search(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        args = request.args

        # All three query parameters are required
        if b'keyword' not in args:
            request.setResponseCode(400)
            return bytes('no keyword param', 'utf-8')
        if b'search_engine' not in args:
            request.setResponseCode(400)
            return bytes('no search_engine param', 'utf-8')
        if b'num_of_results' not in args:
            request.setResponseCode(400)
            return bytes('no num_of_results param', 'utf-8')

        keyword, search_engine, num_of_results = self.decode_values_from_dict(args)
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'FEED_FORMAT': 'json',
            'FEED_URI': 'search_results.json',
            'ITEM_PIPELINES': {'BingCrawler.main.MyPipeline': 1}
        })
        # Remove the output of any previous crawl
        if os.path.isfile(FILE_NAME):
            os.remove(FILE_NAME)

        results_per_page = NUMBER_OF_RESULTS_PER_SEARCH_ENGINE[search_engine.upper()]
        process.crawl(KeywordSpider, keyword=keyword, se=search_engine,
                      pages=int(math.ceil(int(num_of_results) / results_per_page)))

        # Blocks until the crawl finishes
        process.start()
        with open(FILE_NAME, 'r') as f:
            json_obj_items = json.load(f)
        result_items = [item for item in json_obj_items if 'images_url' not in item]
        image_item = [item for item in json_obj_items if 'images_url' in item]
        result_item_requested_amount = result_items[0:int(num_of_results)]
        result_item_requested_amount.extend(image_item)
        return json.dumps(result_item_requested_amount).encode('utf-8')

    def decode_values_from_dict(self, args):
        return (args[b'keyword'][0].decode('utf-8'),
                args[b'search_engine'][0].decode('utf-8'),
                args[b'num_of_results'][0].decode('utf-8'))


root = Search()
factory = server.Site(root)
reactor.listenTCP(8080, factory)
reactor.run()

But I get a ReactorAlreadyRunning exception.

I need a non-blocking, asynchronous API. I tried to make this work with Twisted, without success.

Please help... thanks! (:

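【Comments】:

  • Welcome to Stack Overflow. Please review these guidelines to help you write your first post. As it stands, your question is too broad; you need to ask a more specific question.

Tags: python scrapy twisted tornado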


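【Solution 1】:

If you are going to use the spider inside a Twisted application, use CrawlerRunner instead of CrawlerProcess. CrawlerProcess starts the reactor itself, which fails when the web server is already running it; CrawlerRunner leaves the reactor to the application. That should fix your ReactorAlreadyRunning problem.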

【Comments】:
