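【Posted】: 2017-10-29 20:05:26
【Question】:
Need help.
I wrote a spider whose only job is to yield the URL and description of each result from a search engine site.
My spider code is: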
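import scrapy
from scrapy.selector import Selector

# The class header and imports are not shown in the original post; scrapy.Spider is assumed.
# SearchEngineResultSelectorsURL, SearchEngineResultSelectorsIMAGES and searResultPages
# are the poster's own helpers and are not shown either.
class KeywordSpider(scrapy.Spider):
    def __init__(self, keyword, se='bing', pages=50, *args, **kwargs):
        super(KeywordSpider, self).__init__(*args, **kwargs)
        self.keyword = keyword.lower()
        self.searchEngine = se.lower()
        self.selector = SearchEngineResultSelectorsURL[self.searchEngine]
        self.image_selector = SearchEngineResultSelectorsIMAGES[self.searchEngine]
        # One start URL per search-result page
        pageUrls = searResultPages(keyword, se, int(pages))
        for url in pageUrls:
            print(url)
            self.start_urls.append(url)

    def parse(self, response):
        images_dict = dict()
        images_dict['images'] = []
        # Each node matched by the selector is one search result
        for url in Selector(response).xpath(self.selector):
            yield {'url': ''.join(url.xpath('h2/a/@href').extract()).strip(),
                   'title': ''.join(url.xpath('h2/a//text()').extract()).strip(),
                   }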
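What I need to do now is:
- call this spider through a REST interface that accepts a keyword and a search engine
- return the response in JSON format
Example: I need to run a server that exposes a REST API, and then request: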
localhost:5000/search?keyword={0}&search_engine={1}
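The server needs to invoke the spider and crawl with it; once the results are in, it needs to send them back to the caller in JSON format.
What I did is: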
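As a rough illustration of the request and response shape described above (not the poster's server code), the query string of the example URL can be parsed and a JSON body produced with the standard library alone. The function names `parse_search_query` and `to_json_body` are hypothetical; the item fields `url` and `title` are taken from the spider's `parse` method. Only the two parameters from the example URL are checked here.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Parameters that appear in the example URL of the post.
REQUIRED_PARAMS = ("keyword", "search_engine")


def parse_search_query(url):
    """Return the required query parameters of `url` as a dict of strings.

    Raises ValueError naming the first missing (or empty) parameter.
    """
    query = parse_qs(urlsplit(url).query)
    params = {}
    for name in REQUIRED_PARAMS:
        values = query.get(name)
        if not values:
            raise ValueError("no %s param" % name)
        # parse_qs returns a list per key; keep the first value only
        params[name] = values[0]
    return params


def to_json_body(items):
    """Serialize scraped items (dicts with 'url' and 'title') to UTF-8 JSON bytes."""
    return json.dumps(list(items)).encode("utf-8")


if __name__ == "__main__":
    params = parse_search_query(
        "http://localhost:5000/search?keyword=scrapy&search_engine=bing")
    print(params)
    print(to_json_body([{"url": "http://example.com", "title": "Example"}]))
```

Returning bytes rather than `str` matters for a Twisted resource on Python 3, where `render_GET` is expected to return a byte string.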
import json
import math
import os

from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor
from twisted.web import resource, server

# KeywordSpider, FILE_NAME and NUMBER_OF_RESULTS_PER_SEARCH_ENGINE are not shown in the post.

class Search(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        args = request.args
        added_images_url = False
        count_results = 0
        # All three query parameters are required; reply 400 if one is missing
        if b'keyword' not in args:
            request.setResponseCode(400)
            return b'no keyword param'
        if b'search_engine' not in args:
            request.setResponseCode(400)
            return b'no search_engine param'
        if b'num_of_results' not in args:
            request.setResponseCode(400)
            return b'no num_of_results param'
        keyword, search_engine, num_of_results = self.decode_values_from_dict(args)
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'FEED_FORMAT': 'json',
            'FEED_URI': 'search_results.json',
            'ITEM_PIPELINES': {'BingCrawler.main.MyPipeline': 1}
        })
        # Remove the feed file left over from a previous crawl
        if os.path.isfile(FILE_NAME):
            os.remove(FILE_NAME)
        pages = int(math.ceil(int(num_of_results) / NUMBER_OF_RESULTS_PER_SEARCH_ENGINE[search_engine.upper()]))
        process.crawl(KeywordSpider, keyword=keyword, se=search_engine, pages=pages)
        # start() runs the Twisted reactor and blocks until the crawl finishes
        process.start()
        with open(FILE_NAME, 'r') as f:
            json_obj_items = json.load(f)
        result_items = [item for item in json_obj_items if 'images_url' not in item]
        image_item = [item for item in json_obj_items if 'images_url' in item]
        result_item_requested_amount = result_items[0:int(num_of_results)]
        result_item_requested_amount.extend(image_item)
        # render_GET must return bytes on Python 3
        return json.dumps(result_item_requested_amount).encode('utf-8')

    def decode_values_from_dict(self, args):
        return (args[b'keyword'][0].decode('utf-8'),
                args[b'search_engine'][0].decode('utf-8'),
                args[b'num_of_results'][0].decode('utf-8'))
and
root = Search()
factory = server.Site(root)
reactor.listenTCP(8080, factory)
reactor.run()
But I get a ReactorAlreadyRunning exception.
I need a non-blocking, asynchronous API. I tried to work with Twisted, without success.
Please help... thanks! (:
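For reference, the page-count and result-trimming logic inside `render_GET` above can be isolated into two small pure functions, which makes it checkable without starting a crawl. This is only a sketch: `pages_needed`, `select_results`, and the `RESULTS_PER_PAGE` value of 10 for Bing are assumptions, since the post does not show `NUMBER_OF_RESULTS_PER_SEARCH_ENGINE`.

```python
# Hypothetical per-page result counts; the real values live in the post's
# NUMBER_OF_RESULTS_PER_SEARCH_ENGINE mapping, which is not shown.
RESULTS_PER_PAGE = {"BING": 10}


def pages_needed(num_of_results, search_engine):
    """Number of result pages to crawl to collect `num_of_results` items."""
    per_page = RESULTS_PER_PAGE[search_engine.upper()]
    # Integer ceiling division: no float rounding, and the same result on
    # Python 2, where `/` between ints would floor before math.ceil is applied.
    return -(-int(num_of_results) // per_page)


def select_results(items, num_of_results):
    """Keep the first `num_of_results` plain results, then append image items."""
    results = [item for item in items if "images_url" not in item]
    images = [item for item in items if "images_url" in item]
    return results[:int(num_of_results)] + images


if __name__ == "__main__":
    print(pages_needed("25", "bing"))
    print(select_results(
        [{"url": "a"}, {"images_url": ["x"]}, {"url": "b"}, {"url": "c"}], 2))
```

Both functions accept `num_of_results` as a string because the value arrives as a decoded query parameter in the handler.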
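【Discussion】:
- Welcome to Stack Overflow. Please review these guidelines to help you write your first post. As it stands, your question is too broad; you need to ask a more specific question.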
Tags: python scrapy twisted tornado