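【Posted】:2014-01-23 04:32:49
【Question】:
I am very new to Python and have been learning how to scrape web pages for one day. The task I want to accomplish is to loop through a list of 2000 companies and extract their revenue data and number of employees. I started with Scrapy, and I have managed to get the workflow working for one company (not elegant, but at least I am trying), but I cannot figure out how to load the list of companies and loop through it to run multiple searches. I have a feeling this is a fairly simple procedure.
So, my main question is: where in the spider class should I define the array of company queries to loop over? I do not know the exact URLs, since each company has a unique ID and belongs to a specific market, so I cannot enter them as start_urls.
Is Scrapy the right tool, or should I use mechanize for this type of task?
Here is my current code.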
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest
from scrapy.http import Request
from tutorial.items import DmozItem
import json


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["proff.se"]
    start_urls = ["http://www.proff.se"]

    # Search on the website, currently I have just put in a static search term here, but I would like to loop over a list of companies.
    def parse(self, response):
        return FormRequest.from_response(response, formdata={'q': 'rebtel'}, callback=self.search_result)

    # I fetch the url from the search result and convert it to correct Financial-url where the information is located.
    def search_result(self, response):
        sel = HtmlXPathSelector(response)
        link = sel.xpath('//ul[@class="company-list two-columns"]/li/a/@href').extract()
        finance_url = str(link[0]).replace("/foretag", "http://www.proff.se/nyckeltal")
        return Request(finance_url, callback=self.parse_finance)

    # I scraped the information of this particular company, this is hardcoded and will not
    # work for other responses. I had some issues with the encoding characters
    # initially since they were Swedish. I also tried to target the Json element direct by
    # revenue = sel.xpath('#//*[@id="accountTable1"]/tbody/tr[3]/@data-chart').extract()
    # but was not able to parse it (error - expected string or buffer - tried to convert it
    # to a string by str() with no luck, something off with the formatting, which is messing up the data types).
    def parse_finance(self, response):
        sel = HtmlXPathSelector(response)
        datachart = sel.xpath('//tr/@data-chart').extract()
        employees = json.loads(datachart[36])
        revenue = json.loads(datachart[0])
        items = []
        item = DmozItem()
        item['company'] = response.url.split("/")[-5]
        item['market'] = response.url.split("/")[-3]
        item['employees'] = employees
        item['revenue'] = revenue
        items.append(item)
        return item
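A note on the "expected string or buffer" error mentioned in the code comments: one likely cause is that `.extract()` returns a list of strings, while `json.loads` expects a single string. Calling `str()` on the whole list produces a Python list representation rather than valid JSON, so that does not help either. The sketch below illustrates this with a made-up payload; the `datachart` contents and the helper name `parse_data_chart` are assumptions for illustration, not the real page data.

```python
import json

# Hypothetical stand-in for sel.xpath('//tr/@data-chart').extract():
# a list with one attribute string per matching <tr>. The payload is invented.
datachart = ['{"label": "revenue", "values": [100, 120]}']


def parse_data_chart(extracted):
    """Decode the first extracted data-chart attribute, or return None if nothing matched."""
    if not extracted:
        return None
    # Index into the list first: json.loads needs a string, not a list.
    return json.loads(extracted[0])


revenue = parse_data_chart(datachart)
```

Checking for an empty result before indexing also avoids an `IndexError` on company pages where the XPath matches nothing.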
【Discussion】:
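One possible approach, offered as a minimal sketch rather than a definitive answer: keep the company names in a plain list (loaded from a file, for example) and have `parse` yield one search request per name instead of returning a single one. The helpers `load_companies` and `search_formdata`, the `self.companies` attribute, and the sample names are all hypothetical. The Scrapy part is shown only as a comment so that the snippet runs without Scrapy installed.

```python
import io


def load_companies(lines):
    """Return the non-empty, stripped company names from an iterable of lines."""
    return [line.strip() for line in lines if line.strip()]


def search_formdata(companies):
    """Yield one form payload per company for the search field 'q'."""
    for name in companies:
        yield {'q': name}


# Hypothetical input; in practice this could be an open text file with one name per line.
companies = load_companies(io.StringIO("rebtel\n\nspotify\n"))
payloads = list(search_formdata(companies))

# Inside the spider, parse() could then yield one request per payload
# (assuming the list was stored on the spider as self.companies):
#
#     def parse(self, response):
#         for formdata in search_formdata(self.companies):
#             yield FormRequest.from_response(
#                 response, formdata=formdata, callback=self.search_result)
```

Because a Scrapy callback may yield any number of requests, the single start URL is enough: the search form is loaded once, and each company becomes its own form submission that flows through `search_result` and `parse_finance` as before.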
Tags: python python-3.x web-scraping scrapy web-crawler