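【Question title】: Scrapy multiple search terms
【Posted】: 2014-01-23 04:32:49
【Question description】:

I am very new to Python and am learning how to scrape web pages (one day in). The task I want to accomplish is to loop over a list of 2000 companies and extract revenue data and the number of employees. I started out using Scrapy, and I have managed to get a workflow working for one company (not elegant, but at least I am trying), but I cannot figure out how to load a list of companies and loop through them to carry out multiple searches. I have a feeling this is a fairly simple procedure.

So, my main question is: where in the spider class should I define the array of company queries to loop over? I do not know the exact URLs, since each company has a unique ID and belongs to a specific market, so I cannot enter them as start_urls.
Is Scrapy the right tool, or should I use mechanize for this type of task?

Here is my current code.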

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest
from scrapy.http import Request
from tutorial.items import DmozItem
import json

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["proff.se"]
    start_urls = ["http://www.proff.se"]

# Search on the website, currently I have just put in a static search term here, but I would like to loop over a list of companies.

    def parse(self, response):
        return FormRequest.from_response(response, formdata={'q': 'rebtel'}, callback=self.search_result)

# I fetch the url from the search result and convert it to correct Financial-url where the information is located.

    def search_result(self,response):
        sel = HtmlXPathSelector(response)
        link = sel.xpath('//ul[@class="company-list two-columns"]/li/a/@href').extract()
        finance_url=str(link[0]).replace("/foretag","http://www.proff.se/nyckeltal")
        return Request(finance_url,callback=self.parse_finance)

# I scraped the information for this particular company; this is hardcoded and will not
# work for other responses. I had some issues with the character encoding
# initially since the characters were Swedish. I also tried to target the JSON element directly with
# revenue = sel.xpath('#//*[@id="accountTable1"]/tbody/tr[3]/@data-chart').extract()
# but was not able to parse it (error: "expected string or buffer"; I tried to convert it
# to a string with str() with no luck). Something is off with the formatting, which is messing up the data types.

    def parse_finance(self, response):
        sel = HtmlXPathSelector(response)
        datachart = sel.xpath('//tr/@data-chart').extract()
        employees=json.loads(datachart[36])
        revenue = json.loads(datachart[0])
        items = []
        item = DmozItem()
        item['company']=response.url.split("/")[-5]
        item['market']=response.url.split("/")[-3]
        item['employees']=employees
        item['revenue']=revenue
        items.append(item)
        return item

【Question discussion】:

    Tags: python python-3.x web-scraping scrapy web-crawler


    【Solution 1】:

    A common way to do this is with a command-line argument. Give the spider's __init__ method a parameter:

    class ProffSpider(BaseSpider):
        name = "proff"
        ...
    
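        def __init__(self, query=None, *args, **kwargs):
            # Let the base spider finish its own setup before storing the argument.
            super(ProffSpider, self).__init__(*args, **kwargs)
            self.query = query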
    
        def parse(self, response):
            return FormRequest.from_response(response,
                formdata={'q': self.query},
                callback=self.search_result
            )
    
        ...
    

    Then start your spiders (maybe with Scrapyd):

    $ scrapy crawl proff -a query="something"
    $ scrapy crawl proff -a query="something else"
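
When the search terms live in a file, another option is to build one search URL per term up front. The following is a minimal sketch using only the standard library. The search endpoint is taken from the asker's follow-up comment in the discussion and is an assumption about the site, not a documented API; `load_queries` and `build_search_urls` are hypothetical helper names.

```python
from urllib.parse import quote

# Search endpoint taken from the asker's follow-up comment; treat it as an
# assumption about the site rather than a documented API.
SEARCH_URL = "http://www.proff.se/bransch-s%C3%B6k?q="


def load_queries(lines):
    """Return the non-empty, stripped search terms from an iterable of lines."""
    return [line.strip() for line in lines if line.strip()]


def build_search_urls(queries):
    """Build one search URL per term, percent-encoding each term."""
    return [SEARCH_URL + quote(query) for query in queries]


if __name__ == "__main__":
    # In practice the lines would come from a file such as open("companies.txt").
    sample = ["Rebtel\n", "\n", "Some Company AB\n"]
    for url in build_search_urls(load_queries(sample)):
        print(url)
```

In a Scrapy spider, URLs built this way could be yielded as Request objects from start_requests instead of being hardcoded in start_urls.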
    

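    If you want to run a bunch of spiders at once by passing arguments from a file, you can create a new command that runs multiple instances of a spider. This just mixes the built-in crawl command with the example code for running multiple spiders with a single crawler: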

    your_project/settings.py

    COMMANDS_MODULE = 'your_project_module.commands'
    

    your_project/commands/__init__.py

    # empty file
    

    your_project/commands/crawl_many.py

    import os
    import csv
    
    from scrapy.commands import ScrapyCommand
    from scrapy.utils.python import without_none_values
    from scrapy.exceptions import UsageError
    
    
    class Command(ScrapyCommand):
        requires_project = True
    
        def syntax(self):
            return '[options]'
    
        def short_desc(self):
            return 'Run many instances of a spider'
    
        def add_options(self, parser):
            ScrapyCommand.add_options(self, parser)
    
            parser.add_option('-f', '--input-file', metavar='FILE', help='CSV file to load arguments from')
            parser.add_option('-o', '--output', metavar='FILE', help='dump scraped items into FILE (use - for stdout)')
            parser.add_option('-t', '--output-format', metavar='FORMAT', help='format to use for dumping items with -o')
    
        def process_options(self, args, opts):
            ScrapyCommand.process_options(self, args, opts)
    
            if not opts.output:
                return
    
            if opts.output == '-':
                self.settings.set('FEED_URI', 'stdout:', priority='cmdline')
            else:
                self.settings.set('FEED_URI', opts.output, priority='cmdline')
    
            feed_exporters = without_none_values(self.settings.getwithbase('FEED_EXPORTERS'))
            valid_output_formats = feed_exporters.keys()
    
            if not opts.output_format:
                opts.output_format = os.path.splitext(opts.output)[1].replace('.', '')
    
            if opts.output_format not in valid_output_formats:
                raise UsageError('Unrecognized output format "%s". Valid formats are: %s' % (opts.output_format, tuple(valid_output_formats)))
    
            self.settings.set('FEED_FORMAT', opts.output_format, priority='cmdline')
    
        def run(self, args, opts):
            if args:
                raise UsageError()
    
            with open(opts.input_file, newline='') as handle:  # text mode: csv.DictReader needs str rows on Python 3
                for spider_options in csv.DictReader(handle):
                    spider = spider_options.pop('spider')
                    self.crawler_process.crawl(spider, **spider_options)
    
            self.crawler_process.start()
    

    You can run it like so:

    $ scrapy crawl_many -f crawl_options.csv -o output_file.jsonl
    

    The format of the crawl options CSV is simple:

    spider,query,arg2,arg3
    proff,query1,value2,value3
    proff,query2,foo,bar
    proff,query3,baz,asd
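
For a long list of companies, the options file could be generated rather than written by hand. The sketch below writes one row per company and then reads the rows back the way the command's run method does, popping the spider column and treating the remaining columns as keyword arguments. The function names and the single query column are assumptions made for illustration.

```python
import csv
import io


def write_crawl_options(companies, spider="proff"):
    """Return CSV text with one row per company for the given spider name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["spider", "query"])
    for company in companies:
        writer.writerow([spider, company])
    return buffer.getvalue()


def read_crawl_options(csv_text):
    """Split each CSV row into (spider name, keyword arguments)."""
    jobs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        spider = row.pop("spider")
        jobs.append((spider, dict(row)))
    return jobs


if __name__ == "__main__":
    text = write_crawl_options(["Rebtel", "Some Company AB"])
    for spider, kwargs in read_crawl_options(text):
        print(spider, kwargs)
```

Using the csv module on both sides means company names that contain commas are quoted and parsed back correctly.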
    

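    【Discussion】:

    • But can this approach take a list of elements? scrapy crawl dmoz -a query="companies.txt" would be the list of companies. def __init__(self, query): companies = [line.strip() for line in open(query)] self.query = companies. Maybe I am not really following your suggestion.
    • I figured it out: the starting URL for a search query is actually proff.se/bransch-s%C3%B6k?q="Company name", so I can assemble a file with all the names and read it in as start_urls, or even do this in __init__ if I want to use different sets of files. Thanks for your answer.
    • @johndoe: Do what, specifically?
    • @johndoe: (b) is outside the scope of this question, but see my edit for the other parts.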
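    【Solution 2】:

    The first thing I would do is create the list of companies and find a way to get the URL of each one. After that, crawling is easy. I wrote a crawler that extracts disease information from Wikipedia given a list of diseases. See how it fits your use case.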

    import requests
    from bs4 import BeautifulSoup
    import sys
    import re
    import nltk
    from nltk.corpus import stopwords
    import pandas as pd
    from subprocess import Popen, check_call
    from multiprocessing import Pool
    #nltk.download()
    
    def crawlwiki(keywords):
        print (keywords)
        columns = ['Category', 'Text']
        page=1
        print ('Fetching for {}....'.format(keywords))
        url = 'https://en.wikipedia.org/wiki/'
        for i in range(len(keywords)):
            url = url + keywords[i]
            url = url + '%20'
    
        url = url[0:(len(url)-3)]   
        output_obj = {}
        #curr_page = url+str(page)
        while True:
            try:
                page_source = requests.get(url)
            except requests.RequestException:
                # Decide what to do if the internet connection fails; this sketch retries.
                continue
            break
    
        plain_text = page_source.text
        bs_obj = BeautifulSoup(plain_text, "lxml")
        '''toc_links = bs_obj.findAll('div', {'class': 'toc-links'})
        base_url = 'http://www.webmd.com'
        for div in toc_links:
            links = div.findAll('a')
            for a in links:
                output_obj[a.text] = base_url + a.get('href')
                print (base_url + a.get('href'))
        data = bs_obj.findAll('div', {'class':'search-text-container'})
        for div in data:
            links = div.findAll('a')
            for a in links:
                output_obj[a.text] = a.get('href')
                print (a.get('href'))'''
    
    
        """
            Mapping:
            1 : Signs and symptoms
            2 : Diagnosis
            3 : Prognosis
            4 : Treatment
    
        """
    
        symptom_text = re.findall ( '<h2><span class="mw-headline" id="Signs_and_symptoms">Signs and symptoms</span>(.*?)<h2>', plain_text, re.DOTALL)
        str1 = ''.join(symptom_text)
        symptoms_object = BeautifulSoup(str1, "lxml")
        #paragraphs = re.findall('<p>(.*?)<p>', str1, re.DOTALL)
        symptom_data = symptoms_object.findAll('p')
        symptom_paragraphs = ""
        for p in symptom_data:
            symptom_paragraphs += p.text
    
        symptom_paragraphs = re.sub(r"/?\[\d+]", '', symptom_paragraphs)
        df_1 = pd.DataFrame(data=[['1', symptom_paragraphs]], columns=columns)
    
        diagnosis_text = re.findall ( '<h2><span class="mw-headline" id="Diagnosis">Diagnosis</span>(.*?)<h2>', plain_text, re.DOTALL)
        str1 = ''.join(diagnosis_text)
        diagnosis_object = BeautifulSoup(str1, "lxml")
        #paragraphs = re.findall('<p>(.*?)<p>', str1, re.DOTALL)
        diagnosis_data = diagnosis_object.findAll('p')
        diagnosis_paragraphs = ""
        for p in diagnosis_data:
            diagnosis_paragraphs += p.text
    
        diagnosis_paragraphs = re.sub(r"/?\[\d+]", '', diagnosis_paragraphs)
        df_2 = pd.DataFrame(data=[['2', diagnosis_paragraphs]], columns=columns)
    
        prognosis_text = re.findall ( '<h2><span class="mw-headline" id="Prognosis">Prognosis</span>(.*?)<h2>', plain_text, re.DOTALL)
        str1 = ''.join(prognosis_text)
        prognosis_object = BeautifulSoup(str1, "lxml")
        #paragraphs = re.findall('<p>(.*?)<p>', str1, re.DOTALL)
        prognosis_data = prognosis_object.findAll('p')
        prognosis_paragraphs = ""
        for p in prognosis_data:
            prognosis_paragraphs += p.text
    
        prognosis_paragraphs = re.sub(r"/?\[\d+]", '', prognosis_paragraphs)
        df_3 = pd.DataFrame(data=[['3', prognosis_paragraphs]], columns=columns)
    
        treatment_text = re.findall ( '<h2><span class="mw-headline" id="Treatment">Treatment</span>(.*?)<h2>', plain_text, re.DOTALL)
        str1 = ''.join(treatment_text)
        treatment_object = BeautifulSoup(str1, "lxml")
        #paragraphs = re.findall('<p>(.*?)<p>', str1, re.DOTALL)
        treatment_data = treatment_object.findAll('p')
        treatment_paragraphs = ""
        for p in treatment_data:
            treatment_paragraphs += p.text
    
        treatment_paragraphs = re.sub(r"/?\[\d+]", '', treatment_paragraphs)
        df_4 = pd.DataFrame(data=[['4', treatment_paragraphs]], columns=columns)
    
        df = pd.DataFrame(columns = columns)
    
        # DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame.
        df = pd.concat([df_1, df_2, df_3, df_4])
    
        print('Fetch completed....')
        return df
    
    
    
    def main():
    
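        # One disease name per line; wrap in a DataFrame so each row is a one-item list.
        with open("disease.txt") as handle:
            names = [line.strip() for line in handle if line.strip()]
        disease_df = pd.DataFrame(names)
    
        columns = ['Category', 'Text']
        df_data = pd.DataFrame(columns=columns)
        size = disease_df.size
        print("Initializing....")
        p = Pool(5)
        # p.map returns a list of DataFrames, so concatenate them before saving.
        df_data = pd.concat(p.map(crawlwiki, disease_df.values.tolist()))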
        """for index, row in disease_df.iterrows():
            print('Iteration {0} out of {1}.....'.format(index+1, size))
            df = crawlwiki(row, columns)
            df_data = df_data.append(df)"""
    
        df_data.to_csv("TagDataset.csv", index=False)
    
    
    
    
    if __name__ == '__main__':
        main()
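
The step that strips footnote markers from the extracted paragraphs can be checked in isolation. A minimal sketch follows, with a simplified pattern and a hypothetical helper name `strip_citations`. One detail worth knowing: the fourth positional parameter of re.sub is count, not flags, so flags should be passed by keyword or baked into a compiled pattern as done here.

```python
import re

# Matches footnote markers such as "[1]" or "[23]".
CITATION = re.compile(r"\[\d+\]")


def strip_citations(text):
    """Remove every Wikipedia-style footnote marker from text."""
    return CITATION.sub("", text)


if __name__ == "__main__":
    print(strip_citations("Fever[1] and cough[23] are common."))
```

Compiling the pattern once also avoids repeating the same regular expression for each of the four sections.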
    

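    【Discussion】:

    • Nice crawler... let me check it out.
    • This question is specifically tagged scrapy, and this answer is not really useful for people using Scrapy.
    • @Blender The question is not focused on which package you use for crawling; it asks for a way to crawl a list of search terms, and that is what is highlighted here. Not the crawling itself.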