【Question title】: Handling redirecting <301> from Indeed with Scrapy
【Posted】: 2022-01-25 00:04:08
【Question description】:

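I am building a job scraper for Indeed, mainly for practice. I have set it up to extract the details of the 100 results on each page. For the search query, I loop a seed list of cities and job types into an f-string of the Indeed URL. I store these URLs in a dictionary keyed by degree type, so that I can get the degree type as a column when I read the results into pandas.

My problem is that I keep getting Redirecting (301). I suspect this is because not all links satisfy the salary requirement. I also tried adding meta={'handle_httpstatus_list': [301]}, but then I get no results at all.

Here is my scraper: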

class IndeedItem(scrapy.Item):
    job_title = Field(output_processor = TakeFirst())
    salary = Field(output_processor = TakeFirst())
    category = Field(output_processor = TakeFirst())
    company = Field(output_processor = TakeFirst())

class IndeedSpider(scrapy.Spider):
    name = 'indeed'
    max_results_per_city = 1000
    #names = pd.read_csv("indeed_names.csv")
    #degree = pd.read_csv("degree_names2.csv",encoding='unicode_escape')
    names = pd.DataFrame({'names':['London', 'Manchester']})
    degree = pd.DataFrame({'degrees':['degree+Finance+£25','degree+Engineering+£25'], 'degree_type':['Finance', 'Engineering']})
    start_urls = defaultdict(list)
    for city in names.names:
        for qualification,name in zip(degree.degrees, degree.degree_type):
            start_urls[name].append(f'https://uk.indeed.com/jobs?q={qualification}%2C000&l={city}&fromage=7&filter=0&limit=100')

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
        'DOWNLOAD_DELAY':2
    }

    def start_requests(self):
        for category, url in self.start_urls.items():
            for link in url:
                yield scrapy.Request(
                    link, 
                    callback = self.parse,
                    #meta={'handle_httpstatus_list': [301]},
                    cb_kwargs = {
                        'page_count':0,
                        'category':category 
                }
            )

    def parse(self, response, page_count, category):
        if page_count > 30:
            return
        indeed = response.xpath('//div[@id="mosaic-zone-jobcards"]//div')
        for jobs in indeed:
            loader = ItemLoader(IndeedItem(), selector = jobs)
            loader.add_value('category', category)
            loader.add_xpath('job_title', './/h2[@class="jobTitle jobTitle-color-purple jobTitle-newJob"]/span//text()')
            loader.add_xpath('salary', './/div[@class="salary-snippet"]/span//text()')
            loader.add_xpath('company', './/a/div[@class="slider_container"]/div[@class="slider_list"]/div[@class="slider_item"]/div[@class="job_seen_beacon"]/table[@class="jobCard_mainContent"]/tbody/tr/td[@class="resultContent"]/div[@class="heading6 company_location tapItem-gutter"]/pre/span[@class="companyName"]//text()')
            yield loader.load_item
        
        next_page = response.xpath('//ul[@class="pagination-list"]/li[5]/a//@href').get()
        page_count += 1
        if next_page is not None:
            yield response.follow(
                next_page, 
                callback = self.parse,
                cb_kwargs = {
                    'page_count': page_count,
                    'category': category
                }
            )

【Question discussion】:

    Tags: python web-scraping scrapy


    【Solution 1】:

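    I did not get any 301 status, but start_urls gave me problems, and your XPath expressions were off.

    This fixes the XPath expressions and builds the URLs inside start_requests: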

    import scrapy
    from collections import defaultdict
    from scrapy import Field
    from scrapy.loader import ItemLoader
    from itemloaders.processors import TakeFirst  # scrapy.loader.processors is deprecated
    import pandas as pd
    
    
    class IndeedItem(scrapy.Item):
        job_title = Field(output_processor=TakeFirst())
        salary = Field(output_processor=TakeFirst())
        category = Field(output_processor=TakeFirst())
        company = Field(output_processor=TakeFirst())
    
    
    class IndeedSpider(scrapy.Spider):
        name = 'indeed'
    
        custom_settings = {
            'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
            'DOWNLOAD_DELAY': 2
        }
    
        max_results_per_city = 1000
        # names = pd.read_csv("indeed_names.csv")
        # degree = pd.read_csv("degree_names2.csv",encoding='unicode_escape')
        names = pd.DataFrame({'names': ['London', 'Manchester']})
        degree = pd.DataFrame({'degrees': ['degree+Finance+£25,000', 'degree+Engineering+£25,000'], 'degree_type': ['Finance', 'Engineering']})
    
        start_urls = defaultdict(list)
    
        def start_requests(self):
            for city in self.names.names:
                for qualification, name in zip(self.degree.degrees, self.degree.degree_type):
                    self.start_urls[name].append(f'https://uk.indeed.com/jobs?q={qualification}&l={city}&fromage=7&filter=0&limit=100')
    
            for category, url in self.start_urls.items():
                for link in url:
                    yield scrapy.Request(
                        link,
                        callback=self.parse,
                        #meta={'handle_httpstatus_list': [301]},
                        cb_kwargs={
                            'page_count': 0,
                            'category': category
                        }
                    )
    
        def parse(self, response, page_count, category):
            if page_count > 30:
                return
            indeed = response.xpath('//div[@class="slider_container"]')
            for jobs in indeed:
                loader = ItemLoader(IndeedItem(), selector=jobs)
                loader.add_value('category', category)
                loader.add_xpath('job_title', './/span[@title]//text()')
                loader.add_xpath('salary', './/div[@class="salary-snippet"]/span//text()')
                loader.add_xpath('company', './/span[@class="companyName"]//text()')
                yield loader.load_item()
    
            next_page = response.xpath('//ul[@class="pagination-list"]//li[last()]/a/@href').get()
            page_count += 1
            if next_page:
                yield response.follow(
                    next_page,
                    callback=self.parse,
                    cb_kwargs={
                        'page_count': page_count,
                        'category': category
                    }
                )
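
To illustrate why the shorter, relative expressions above are more robust than matching a full class string, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree, which supports a small XPath subset. The markup is invented for illustration and is not Indeed's actual HTML; the class names are borrowed from the spider only so that the expressions look familiar.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified job cards; NOT Indeed's real markup.
SAMPLE = """
<div id="cards">
  <div class="slider_container">
    <h2 class="jobTitle jobTitle-newJob"><span title="Data Analyst">Data Analyst</span></h2>
    <span class="companyName">Acme Ltd</span>
    <div class="salary-snippet"><span>£25,000 a year</span></div>
  </div>
  <div class="slider_container">
    <h2 class="jobTitle"><span title="Engineer">Engineer</span></h2>
    <span class="companyName">Globex</span>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Exact match on the whole class string: finds nothing once a single class differs.
brittle = root.findall('.//h2[@class="jobTitle jobTitle-color-purple jobTitle-newJob"]/span')

# Short relative matches, scoped to one card at a time.
jobs = []
for card in root.findall('.//div[@class="slider_container"]'):
    title = card.find('.//span[@title]')
    company = card.find('.//span[@class="companyName"]')
    salary = card.find('.//div[@class="salary-snippet"]/span')
    jobs.append({
        'job_title': title.text if title is not None else None,
        'company': company.text if company is not None else None,
        'salary': salary.text if salary is not None else None,
    })

print(len(brittle))
print(jobs)
```

An @class="..." test compares the entire attribute value, so an expression that lists every class stops matching as soon as the site adds or drops one of them. Scoping each lookup to a single card also keeps a missing salary from shifting values between jobs.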
    
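
The search URLs can also be assembled with urllib.parse.urlencode instead of an f-string, so that characters such as "£", "," and spaces are percent-encoded consistently. This is only a sketch: the helper name build_start_urls and the searches mapping are made up for the example, and the query parameters are the ones from the spider above.

```python
from collections import defaultdict
from urllib.parse import urlencode

BASE_URL = 'https://uk.indeed.com/jobs'


def build_start_urls(cities, searches):
    """Return {category: [url, ...]} with one URL per (city, search) pair."""
    urls = defaultdict(list)
    for city in cities:
        for category, query in searches.items():
            # Same parameters as the spider; urlencode handles the escaping.
            params = {'q': query, 'l': city, 'fromage': 7, 'filter': 0, 'limit': 100}
            urls[category].append(f'{BASE_URL}?{urlencode(params)}')
    return urls


start_urls = build_start_urls(
    ['London', 'Manchester'],
    {
        'Finance': 'degree Finance £25,000',
        'Engineering': 'degree Engineering £25,000',
    },
)

for category, links in start_urls.items():
    print(category, links)
```

Building the mapping in a function rather than in the class body keeps the list from being filled a second time if the spider class is reused in the same process.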

    If you can give an example of a URL that gets redirected, I can help you with that.
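
One way to collect such an example: when 301 is listed in handle_httpstatus_list, Scrapy's redirect middleware stops following the redirect and passes the 301 response itself to the callback. That response normally carries no job cards, which would explain getting no results with that meta key set. The sketch below shows how the target could be worked out from the values a callback has available (response.url, response.status and response.headers.get('Location')). It takes them as plain arguments so that it runs without Scrapy; the helper name redirect_target and the sample Location value are invented.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status, location):
    """Return the absolute URL a redirect response points to, or None."""
    if status not in REDIRECT_CODES or location is None:
        return None
    if isinstance(location, bytes):
        # Scrapy exposes header values as bytes.
        location = location.decode('latin-1')
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(request_url, location)


# Invented values, only to show the call.
print(redirect_target('https://uk.indeed.com/jobs?q=degree', 301, b'/q-degree-jobs.html'))
print(redirect_target('https://uk.indeed.com/jobs?q=degree', 200, None))
```

Logging the requested URL next to the resolved target for each 301 gives a concrete pair to compare, which is usually enough to see what the site is rewriting.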

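    【Discussion】:

    • I don't know why, but when I run my script I get Redirecting (301), whereas with your update it works fine. I hadn't thought of building the URLs in a loop inside start_requests and then reading them back in the same method. Very neat! I also need more practice with XPath; do you know of any good documentation to improve my skills?
    • I didn't use any XPath tutorial, but I'm sure there are plenty.
    • After a while the site sends me to a captcha page. Do you know any tricks to make that happen less often? I am using DOWNLOAD_DELAY; a longer delay might help, but the scrape would take longer.
    • @joe_bill.dollar Take a look at scrapy-rotating-proxies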
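
As a rough sketch of the two suggestions in this discussion, the settings below combine a slower, adaptive crawl rate (Scrapy's built-in AutoThrottle) with the configuration that the scrapy-rotating-proxies README describes. The numbers are guesses to be tuned, the proxy addresses are placeholders, and none of this is guaranteed to avoid the captcha page.

```python
# Slower, adaptive crawling using Scrapy's built-in settings.
polite_settings = {
    'DOWNLOAD_DELAY': 2,                     # lower bound for the delay
    'RANDOMIZE_DOWNLOAD_DELAY': True,        # wait 0.5x to 1.5x DOWNLOAD_DELAY
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    'AUTOTHROTTLE_ENABLED': True,            # adjust the delay to server latency
    'AUTOTHROTTLE_START_DELAY': 2,
    'AUTOTHROTTLE_MAX_DELAY': 30,
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,
}

# Requires `pip install scrapy-rotating-proxies`; the addresses are placeholders.
proxy_settings = {
    'ROTATING_PROXY_LIST': [
        'proxy1.example.com:8000',
        'proxy2.example.com:8031',
    ],
    'DOWNLOADER_MIDDLEWARES': {
        'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
        'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    },
}

# Merge into the dict assigned to the spider's `custom_settings`.
custom_settings = {**polite_settings, **proxy_settings}
print(custom_settings)
```

AutoThrottle never lowers the delay below DOWNLOAD_DELAY, so the existing two-second delay still applies; it only backs off further when the site responds slowly.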