Advanced Scrapy

When we crawl a site with the Scrapy framework, we give the spider its entry URLs in a list named start_urls. The first pages crawled are the ones listed there.

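Requirement:

Suppose the start URLs need their own handler, while the URLs crawled afterwards need a different one. That means we need two callback functions.

How do we make the start URLs use a callback other than the default parse?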

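The parent class (scrapy.Spider) defines a method named start_requests, which decides which callback each start URL is dispatched to, so all we have to do is override it. In Scrapy, start_requests yields Request objects, and each Request can carry its own callback.

Scrapy then puts each request into the scheduler automatically and crawls it.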

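import scrapy
from scrapy import Request


class ZhipinSpider(scrapy.Spider):
    name = 'zhipin'
    allowed_domains = ['zhipin.com']
    start_urls = ['xxx.com']  # placeholder; a real entry needs a scheme such as https://

    # Option 1: yield the requests one by one
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse2)

    # Option 2: return a list of requests instead.
    # Keep only one of the two definitions in a real spider.
    # def start_requests(self):
    #     req_list = []
    #     for url in self.start_urls:
    #         req_list.append(Request(url=url, callback=self.parse2))
    #     return req_list

    def parse(self, response):
        # Default callback, used by requests that set no callback
        pass

    def parse2(self, response):
        # Callback for the start URLs
        pass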

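Using yield and returning a list have the same effect here: Scrapy calls iter() on whatever start_requests returns, so in both cases it ends up consuming an iterator.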
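To make the point concrete without a network or a Scrapy install, here is a toy model of the dispatch. It is only a sketch: FakeRequest, ToySpider and toy_engine are hypothetical names invented for this illustration, and the real Scrapy engine is asynchronous and far more involved. The sketch shows only that a generator and a list behave the same once iter() is applied, and that each request is routed to the callback it carries.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus its callback."""
    url: str
    callback: Callable[[str], str]


class ToySpider:
    start_urls = ["https://example.com/a", "https://example.com/b"]

    def parse2(self, body: str) -> str:
        return f"parse2 handled {body}"

    def start_requests_gen(self) -> Iterable[FakeRequest]:
        # Option 1: a generator that yields requests lazily
        for url in self.start_urls:
            yield FakeRequest(url=url, callback=self.parse2)

    def start_requests_list(self) -> List[FakeRequest]:
        # Option 2: build the whole list up front
        return [FakeRequest(url=url, callback=self.parse2) for url in self.start_urls]


def toy_engine(start_requests: Iterable[FakeRequest]) -> List[str]:
    """Consume the start requests through iter() and call each callback."""
    results = []
    for request in iter(start_requests):
        fake_body = f"<body of {request.url}>"  # no download happens in this sketch
        results.append(request.callback(fake_body))
    return results


spider = ToySpider()
from_generator = toy_engine(spider.start_requests_gen())
from_list = toy_engine(spider.start_requests_list())
print(from_generator == from_list)
```

The practical difference is memory: the generator produces one request at a time, while the list holds all of them before the first one is scheduled.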

Selectors

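The argument passed to parse is a response object, and we need a selector to extract data from it.

There are two ways to do this. The first is built into the response, so we can query it directly:

response.xpath("//div[@id='content-list']/div[@class='item']")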

The second is to import the selector class and build the selector explicitly:

from scrapy.selector import Selector  # used in place of the deprecated HtmlXPathSelector

...
def parse(self, response):
    hxs = Selector(response=response)
    items = hxs.xpath("//div[@id='content-list']/div[@class='item']")

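Lookup rules:

hxs = Selector(response=response).xpath('//div') # div tags anywhere among the descendants

hxs = Selector(response=response).xpath('/div')  # div tags that are direct children of the root

hxs = Selector(response=response).xpath('//div[2]') # among all descendants, every div that is the second div child of its parent

hxs = Selector(response=response).xpath('//a[@id]') # a tags that have an id attribute

hxs = Selector(response=response).xpath('//a[@id="i1"]') # a tags whose id equals a given value ("i1" is an assumed example value); an id should be unique, but the same predicate works for other attributes, which may match several tags

hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]') # both conditions must hold (AND); the second predicate is an assumed example

hxs = Selector(response=response).xpath('//a[contains(@href, "link")]') # a tags whose href contains "link"

hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]') # match with a regular expression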

Types returned by a query:

Selector objects: xpath('/html/body/ul/li/a/@href')
List of strings:  xpath('/html/body/ul/li/a/@href').extract()
Single value:     xpath('//body/ul/li/a/@href').extract_first()
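The predicates and return shapes above can be tried without Scrapy using the standard library's xml.etree.ElementTree. This is a stand-in, not Scrapy's selector API: ElementTree covers only a subset of XPath (no contains() and no re:test), searches from an element with a leading `.//`, and returns elements rather than selectors. The markup below is a well-formed fragment adapted from the sample HTML in this post, and the list and next() expressions only imitate what extract() and extract_first() return.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment adapted from the sample HTML in this post
doc = """<body>
  <ul>
    <li class="item-"><a id="i1" href="link.html">first item</a></li>
    <li class="item-0"><a id="i2" href="llink.html">first item</a></li>
    <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
  </ul>
  <div><a href="llink2.html">second item</a></div>
</body>"""
root = ET.fromstring(doc)

# a tags that have an id attribute
ids = [a.get("id") for a in root.findall(".//a[@id]")]

# a tags whose id equals a given value
i1_hrefs = [a.get("href") for a in root.findall(".//a[@id='i1']")]

# Two predicates in a row: both must hold (AND)
both = root.findall(".//a[@href='link.html'][@id='i1']")

# Positional predicate: each li that is the second li child of its parent
second_li_classes = [li.get("class") for li in root.findall(".//li[2]")]

# A list of values, similar in shape to what .extract() returns
hrefs = [a.get("href") for a in root.findall(".//a")]

# A single value or None, similar in shape to what .extract_first() returns
first_href = next(iter(hrefs), None)
missing = next((a.get("href") for a in root.findall(".//a[@id='nope']")), None)

print(ids, i1_hrefs, len(both), second_li_classes, first_href, missing)
```

The None result for a query with no match is the main reason to prefer the single-value form over indexing the list, which would raise IndexError on an empty result.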

How to use Scrapy's selectors on their own:

from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
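response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# Assumed query (the original expression is truncated): href of the a tag whose id is "i1"
obj = response.xpath('//a[@id="i1"]/@href').extract_first()
print(obj)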

Standalone usage