baohanblog

Scraping the Data and Parsing It

In the spider file, write the following in the parse method:

def parse(self, response):  # called automatically once the request comes back; do the parsing here
        # Method 1: parse with bs4
        # from bs4 import BeautifulSoup
        # soup = BeautifulSoup(response.text, 'lxml')
        # soup.find_all()

        # Method 2: use the built-in CSS selectors
        # Both css and xpath put their parsed results in a list:
        #   extract_first() takes the first match
        #   extract() takes all matches
        # CSS selectors for text and attributes:
            # .link-title::text
            # .link-title::attr(href)
        div_list = response.css('div.link-item')
        for div in div_list:
            title = div.css('.link-title::text').extract_first()
            url = div.css('.link-title::attr(href)').extract_first()
            if not url.startswith('http'):  # relative link: prepend the site root
                url = 'https://dig.chouti.com/' + url
            img_url = div.css('.image-scale::attr(src)').extract_first()
            if not img_url:  # fall back to the alternate image class
                img_url = div.css('.image-item::attr(src)').extract_first()
            print('''
            News title: %s
            News link: %s
            News image: %s
            ''' % (title, url, img_url))

        # Method 3: use the built-in XPath selectors
        # Both css and xpath put their parsed results in a list:
        #   extract_first() takes the first match
        #   extract() takes all matches
        # XPath selectors for text and attributes:
            # /text()
            # /@attribute_name
        div_list = response.xpath('//div[contains(@class,"link-item")]')
        for div in div_list:
            title = div.xpath('.//a[contains(@class,"link-title")]/text()').extract_first()
            url = div.xpath('.//a[contains(@class,"link-title")]/@href').extract_first()
            # if not url.startswith('http'):
            #     url = 'https://dig.chouti.com/' + url
            img_url = div.xpath('.//*[contains(@class,"image-scale")]/@src').extract_first()
            # if not img_url:
            #     img_url = div.xpath('.//*[contains(@class,"image-item")]/@src').extract_first()
            print('''
            News title: %s
            News link: %s
            News image: %s
            ''' % (title, url, img_url))
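Method 1 above is only sketched in comments. Here is a standalone sketch of the same extraction done with BeautifulSoup, run against a small inline HTML snippet that mimics the page structure (the markup below is a made-up stand-in for the real chouti.com page, and I use the stdlib 'html.parser' backend instead of lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the structure the spider parses.
html = '''
<div class="link-item">
  <a class="link-title" href="/link/123">Sample title</a>
  <img class="image-scale" src="https://example.com/pic.jpg">
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', class_='link-item'):
    a = div.find('a', class_='link-title')
    title = a.get_text(strip=True)
    url = a['href']
    if not url.startswith('http'):  # relative link: prepend the site root
        url = 'https://dig.chouti.com/' + url.lstrip('/')
    img = div.find(class_='image-scale')
    img_url = img['src'] if img else None
    print(title, url, img_url)
```

Unlike the css/xpath approaches, bs4 returns Tag objects directly, so there is no extract()/extract_first() step.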
