Use XMLFeedSpider to parse HTML and XML

[Question title]: Use XMLFeedSpider to parse html and xml
[Posted]: 2016-11-03 14:11:53
[Question description]:

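I have a web page from which I collect RSS links. The links point to XML, and I would like to use the XMLFeedSpider functionality to simplify the parsing.

Is that possible?

This would be the flow:

GET example.com/rss (returns HTML)
Parse the HTML and collect the RSS links
For each link, parse the XML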

[Question comments]:

Tags: html xml scrapy web-crawler

[Solution 1]:

I found a simple way to do this, based on the existing example in the documentation and on reading the source code. Here is my solution:

import scrapy  # needed for scrapy.Request below
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    # Scrapy calls start_requests() (plural); a method named start_request is never invoked.
    def start_requests(self):
        urls = ['http://www.example.com/get-feed-links']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)

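    def parse_main(self, response):
        # Each matching <li> holds one feed link; urljoin() resolves relative hrefs.
        for el in response.css("li.feed-links"):
            href = el.css("a::attr(href)").extract_first()
            if href:
                # In the Scrapy 1.x releases this answer was written against,
                # XMLFeedSpider.parse splits the feed and calls parse_node() per node.
                yield scrapy.Request(response.urljoin(href), callback=self.parse)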

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item
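
The two-step flow (collect feed links from an HTML page, then read each `item` node of a feed) can also be tried offline. The following is only a minimal sketch that uses the Python standard library instead of Scrapy; the sample HTML, the sample feed, and the helper names `FeedLinkParser`, `extract_feed_links` and `parse_items` are made up for illustration and are not part of Scrapy or of the spider above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the two HTTP responses.
LINKS_HTML = """
<ul>
  <li class="feed-links"><a href="/feeds/news.xml">News</a></li>
  <li class="other"><a href="/about">About</a></li>
  <li class="feed-links"><a href="http://www.example.com/feeds/blog.xml">Blog</a></li>
</ul>
"""

FEED_XML = """
<rss><channel>
  <item id="1"><name>First</name><description>First entry</description></item>
  <item id="2"><name>Second</name><description>Second entry</description></item>
</channel></rss>
"""


class FeedLinkParser(HTMLParser):
    """Collects href values of <a> tags nested inside <li class="feed-links">."""

    def __init__(self):
        super().__init__()
        self.in_feed_li = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            self.in_feed_li = "feed-links" in classes
        elif tag == "a" and self.in_feed_li and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_feed_li = False


def extract_feed_links(html, base_url):
    """Step 1: return absolute feed URLs found in the HTML page."""
    parser = FeedLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def parse_items(xml_text, itertag="item"):
    """Step 2: turn every <itertag> node into a dict, much like parse_node()."""
    root = ET.fromstring(xml_text.strip())
    items = []
    for node in root.iter(itertag):
        items.append({
            "id": node.get("id"),
            "name": node.findtext("name"),
            "description": node.findtext("description"),
        })
    return items


feed_links = extract_feed_links(LINKS_HTML, "http://www.example.com/get-feed-links")
items = parse_items(FEED_XML)
print(feed_links)
print(items)
```

Unlike the spider, this sketch stores plain text values, whereas `node.xpath(...).extract()` in `parse_node` returns lists of serialized nodes. It is meant for checking selectors and field names against a saved feed, not as a replacement for the spider.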

[Comments]: