Scrapy: displaying XPath text using lxml

【Question title】: Scrapy displaying xpath text using lxml
【Posted】: 2023-03-10 11:54:01
【Question】:

How can I get my parse_page to show the text and numeric values for the item title? I can only get the href to show.

    def parse_page(self, response):
        self.log("\n\n\n Page for one device \n\n\n")
        self.log('Hi, this is the parse_page page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('./cell')
            #... populate Items
        for cells in allcells:
            item = CiqdisItem()
            item['title'] = cells.get(".//text()")
            item['link'] = cells.get("href")
            yield item

My XML file:

<row>
<cell type="html">
<input type="checkbox" name="AF2C4452827CF0935B71FAD58652112D" value="AF2C4452827CF0935B71FAD58652112D" onclick="if(typeof(selectPkg)=='function')selectPkg(this);">
</cell>
<cell type="plain" style="width: 50px; white-space: nowrap;" visible="false">http://qvpweb01.ciq.labs.att.com:8080/dis/metriclog.jsp?PKG_GID=AF2C4452827CF0935B71FAD58652112D&amp;view=list</cell>
<cell type="plain">6505550000</cell>
<cell type="plain">probe0</cell>
<cell type="href" style="width: 50px; white-space: nowrap;" href="metriclog.jsp?PKG_GID=AF2C4452827CF0935B71FAD58652112D&view=list">
UPTR
<input id="savePage_AF2C4452827CF0935B71FAD58652112D" type="hidden" value="AF2C4452827CF0935B71FAD58652112D">
</cell>
<cell type="href" href="/dis/packages.jsp?show=perdevice&device_gid=3651746C4173775343535452414567746D75643855673D3D53564A6151624D41716D534C68395A6337634E2F62413D3D&hwdid=probe0&mdn=6505550000&subscrbid=6505550000&triggerfilter=&maxlength=100&view=timeline&date=20100716T050314876" style="white-space: nowrap;">2010-07-16 05:03:14.876</cell>
<cell type="plain" style="width: 50px; white-space: nowrap;"></cell>
<cell type="plain" style="white-space: nowrap;"></cell>
<cell type="plain" style="white-space: nowrap;">2012-10-22 22:40:15.504</cell>
<cell type="plain" style="width: 70px; white-space: nowrap;">1 - SMS_PullRequest_CS</cell>
<cell type="href" style="width: 50px; white-space: nowrap;" href="/dis/profile_download?profileId=4294967295">4294967295</cell>
<cell type="plain" style="width: 50px; white-space: nowrap;">250</cell>
</row>

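Below is my latest edit, showing both methods. The problem is that the first method does not parse all the links in column A in order: the results come out unordered, and when column A is empty it grabs the next link from column B instead. How can I read column A only, skip a row when its column A is null, and keep walking down column A?

Method 2, parse_page, does not iterate over all the rows, so the parse is incomplete. How do I get all the rows?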

    def parse_device_list(self, response):
        self.log("\n\n\n List of devices \n\n\n")
        self.log('Hi, this is the parse_device_list page! %s' % response.url)
        root = lxml.etree.fromstring(response.body)
        for row in root.xpath('//row'):
            allcells = row.xpath('.//cell')
            # first cell contains the link to follow
            detail_page_link = allcells[0].get("href")
            yield Request(urlparse.urljoin(response.url, detail_page_link), callback=self.parse_page)

    def parse_page(self, response):
        self.log("\n\n\n Page for one device \n\n\n")
        self.log('Hi, this is the parse_page page! %s' % response.url)
        xxs = XmlXPathSelector(response)
        for row in xxs.select('//row'):
            for cell in row.select('.//cell'):
                item = CiqdisItem()
                item['title'] = cell.select("text()").extract()
                item['link'] = cell.select("@href").extract()
                yield item

【Comments on the question】:

Tags: xpath xml-parsing screen-scraping scrapy lxml

【Solution 1】:

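Just replace .//text() with text() and href with @href, and evaluate both as XPath expressions. lxml's Element.get() only looks up an attribute by name, which is why only the href shows up.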
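To illustrate the difference in plain lxml, here is a minimal sketch. The sample document and the function name `cells_to_records` are hypothetical; the sample is a small, well-formed stand-in for the rows in the question, not the real response.

```python
from lxml import etree

# Hypothetical, well-formed sample modelled on the rows in the question.
SAMPLE = b"""<rows>
  <row>
    <cell type="plain">probe0</cell>
    <cell type="href" href="/dis/profile_download?profileId=4294967295">4294967295</cell>
  </row>
</rows>"""


def cells_to_records(xml_bytes):
    """Collect the text and href of every cell, in document order."""
    root = etree.fromstring(xml_bytes)
    records = []
    for cell in root.xpath("//row/cell"):
        records.append({
            # xpath() evaluates an XPath expression and returns a list.
            "title": cell.xpath("text()"),  # direct text nodes of the cell
            "link": cell.xpath("@href"),    # empty list when there is no href
        })
    return records


records = cells_to_records(SAMPLE)
for record in records:
    print(record)
```

Note that `cell.get("href")` also works for the link, because `href` is an attribute name; `get()` just cannot evaluate `text()` or any other XPath expression.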

Also, why use lxml at all? Scrapy has built-in XPath selectors, so give them a try:

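    from scrapy.selector import HtmlXPathSelector  # legacy Scrapy selector API

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        for row in hxs.select('//row'):
            for cell in row.select('.//cell'):
                item = CiqdisItem()
                # select() evaluates the XPath; extract() returns the matches as a list
                item['title'] = cell.select("text()").extract()
                item['link'] = cell.select("@href").extract()
                yield item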

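【Comments】:

Thanks @alecxe - I only needed to change hxs to xxs = XmlXPathSelector(response). I also posted another question (link to my second question) about converting the lxml code to Scrapy's built-in xxs. For that one I first tried to do it with xxs and failed, until someone suggested trying lxml to get it working, and it did work with lxml.
Hey @alecxe - after analyzing my web crawler, I noticed that this does not parse all the rows of each table; it only picks up a few rows rather than all of them. All the rows are on the same page (there is no limit on rows per page). It might help if I edit my question and paste in both of my methods.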
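
For the follow-up in the edited question (read column A only, and skip a row when its column A has no link), one possible approach is sketched below with lxml. It is an illustration under assumptions, not a confirmed fix for the missing rows: the sample document, the `example.com` base URL, and the function name `first_column_links` are all hypothetical, and "column A" is taken to mean the first cell of each row, as in parse_device_list.

```python
from urllib.parse import urljoin  # on Python 2 this was urlparse.urljoin

from lxml import etree

# Hypothetical sample: the second row has no link in its first cell.
SAMPLE = b"""<rows>
  <row><cell type="href" href="detail.jsp?id=1">A1</cell><cell type="href" href="other.jsp?id=1">B1</cell></row>
  <row><cell type="plain"></cell><cell type="href" href="other.jsp?id=2">B2</cell></row>
  <row><cell type="href" href="/dis/detail.jsp?id=3">A3</cell><cell type="plain">B3</cell></row>
</rows>"""

BASE = "http://example.com/dis/packages.jsp"  # stands in for response.url


def first_column_links(xml_bytes, base_url):
    """Return absolute URLs taken from the first cell of each row, in document order."""
    root = etree.fromstring(xml_bytes)
    links = []
    for row in root.xpath("//row"):
        cells = row.xpath("./cell")
        if not cells:
            continue  # row without any cells
        href = cells[0].get("href")  # None when the attribute is absent
        if not href:
            continue  # column A is empty: skip the row, never fall through to column B
        links.append(urljoin(base_url, href))
    return links


links = first_column_links(SAMPLE, BASE)
print(links)
```

Inside a spider, each returned URL could be passed to a Request with callback=self.parse_page instead of being collected in a list.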