xpath not getting selected

【Question title】: xpath not getting selected
【Posted】: 2013-11-07 19:06:16
【Question】:

I have just started using Scrapy. This is an example of the site I want to crawl:

http://www.thefreedictionary.com/shame

My spider's code:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from dic_crawler.items import DicCrawlerItem

from urlBuilder import *   

class Dic_crawler(BaseSpider):
    name = "dic"
    allowed_domains = ["www.thefreedictionary.com"]
    start_urls = listmaker()[:]
    print start_urls

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="MainTxt"]/table/tbody')
        print 'SITES:\n',sites


        item = DicCrawlerItem()

        item["meanings"] = sites.select('//*[@id="MainTxt"]/table/tbody/tr/td/div[1]/div[1]/div[1]/text()').extract()

        print item

        return item

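listmaker() returns the list of URLs to scrape.

My problem is that the sites variable comes back empty if I select up to 'tbody' in the XPath, whereas if I select only the table I get the part of the site I want.

As a result, I can't select anything that comes after tbody, because tbody itself is never matched.

Also, the site gives multiple meanings, all of which I want to extract, but I only know how to extract a single one.

Thanks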

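【Comments】:

The HTML source of your example page does not contain a tbody element; those are most likely added by your browser's "Inspect" tool. Also, you probably want to loop over <div class="pseg"> and then, for each one ("n.", "tr.v." and so on), loop over <div class="ds-list"> to extract the different definitions in each category.
I added a sample spider as an answer.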
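
To see why the tbody step makes the query come back empty, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree as a stand-in for HtmlXPathSelector (its XPath support is more limited), and the markup is a made-up, simplified fragment, not the real page source.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. Like the raw page source, it has no
# <tbody>, even though a browser's inspector would display one.
RAW_HTML = """
<html><body>
<div id="MainTxt">
  <table>
    <tr><td><div class="pseg"><i>n.</i> first meaning</div></td></tr>
  </table>
</div>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path copied from the inspector: requires a <tbody> that is not in the source.
with_tbody = root.findall(".//div[@id='MainTxt']/table/tbody")

# Same path with the tbody step dropped.
without_tbody = root.findall(".//div[@id='MainTxt']/table")

print(len(with_tbody))
print(len(without_tbody))
```

The first query matches nothing and the second matches the table, which is the same behaviour described in the question. Checking the raw source (for example with "View source" rather than the inspector) shows which elements are really there.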

Tags: python-2.7 xpath scrapy

【Solution 1】:

Here is a spider skeleton to help you get started:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class Dic_crawler(BaseSpider):
    name = "thefreedictionary"
    allowed_domains = ["www.thefreedictionary.com"]
    start_urls = ['http://www.thefreedictionary.com/shame']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # loop on each "noun" or "verb" or something... section
        for category in hxs.select('id("MainTxt")//div[@class="pseg"]'):

            # this is simply to get what's in the <i> tag
            category_name = u''.join(category.select('./i/text()').extract())
            self.log("category: %s" % category_name)

            # for each category, a term can have multiple definitions
            # category from .select() is a selector
            # so you can call .select() on it also,
            # here with a relative XPath expression selecting all definitions
            for definition in category.select('div[@class="ds-list"]'):
                definition_text = u'\n'.join(
                    definition.select('.//text()').extract())
                self.log(" - definition: %s" % definition_text)
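
The nested loop is the key idea: select each category once, then run a relative query on that category so you only get its own definitions. The sketch below shows the same pattern without Scrapy, assuming a hypothetical, simplified fragment and using xml.etree.ElementTree in place of the Scrapy selector; extract_definitions is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mirrors the pseg / ds-list nesting.
SAMPLE = """
<div id="MainTxt">
  <div class="pseg"><i>n.</i>
    <div class="ds-list">first noun meaning</div>
    <div class="ds-list">second noun meaning</div>
  </div>
  <div class="pseg"><i>tr.v.</i>
    <div class="ds-list">a verb meaning</div>
  </div>
</div>
"""


def extract_definitions(markup):
    """Return a list of (category_name, [definitions]) tuples."""
    root = ET.fromstring(markup)
    result = []
    # Outer loop: one element per "noun", "verb", ... section.
    for category in root.findall(".//div[@class='pseg']"):
        name = category.findtext("i", default="").strip()
        # Inner query is relative to this category, so it cannot pick up
        # definitions that belong to another section.
        definitions = [
            "".join(d.itertext()).strip()
            for d in category.findall("div[@class='ds-list']")
        ]
        result.append((name, definitions))
    return result


meanings = extract_definitions(SAMPLE)
for name, definitions in meanings:
    print(name, definitions)
```

Note that the spider in the question calls sites.select() with a path starting with //, which searches the whole document again instead of continuing from sites. A relative path, as in the inner query here, avoids that.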

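【Comments】:

Thanks, it works great. It will take me a while to understand it, though. If it's not too much trouble, could you tell me how self.log works?
Quoting doc.scrapy.org/en/0.18/topics/logging.html#logging-from-spiders: "The recommended way to log from spiders is by using the Spider log() method, which already populates the spider argument of the scrapy.log.msg() function." See also doc.scrapy.org/en/0.18/topics/debug.html#logging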
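
To picture what "already populates the spider argument" means, here is a rough sketch built only on the standard logging module. It is not Scrapy's implementation: msg() is a hypothetical stand-in for scrapy.log.msg(), and MiniSpider is a made-up class that only imitates the shape of the log() helper.

```python
import logging

logging.basicConfig(level=logging.DEBUG)


def msg(message, level=logging.DEBUG, spider=None):
    """Hypothetical stand-in for a module-level log function."""
    # Tag the line with the spider's name when a spider is supplied.
    prefix = "[%s] " % spider.name if spider is not None else ""
    line = prefix + message
    logging.log(level, line)
    return line


class MiniSpider(object):
    """Made-up class; only mimics the idea of a spider's log() method."""

    name = "thefreedictionary"

    def log(self, message, level=logging.DEBUG):
        # Convenience wrapper: forwards to msg() and fills in spider=self,
        # so callers never have to pass the spider themselves.
        return msg(message, level=level, spider=self)


spider = MiniSpider()
logged = spider.log("category: %s" % "n.")
```

So inside parse(), self.log("...") is just a shorter way of logging a message that is attributed to the current spider.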