【Question title】: Python Scrapy: Convert relative paths to absolute paths
【Posted】: 2011-06-27 22:19:35
【Question】:

I have modified my code based on the solutions offered by the great folks here; I now get the error shown below the code.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from dmoz2.items import DmozItem

class DmozSpider(BaseSpider):
    name = "namastecopy2"
    allowed_domains = ["namastefoods.com"]
    start_urls = [
        "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1",
        "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html/body/div/div[2]/table/tr/td[2]/table/tr')
        items = []
        for site in sites:
            item = DmozItem()
            item['manufacturer'] = 'Namaste Foods'
            item['productname'] = site.select('td/h1/text()').extract()
            item['description'] = site.select('//*[@id="info-col"]/p[7]/strong/text()').extract()
            item['ingredients'] = site.select('td[1]/table/tr/td[2]/text()').extract()
            item['ninfo'] = site.select('td[2]/ul/li[3]/img/@src').extract()
            #insert code that will save the above image path for ninfo as an absolute path
            base_url = get_base_url(response)
            relative_url = site.select('//*[@id="showImage"]/@src').extract()
            item['image_urls'] = urljoin_rfc(base_url, relative_url)
            items.append(item)
        return items

My items.py looks like this:

from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    productid = Field()
    manufacturer = Field()
    productname = Field()
    description = Field()
    ingredients = Field()
    ninfo = Field()
    imagename = Field()
    image_paths = Field()
    relative_images = Field()
    image_urls = Field()
    pass

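I need the relative path that the spider gets for items['relative_images'] converted to an absolute path and saved in items['image_urls'], so that I can download the images from within this spider itself. For example, the relative_images path the spider fetches is '../../files/images/small/8270-BrowniesHiResClip.jpg'; this should be converted to 'http://namastefoods.com/files/images/small/8270-BrowniesHiResClip.jpg' and stored in items['image_urls'].

I also need the items['ninfo'] path stored as an absolute path.

Error when running the code above: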

2011-06-28 17:18:11-0400 [scrapy] INFO: Scrapy 0.12.0.2541 started (bot: dmoz2)
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled item pipelines: MyImagesPipeline
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-06-28 17:18:11-0400 [namastecopy2] INFO: Spider opened
2011-06-28 17:18:12-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: None)
2011-06-28 17:18:12-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: <None>)
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
        item['image_urls'] = urljoin_rfc(base_url, relative_url)
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
        unicode_to_str(ref, encoding))
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
        raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
    exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list

2011-06-28 17:18:15-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: None)
2011-06-28 17:18:15-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: <None>)
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
        item['image_urls'] = urljoin_rfc(base_url, relative_url)
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
        unicode_to_str(ref, encoding))
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
        raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
    exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list

2011-06-28 17:18:15-0400 [namastecopy2] INFO: Closing spider (finished)
2011-06-28 17:18:15-0400 [namastecopy2] INFO: Spider closed (finished)

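Thanks. -TM

【Comments】:

  • Don't update your question without also leaving a comment; otherwise we aren't notified and don't know that you need more information. If you find any of the replies useful, upvote them as well.
  • Also put your log/traceback in a code block.
  • Don't forget to upvote the replies you find useful.

Tags: python scrapy imagesource


【Solution 1】:

From the Scrapy docs: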

def parse(self, response):
    # ... code omitted
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, self.parse)

In other words, the response object has a method that does exactly this.

【Comments】:

    【Solution 2】:

    What I did:

    import urlparse
    ...
    
    def parse(self, response):
        ...
        urlparse.urljoin(response.url, extractedLink.strip())
        ...
    

    Note the strip(): I sometimes run into odd links such as:

    <a href="
                  /MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html
                ">MID BRAND NEW!&nbsp;MID 70006 Google Android 2.2 7"&nbsp;Tablet PC Silver</a>
    

    【Comments】:

    • It's worth adding that urljoin() does not simply concatenate the two URLs; parts of the base URL such as the netloc or path are overridden instead. So urljoin('http://www.myeshop.com/category/subcategory', '/category/subcategory/item001.php') does not return http://www.myeshop.com/category/subcategory/category/subcategory/item001.php, but the more sensible http://www.myeshop.com/category/subcategory/item001.php
    • Warning: for Python 3, according to the docs: "The urlparse module is renamed to urllib.parse in Python 3. The 2to3 tool will automatically adapt imports when converting your sources to Python 3."
    • In Python 3 it becomes: import urllib.parse, and then use urllib.parse.urljoin(response.url, extractedLink.strip())
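The strip() advice and the comments above can be checked with the standard library alone. The following is a minimal Python 3 sketch, not part of the original answer; `page_url` and the `cdn.example.com` host are made-up values used only for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration.
page_url = "http://www.example.com/catalog/index.html"

# An href value padded with newlines and spaces, like the odd link in the answer.
raw_href = """
              /MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html
            """

# Remove the surrounding whitespace before joining.
clean_href = raw_href.strip()
absolute_url = urljoin(page_url, clean_href)
print(absolute_url)

base = "http://www.myeshop.com/category/subcategory"

# A reference starting with "/" replaces the whole path of the base URL.
rooted = urljoin(base, "/category/subcategory/item001.php")

# A reference without a leading "/" replaces only the last path segment.
sibling = urljoin(base, "item001.php")

# A full URL replaces everything, including the host (netloc).
elsewhere = urljoin(base, "http://cdn.example.com/img/item001.jpg")

print(rooted)
print(sibling)
print(elsewhere)
```

How an un-stripped URL is treated differs between Python versions, so stripping explicitly keeps the result predictable.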
    【Solution 3】:

    from scrapy.utils.response import get_base_url
    from scrapy.utils.url import urljoin_rfc
    
    base_url           = get_base_url(response)
    relative_url       = site.select('//*[@id="showImage"]/@src').extract()
    item['image_urls'] = [urljoin_rfc(base_url,ru) for ru in relative_url]
    

    Or you can extract just one item:

    base_url           = get_base_url(response)
    relative_url       = site.select('//*[@id="showImage"]/@src').extract()[0]
    item['image_urls'] = urljoin_rfc(base_url,relative_url)
    

    The error occurs because you are passing a list instead of a str to the urljoin function.
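The list-versus-string distinction can be reproduced without Scrapy. The sketch below is an illustration only: it uses the standard library's `urljoin` in place of `urljoin_rfc`, and a hard-coded list as a stand-in for what `extract()` returns. The URLs are taken from the question.

```python
from urllib.parse import urljoin

# Page URL from the question's start_urls.
base_url = (
    "http://www.namastefoods.com/products/cgi-bin/products.cgi"
    "?Function=show&Category_Id=4&Id=1"
)

# Stand-in for the result of extract(): a list, even when only one node matched.
relative_urls = ["../../files/images/small/8270-BrowniesHiResClip.jpg"]

# Option 1: join every element and keep a list of absolute URLs.
image_urls = [urljoin(base_url, ru) for ru in relative_urls]

# Option 2: pick a single element (a str) and join only that one.
first_image_url = urljoin(base_url, relative_urls[0])

print(image_urls)
print(first_image_url)
```

Each "../" removes one directory from the base path, and the query string of the base URL is dropped because the reference supplies its own path.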

    【Comments】:

    • Thanks @buffer. I tried your code above and got the following error: item['image_urls'] = urljoin_rfc(base_url, relative_url) File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc unicode_to_str(ref, encoding)) File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__) exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list
    • Could you post the code snippet that causes the error (update your question with the code)? The object you are passing is neither a string nor unicode, hence the error. Look up the error at dev.scrapy.org/browser/scrapy/utils/python.py?rev=1103 and you will see what causes it.
    • I've just updated my question and included the full error I'm getting. I'll also look at the link you included above. Thanks.
    【Solution 4】:

    A few notes:

    items = []
    for site in sites:
        item = DmozItem()
        item['manufacturer'] = 'Namaste Foods'
        ...
        items.append(item)
    return items
    

    I do it differently:

    for site in sites:
        item = DmozItem()
        item['manufacturer'] = 'Namaste Foods'
        ...
        yield item
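The two styles produce the same items; the generator simply hands each one over as soon as it is built instead of collecting them first. A small sketch with plain dicts standing in for `DmozItem` (the product names are invented for the example):

```python
def parse_as_list(rows):
    """Collect every item in a list and return it at the end."""
    items = []
    for row in rows:
        items.append({"manufacturer": "Namaste Foods", "productname": row})
    return items


def parse_as_generator(rows):
    """Yield each item as soon as it is built."""
    for row in rows:
        yield {"manufacturer": "Namaste Foods", "productname": row}


# Hypothetical rows standing in for the selected table rows.
rows = ["Brownie Mix", "Pancake Mix"]

from_list = parse_as_list(rows)
from_generator = list(parse_as_generator(rows))
print(from_list == from_generator)
```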
    

    Then:

    relative_url = site.select('//*[@id="showImage"]/@src').extract()
    item['image_urls'] = urljoin_rfc(base_url, relative_url)
    

    extract() always returns a list, because an XPath query always returns a list of selected nodes.

    Do this instead:

    relative_url = site.select('//*[@id="showImage"]/@src').extract()[0]
    item['image_urls'] = urljoin_rfc(base_url, relative_url)
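One caveat with indexing `[0]`: if the XPath matches nothing, the list is empty and the index raises an IndexError. A hedged sketch of a guard, using the standard library's `urljoin` and plain lists as stand-ins for `extract()` results; `first_or_none` and `absolute_image_urls` are hypothetical helper names, not Scrapy APIs.

```python
from urllib.parse import urljoin


def first_or_none(values):
    """Return the first element of a list, or None when the list is empty."""
    return values[0] if values else None


def absolute_image_urls(base_url, extracted):
    """Return a one-element list with the absolute URL, or [] if nothing matched."""
    relative_url = first_or_none(extracted)
    if relative_url is None:
        return []
    return [urljoin(base_url, relative_url)]


base_url = "http://www.namastefoods.com/products/cgi-bin/products.cgi"

# Stand-ins for extract() results: one match, and no match at all.
matched = ["../../files/images/small/8270-BrowniesHiResClip.jpg"]
unmatched = []

found = absolute_image_urls(base_url, matched)
missing = absolute_image_urls(base_url, unmatched)
print(found)
print(missing)
```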
    

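    【Comments】:

    • Don't forget that urljoin_rfc has been deprecated since 0.14; as Pablo Hoffman (a Scrapy developer) pointed out, urljoin from urlparse is sufficient.
    【Solution 5】:

    A more general way to get the absolute URL is: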

    import urlparse
    
    def abs_url(url, response):
      """Return absolute link"""
      base = response.xpath('//head/base/@href').extract()
      if base:
        base = base[0]
      else:
        base = response.url
      return urlparse.urljoin(base, url)
    

    This also works when a base element is present.

    In your case you would use it like this:

    def parse(self, response):
      # ...
      for site in sites:
        # ...
        image_urls = site.select('//*[@id="showImage"]/@src').extract()
        if image_urls: item['image_urls'] = abs_url(image_urls[0], response)
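To see why the base element matters, here is a Python 3 sketch of the same idea that needs no Scrapy response. It is an adaptation, not the answer's code: the `<base href>` value is passed in as a plain argument (`base_href`, an assumed parameter), and all URLs are made up for the example.

```python
from urllib.parse import urljoin


def abs_url(url, page_url, base_href=None):
    """Return an absolute URL, honoring an optional <base href> value."""
    # A <base href> may itself be relative, so resolve it against the page URL first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, url.strip())


# Hypothetical page URL.
page_url = "http://www.example.com/shop/item.html"

# No base element: the link is resolved against the page URL.
no_base = abs_url("img/logo.png", page_url)

# Absolute base element: the link is resolved against it instead.
absolute_base = abs_url("img/logo.png", page_url, "http://static.example.com/assets/")

# Relative base element: first resolved against the page, then used as the base.
relative_base = abs_url("img/logo.png", page_url, "/assets/")

print(no_base)
print(absolute_base)
print(relative_base)
```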
    

    【Comments】:
