Python Scrapy: Convert relative paths to absolute paths (answers)

【Question title】: Python Scrapy: Convert relative paths to absolute paths
【Posted】: 2011-06-27 22:19:35
【Question】:

I have modified my code based on the solutions offered by the great people here; I now get the error shown below the code.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
from dmoz2.items import DmozItem

class DmozSpider(BaseSpider):
    name = "namastecopy2"
    allowed_domains = ["namastefoods.com"]
    start_urls = [
        "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1",
        "http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12",

    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html/body/div/div[2]/table/tr/td[2]/table/tr')
        items = []
        for site in sites:
            item = DmozItem()
            item['manufacturer'] = 'Namaste Foods'
            item['productname'] = site.select('td/h1/text()').extract()
            item['description'] = site.select('//*[@id="info-col"]/p[7]/strong/text()').extract()
            item['ingredients'] = site.select('td[1]/table/tr/td[2]/text()').extract()
            item['ninfo'] = site.select('td[2]/ul/li[3]/img/@src').extract()
            #insert code that will save the above image path for ninfo as an absolute path
            base_url = get_base_url(response)
            relative_url = site.select('//*[@id="showImage"]/@src').extract()
            item['image_urls'] = urljoin_rfc(base_url, relative_url)
            items.append(item)
        return items

My items.py looks like this:

from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    productid = Field()
    manufacturer = Field()
    productname = Field()
    description = Field()
    ingredients = Field()
    ninfo = Field()
    imagename = Field()
    image_paths = Field()
    relative_images = Field()
    image_urls = Field()
    pass

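I need the relative paths the spider collects for items['relative_images'] converted to absolute paths and saved in items['image_urls'], so that I can download the images from within the spider itself. For example, the relative_images path the spider fetches is '../../files/images/small/8270-BrowniesHiResClip.jpg'; this should be converted to 'http://namastefoods.com/files/images/small/8270-BrowniesHiResClip.jpg' and stored in items['image_urls'].

I also need the items['ninfo'] path stored as an absolute path.

Error when running the code above: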

2011-06-28 17:18:11-0400 [scrapy] INFO: Scrapy 0.12.0.2541 started (bot: dmoz2)
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Enabled item pipelines: MyImagesPipeline
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-06-28 17:18:11-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-06-28 17:18:11-0400 [namastecopy2] INFO: Spider opened
2011-06-28 17:18:12-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: None)
2011-06-28 17:18:12-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=12> (referer: <None>)
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
        item['image_urls'] = urljoin_rfc(base_url, relative_url)
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
        unicode_to_str(ref, encoding))
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
        raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
    exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list

2011-06-28 17:18:15-0400 [namastecopy2] DEBUG: Crawled (200) <GET http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: None)
2011-06-28 17:18:15-0400 [namastecopy2] ERROR: Spider error processing <http://www.namastefoods.com/products/cgi-bin/products.cgi?Function=show&Category_Id=4&Id=1> (referer: <None>)
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 1137, in mainLoop
        self.runUntilCurrent()
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/base.py", line 757, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 243, in callback
        self._startRunCallbacks(result)
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 312, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/internet/defer.py", line 328, in _runCallbacks
        self.result = callback(self.result, *args, **kw)
      File "/***/***/***/***/***/***/spiders/namaste_copy2.py", line 30, in parse
        item['image_urls'] = urljoin_rfc(base_url, relative_url)
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc
        unicode_to_str(ref, encoding))
      File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str
        raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__)
    exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list

2011-06-28 17:18:15-0400 [namastecopy2] INFO: Closing spider (finished)
2011-06-28 17:18:15-0400 [namastecopy2] INFO: Spider closed (finished)

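Thanks. -TM

【Comments on the question】:

Don't update your question without also leaving comments; otherwise we are not notified and won't know you need more information. If you find any replies useful, upvote them as well.
Also put your log/traceback in a code block.
Don't forget to upvote the replies you find useful.

Tags: python scrapy imagesource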

【Solution 1】:

From the Scrapy docs:

def parse(self, response):
    # ... code omitted
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, self.parse)

In other words, the response object has a method that does exactly this.
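
As a rough illustration of what that method computes: in current Scrapy releases, response.urljoin(url) resolves url against the response's base URL much as the standard library's urljoin does. The sketch below uses only the standard library under Python 3 (an assumption, since the question itself runs Python 2) together with the page URL and image path from the question, so it can be run without Scrapy.

```python
from urllib.parse import urljoin

# Page URL and relative image path taken from the question.
page_url = ("http://www.namastefoods.com/products/cgi-bin/products.cgi"
            "?Function=show&Category_Id=4&Id=1")
relative_src = "../../files/images/small/8270-BrowniesHiResClip.jpg"

# Each ".." climbs one directory up from /products/cgi-bin/.
absolute_src = urljoin(page_url, relative_src)
print(absolute_src)
```

Inside a spider callback, the corresponding call would be response.urljoin(relative_src).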

【Discussion】:

【Solution 2】:

What I do is:

import urlparse
...

def parse(self, response):
    ...
    urlparse.urljoin(response.url, extractedLink.strip())
    ...

Note the strip(): I sometimes run into odd links such as:

<a href="
              /MID_BRAND_NEW!%c2%a0MID_70006_Google_Android_2.2_7%22%c2%a0Tablet_PC_Silver/a904326516.html
            ">MID BRAND NEW!&nbsp;MID 70006 Google Android 2.2 7"&nbsp;Tablet PC Silver</a>

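A small sketch of why the strip() matters, assuming Python 3. The page URL and the shortened href are hypothetical stand-ins for the link above. How urljoin treats unstripped whitespace may differ between Python versions, so only the stripped case is shown.

```python
from urllib.parse import urljoin

page_url = "http://www.example.com/list/page1.html"  # hypothetical page URL
# An href as it might come out of extract(), padded with newlines and spaces.
raw_href = "\n              /tablets/a904326516.html\n            "

clean_href = raw_href.strip()  # drop the surrounding whitespace first
absolute = urljoin(page_url, clean_href)
print(absolute)
```
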
【Discussion】:

It is worth adding that urljoin() does not concatenate URLs; instead, URL parts such as the netloc or the path are overwritten. So urljoin('http://www.myeshop.com/category/subcategory', '/category/subcategory/item001.php') does not return http://www.myeshop.com/category/subcategory/category/subcategory/item001.php, but the more sensible http://www.myeshop.com/category/subcategory/item001.php.
Warning: for Python 3, according to the docs: "The urlparse module is renamed to urllib.parse in Python 3. The 2to3 tool will automatically adapt imports when converting your sources to Python 3."
In Python 3 this becomes: import urllib.parse, used as urllib.parse.urljoin(response.url, extractedLink.strip())
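
The overwrite-rather-than-append behaviour described in these comments can be checked directly. A minimal sketch, assuming Python 3; the shop URLs come from the comment above, and the second call with a bare file name is an added illustration.

```python
from urllib.parse import urljoin

base = "http://www.myeshop.com/category/subcategory"

# A reference starting with "/" replaces the whole path of the base URL.
rooted = urljoin(base, "/category/subcategory/item001.php")

# A bare file name replaces only the last path segment ("subcategory").
sibling = urljoin(base, "item001.php")

print(rooted)
print(sibling)
```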

【Solution 3】:

from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc

base_url           = get_base_url(response)
relative_url       = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = [urljoin_rfc(base_url,ru) for ru in relative_url]

Or you can extract just one item:

base_url           = get_base_url(response)
relative_url       = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url,relative_url)

The error occurs because you are passing a list rather than a str to the urljoin function.
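
To make the list-versus-string point concrete, here is a sketch of a helper that always works on the list returned by extract(). It uses the Python 3 standard library rather than urljoin_rfc, and the name absolutize is hypothetical.

```python
from urllib.parse import urljoin

def absolutize(base_url, extracted):
    """Join every extracted (possibly relative) URL against base_url."""
    return [urljoin(base_url, rel.strip()) for rel in extracted]

base_url = "http://www.namastefoods.com/products/cgi-bin/products.cgi"
# extract() hands back a list, even when only one node matched.
extracted = ["../../files/images/small/8270-BrowniesHiResClip.jpg"]

image_urls = absolutize(base_url, extracted)
print(image_urls)
```

Scrapy's images pipeline documentation describes image_urls as a list of URLs, so keeping the list form is probably the safer of the two variants above.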

【Discussion】:

Thanks @buffer. I tried your code above and got the following error: item['image_urls'] = urljoin_rfc(base_url, relative_url) File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/url.py", line 37, in urljoin_rfc unicode_to_str(ref, encoding)) File "/Library/Python/2.6/site-packages/Scrapy-0.12.0.2541-py2.6.egg/scrapy/utils/python.py", line 96, in unicode_to_str raise TypeError('unicode_to_str must receive a unicode or str object, got %s' % type(text).__name__) exceptions.TypeError: unicode_to_str must receive a unicode or str object, got list
Could you post the code snippet that causes the error (update your question with the code)? The object you are passing is neither a string nor unicode, hence this error. Look up the error at dev.scrapy.org/browser/scrapy/utils/python.py?rev=1103 and you will see what causes it.
I just updated my question and included the full error I get. I will also look at the link you included above. Thanks.

【Solution 4】:

A few notes:

items = []
for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    items.append(item)
return items

I do it differently:

for site in sites:
    item = DmozItem()
    item['manufacturer'] = 'Namaste Foods'
    ...
    yield item

Then:

relative_url = site.select('//*[@id="showImage"]/@src').extract()
item['image_urls'] = urljoin_rfc(base_url, relative_url)

extract() always returns a list, because an XPath query always returns a list of selected nodes.

Do this instead:

relative_url = site.select('//*[@id="showImage"]/@src').extract()[0]
item['image_urls'] = urljoin_rfc(base_url, relative_url)
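
Both suggestions (yielding items and taking the first extracted value) can be combined. The sketch below stands in for the spider loop with plain dictionaries so that it runs without Scrapy; build_items and the sample rows are hypothetical, and Python 3 is assumed. It also skips the join when nothing was extracted, since [0] on an empty list raises IndexError.

```python
from urllib.parse import urljoin

def build_items(base_url, rows):
    """Yield one item per row; each row is the list an extract() call would return."""
    for srcs in rows:
        item = {"manufacturer": "Namaste Foods"}
        if srcs:  # guard: an empty result has no element 0
            item["image_urls"] = [urljoin(base_url, srcs[0])]
        yield item

base_url = "http://www.namastefoods.com/products/cgi-bin/products.cgi"
# First row has an image path, second row matched nothing.
rows = [["../../files/images/small/8270-BrowniesHiResClip.jpg"], []]

items = list(build_items(base_url, rows))
print(items)
```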

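【Discussion】:

Don't forget that urljoin_rfc has been deprecated since 0.14; as Pablo Hoffman (a Scrapy developer) points out, urljoin from urlparse is sufficient.

【Solution 5】:

A more general way to get an absolute URL is: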

import urlparse

def abs_url(url, response):
  """Return absolute link"""
  base = response.xpath('//head/base/@href').extract()
  if base:
    base = base[0]
  else:
    base = response.url
  return urlparse.urljoin(base, url)

This also works when a base element is present.

In your case, you would use it like this:

def parse(self, response):
  # ...
  for site in sites:
    # ...
    image_urls = site.select('//*[@id="showImage"]/@src').extract()
    if image_urls: item['image_urls'] = abs_url(image_urls[0], response)

【Discussion】: