【Question title】: Scrapy xpath utf-8 literals
【Posted】: 2023-03-15 05:33:01
【Question】:

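I need to check a scraped field that contains non-ASCII characters. When I include a UTF-8 literal in the spider, I get this error:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

Here is an example that produces the error: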

# -*- coding: utf-8 -*-
import scrapy

class DummySpider(scrapy.Spider):
    name = 'dummy'
    start_urls = ['http://www.google.com']

    def parse(self, response):
        dummy = response.xpath("//*[contains(.,u'café')]")

Here is the traceback:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/tmp/stack.py", line 9, in parse
    dummy = response.xpath("//*[contains(.,u'café')]")
  File "/usr/lib/pymodules/python2.7/scrapy/http/response/text.py", line 109, in xpath
    return self.selector.xpath(query)
  File "/usr/lib/pymodules/python2.7/scrapy/selector/unified.py", line 97, in xpath
    smart_strings=self._lxml_smart_strings)
  File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
  File "xpath.pxi", line 306, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145829)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

【Comments】:

  • Which version of Python? Are the quote characters around café plain single quotes or backticks?

Tags: python unicode utf-8 scrapy


【Answer 1】:
"//*[contains(.,u'café')]"

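The u'' string literal is Python syntax, not part of XPath, so it has no effect inside the query. Under Python 2 (the version shown in the traceback) the query itself is still a byte string containing the UTF-8 bytes of é, and lxml rejects it. Put the u prefix on the whole Python string instead. Try: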

u"//*[contains(.,'café')]"
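
To see the same quoting rule outside a spider, here is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than Scrapy, assuming Python 3.7 or later (where the u prefix is accepted but optional). ElementTree supports only a subset of XPath and has no contains(), so the sketch swaps in an exact-match predicate. The sample markup and the find_exact helper are hypothetical names made up for illustration.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Hypothetical sample document standing in for a scraped page.
SAMPLE = u"<html><body><p>café</p><p>tea</p></body></html>"


def find_exact(root, text):
    """Return descendants of root whose full text content equals text.

    The u prefix sits on the Python literal; the XPath literal inside it
    uses plain single quotes. ElementTree has no contains(), so an
    exact-match predicate stands in for it here. The text must not
    contain a single quote, since it is pasted into the query as-is.
    """
    query = u".//*[.='{0}']".format(text)
    return root.findall(query)


# Parse from UTF-8 bytes, as a downloaded response body would be.
root = ET.fromstring(SAMPLE.encode("utf-8"))
matches = find_exact(root, u"café")
print([el.tag for el in matches])
```

In a Scrapy spider the equivalent call would be the one from the answer, response.xpath(u"//*[contains(.,'café')]"), with the prefix on the outer string.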
