【Question Title】: Yet another Python encoding issue
【Posted】: 2019-06-16 10:58:24
【Question】:

I have tried this code:

import logging
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# `c` is the asker's own settings object (first_url, base_url); its import is not shown.

def process_request(url):
    # Send a browser-like User-Agent and return the raw response bytes.
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read()

def get_links():
    url = c.first_url
    html = process_request(url)
    details_pages = []
    soup = BeautifulSoup(html, 'html.parser')
    # Pagination links inside the page-list bar.
    links = soup.select(".pagelist-bar a")
    print(links)
    for l in links:
        print(l)
        if l.has_attr('href'):
            href_ = l['href']
            detail = c.base_url + href_
            logging.info("Page with List of persons: %s", detail)
            details_pages.append(detail)
    return details_pages

def person_urls():
    pages = get_links()
    for l in pages:
        print("link: %s" % l)
        doc = process_request(l)
        soup = BeautifulSoup(doc, 'html.parser')
        fichas = soup.select(".ficha")
        print(fichas)

At this URL: http://www.guardiacivil.es/es/colaboracion/buscados/index.html

No matter which strategy I use, this line:

<a href="/es/colaboracion/buscados/index.html?buscar=si&category=abcd&notshown=">

is always converted to:

<a href="/es/colaboracion/buscados/index.html?buscar=si&category=abcd¬shown=">

&notshown= becomes ¬shown=
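
One plausible explanation for this substitution, offered as a sketch rather than a confirmed diagnosis: HTML defines a few legacy named character references that are recognized even without a trailing semicolon, and `&not` (¬) is one of them. Python's standard-library `html.unescape` applies that rule, so it reproduces the same substitution on the raw attribute text. The sample `href` is copied from the question; the helper name `decode_entities` is made up for illustration.

```python
import html

def decode_entities(text):
    """Decode HTML character references the way html.unescape does."""
    return html.unescape(text)

raw_href = "/es/colaboracion/buscados/index.html?buscar=si&category=abcd&notshown="

# "&not" is decoded to U+00AC even though no semicolon follows it;
# "&category" matches no named reference and is left untouched.
decoded = decode_entities(raw_href)
print(decoded)

# If the page had escaped its ampersands as "&amp;", decoding would
# give back the intended query string unchanged.
escaped_href = raw_href.replace("&", "&amp;")
restored = decode_entities(escaped_href)
print(restored)
```

The second half suggests why the problem only shows up on pages that write a bare `&` inside attribute values.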

I have tried some of the suggestions in THESE POSTS, but with no results so far.

In addition, I always get this error:

  self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1250, in _send_request
  self.putrequest(method, url, **skips)
  File "/usr/lib/python3.6/http/client.py", line 1117, in putrequest
  self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xac' in position 69: ordinal not in range(128)
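
The traceback is consistent with the mangled link being requested as-is: `http.client` encodes the request line as ASCII, and `¬` (`'\xac'`) lies outside that range. The sketch below reproduces the failure on a sample path and shows that percent-encoding with `urllib.parse.quote` yields an ASCII-only string. `to_ascii_url` is a hypothetical helper, not part of the original code. Percent-encoding only avoids the exception: the server would still receive `%C2%ACshown` instead of the intended `&notshown`, so the parsing step remains the thing to fix.

```python
from urllib.parse import quote

def to_ascii_url(url):
    """Percent-encode non-ASCII characters, leaving URL delimiters intact."""
    return quote(url, safe=":/?&=#%")

# Sample path containing the mangled character from the question.
bad_path = "/es/colaboracion/buscados/index.html?buscar=si&category=abcd\u00acshown="

try:
    bad_path.encode("ascii")  # the same kind of encoding step that fails in the traceback
except UnicodeEncodeError as exc:
    print("cannot send as-is:", exc.reason)

safe_path = to_ascii_url(bad_path)
safe_path.encode("ascii")  # no exception: the string is now ASCII-only
print(safe_path)
```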

Can anyone help me?

【Question Discussion】:

    Tags: python-3.x beautifulsoup urllib


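    【Solution 1】:

    Maybe you should try replacing html.parser with html in the BeautifulSoup call. The 'html' feature lets BeautifulSoup choose among the HTML parsers that are installed, which may avoid the built-in parser's handling of &not: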

    soup = BeautifulSoup(html, 'html')
    links = soup.select(".pagelist-bar a")
    # Print the href of each pagination link
    for x in links:
        print(x.get('href'))
    

    Output:

    /es/colaboracion/buscados/index.html?pagina=1&buscar=si&category=&notshown=
    /es/colaboracion/buscados/index.html?pagina=2&buscar=si&category=&notshown=
    /es/colaboracion/buscados/index.html?pagina=3&buscar=si&category=&notshown=
    /es/colaboracion/buscados/index.html?pagina=4&buscar=si&category=&notshown=
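
To check that links extracted this way still carry the `notshown` parameter, the query string can be inspected with `urllib.parse`. This is a minimal sketch using one of the `href` values printed above; `BASE_URL` is an assumption derived from the site address in the question and stands in for the asker's `c.base_url`, whose value is not shown.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

# Assumed base address; the asker's real c.base_url is not shown.
BASE_URL = "http://www.guardiacivil.es"
href = "/es/colaboracion/buscados/index.html?pagina=1&buscar=si&category=&notshown="

# Resolve the site-relative link against the base address.
full_url = urljoin(BASE_URL, href)

# keep_blank_values=True keeps parameters with empty values,
# such as "category=" and "notshown=".
params = parse_qs(urlsplit(full_url).query, keep_blank_values=True)

print(full_url)
print(sorted(params))
```

Using `urljoin` instead of plain string concatenation is a design choice of this sketch; it also copes with a base address that ends in a slash.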
    

    【Discussion】:
