[Posted]: 2019-06-16 10:58:24
[Question]:
I have tried this code:
import logging
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# `c` is a configuration object defined elsewhere (not shown here);
# it provides first_url and base_url.

def process_request(url):
    # Send a browser-like User-Agent and return the raw response body.
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read()

def get_links():
    url = c.first_url
    html = process_request(url)
    details_pages = []
    soup = BeautifulSoup(html, 'html.parser')
    # Pagination links that lead to the list pages.
    links = soup.select(".pagelist-bar a")
    print(links)
    for l in links:
        print(l)
        if l.has_attr('href'):
            href_ = l['href']
            detail = c.base_url + href_
            logging.info("Page with List of persons: %s", detail)
            details_pages.append(detail)
    return details_pages

def person_urls():
    pages = get_links()
    for l in pages:
        print("link: %s" % l)
        doc = process_request(l)
        soup = BeautifulSoup(doc, 'html.parser')
        fichas = soup.select(".ficha")
        print(fichas)
on this URL: http://www.guardiacivil.es/es/colaboracion/buscados/index.html
No matter which strategy I use, this line:
<a href="/es/colaboracion/buscados/index.html?buscar=si&category=abcd&notshown=">
is always converted to:
<a href="/es/colaboracion/buscados/index.html?buscar=si&category=abcd¬shown=">
&notshown= becomes ¬shown=
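The conversion can be reproduced without any network access. A minimal sketch, assuming the parser decodes the href with the same HTML5 entity rules as the standard library's html.unescape, which treats "&not" as the legacy entity for "¬" even without a trailing semicolon (the variable names are only for illustration):

```python
import html

# The href as it appears in the page source.
raw_href = "/es/colaboracion/buscados/index.html?buscar=si&category=abcd&notshown="

# HTML5 entity decoding: "&not" is a legacy entity (U+00AC) that is
# recognised even without a trailing semicolon, so it gets replaced.
decoded_href = html.unescape(raw_href)

print(decoded_href)
```

"&category" is left alone because it does not start with any legacy entity name; only "&not" is affected.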
I have tried some of the suggestions in these posts, but nothing has worked so far.
On top of that, I always get this error:
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1250, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.6/http/client.py", line 1117, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xac' in position 69: ordinal not in range(128)
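The traceback shows http.client encoding the request line as ASCII, which fails on '\xac' (the "¬" character). One possible workaround, sketched here under the assumption that every "¬" in an href came from a decoded "&not", is to restore that text and percent-encode anything else that is not ASCII. repair_href is a hypothetical helper, not part of urllib or BeautifulSoup:

```python
from urllib.parse import quote

def repair_href(href):
    # Hypothetical helper: undo the "&not" -> "¬" (U+00AC) entity decoding.
    # Assumption: the page never contains a genuine "¬" in its links.
    restored = href.replace("\u00ac", "&not")
    # Percent-encode any remaining non-ASCII characters while leaving
    # the usual URL delimiters (and existing %XX escapes) untouched.
    return quote(restored, safe="/:?&=%#+,;@")

broken = "/es/colaboracion/buscados/index.html?buscar=si&category=abcd\u00acshown="
fixed = repair_href(broken)
fixed.encode("ascii")  # no longer raises UnicodeEncodeError
print(fixed)
```

In the code above this would be applied to href_ before it is joined with c.base_url.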
Can anyone help me?
[Comments]:
Tags: python-3.x beautifulsoup urllib