【Posted】: 2014-12-24 06:33:35
【Question】:
I got a URL from BeautifulSoup: https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions
url=u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'
I want to feed it back into urllib2.urlopen:
import urllib2
source = urllib2.urlopen(url).read()
This is the error I get:
UnicodeEncodeError: 'gbk' codec can't encode character u'\xae' in position 43: illegal multibyte sequence
So I tried this instead:
source = urllib2.urlopen(url.encode("utf-8")).read()
This does return page source, but it is not the same as what the original URL returns:
originalUrl = 'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp®-75-desktop-virtualization-solutions'
originalSource = urllib2.urlopen(originalUrl).read()
originalSource == source
The comparison evaluates to False. Any ideas on how to fix this URL? How can I convert u'\xae' back to the original ®?
【Comments】:
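One approach that is often suggested for non-ASCII URLs is to encode the unicode string to UTF-8 and then percent-escape the resulting bytes, so that only ASCII is handed to urllib2.urlopen. The snippet below is a minimal sketch of that idea, not a confirmed fix for this particular site. The helper name to_ascii_url and the choice of safe characters are assumptions made for illustration, and it also assumes the server expects UTF-8 percent-encoding, which the question does not confirm.

```python
# -*- coding: utf-8 -*-
# Sketch only: percent-encode a unicode URL so it contains nothing but ASCII.
try:
    from urllib import quote          # Python 2, as used in the question
except ImportError:
    from urllib.parse import quote    # Python 3 fallback

url = u'https://www.packtpub.com/virtualization-and-cloud/citrix-xenapp\xae-75-desktop-virtualization-solutions'


def to_ascii_url(unicode_url):
    """Hypothetical helper: return an ASCII-only, percent-encoded URL.

    The URL is first encoded as UTF-8 (an assumption about what the
    server expects), then every byte outside the safe set is escaped.
    '%' is kept safe so already-escaped sequences are not escaped twice.
    """
    return quote(unicode_url.encode('utf-8'), safe=':/?&=#%')


ascii_url = to_ascii_url(url)
print(ascii_url)
# The registered sign U+00AE is the two bytes C2 AE in UTF-8,
# so it appears in the result as %C2%AE.
# ascii_url can then be passed to urllib2.urlopen(ascii_url).read()
```

Whether the response then matches originalSource byte for byte is a separate matter: pages with dynamic content can differ between two requests even when the URL is identical, so a False comparison alone does not prove the URL was wrong.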
Tags: python urllib2 python-unicode urlopen