Beautiful Soup replaces certain symbols in a URL with other symbols

【Question title】: Beautiful Soup replaces certain symbols in a URL with other symbols
【Posted】: 2018-03-06 02:24:57
【Question】:

I am parsing a web page with Beautiful Soup and trying to retrieve all the links inside h3 tags:

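import requests
from bs4 import BeautifulSoup

page = requests.get("https://www....")  # URL elided in the original post
soup = BeautifulSoup(page.text, "html.parser")
links = []
for item in soup.find_all('h3'):
    links.append(item.a['href'])  # href of the first <a> inside each <h3>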

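However, the links found differ from the links on the page. For example, where the page contains the link http://www.estense.com/?p=116872, Beautiful Soup returns http://www.estense.com/%3Fp%3D116872, replacing '?' with '%3F' and '=' with '%3D'. Why is that?

Thanks.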

【Comments】:

This is URL escaping, but I can't reproduce the problem. Which version of Python are you using?
I'm using Python 3.5.3.

Tags: web-scraping character-encoding beautifulsoup

【Solution 1】:

You can use urllib.parse to unquote the URL:

from urllib import parse
parse.unquote(item.a['href'])
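
As a minimal sketch of how this could be applied to the example from the question: the helper name `decode_links` is hypothetical, and the hard-coded list stands in for the `href` values collected from the page.

```python
from urllib.parse import unquote


def decode_links(hrefs):
    """Return a new list with percent-escapes such as %3F and %3D decoded."""
    return [unquote(href) for href in hrefs]


# Stand-in for the hrefs gathered from the <h3> tags in the question.
encoded = ["http://www.estense.com/%3Fp%3D116872"]
decoded = decode_links(encoded)
print(decoded[0])  # http://www.estense.com/?p=116872
```

Note that `unquote` decodes every escape in the string, so a URL containing deliberately escaped characters (for example `%26` inside a query value) would change meaning. It is probably safest to apply it only when the escaping is known to be unwanted, as in this case.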

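【Discussion】:

Thanks. Could you explain the root cause of this problem?
The cause may be that the link in the <a href> attribute is already encoded in the page itself. You would never notice this when viewing the source, because the browser decodes the URL automatically.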
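
To illustrate what "encoded" means here, the following sketch uses only the standard library. It shows that percent-encoding `?p=116872` produces the string from the question, and that a URL parser treats the encoded form as part of the path rather than as a query string. The variable names are illustrative, and whether the page's HTML really contains the encoded form is the answerer's hypothesis, not something this snippet verifies.

```python
from urllib.parse import quote, unquote, urlsplit

# Percent-encoding "?" and "=" yields the escapes seen in the question.
escaped = quote("?p=116872", safe="")
print(escaped)  # %3Fp%3D116872

# In the encoded URL the escapes belong to the path; there is no query.
encoded = "http://www.estense.com/%3Fp%3D116872"
parts = urlsplit(encoded)
print(parts.path)   # /%3Fp%3D116872
print(parts.query)  # (empty string)

# After unquoting, "?" acts as a delimiter again and the query reappears.
decoded_parts = urlsplit(unquote(encoded))
print(decoded_parts.path)   # /
print(decoded_parts.query)  # p=116872
```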