Python + BeautifulSoup: How can I get the full link from an href attribute? (Answers)

【Question title】: Python + BeautifulSoup: How can I get full link from href attribute?
【Posted】: 2021-12-25 20:41:21
【Question】:

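I'm putting together a web crawler for practice and learning, but I've run into a problem. My initial thought process was:

On a given page, find all href attributes. If the href value is a valid link, go to that new link and continue.
If the href value is a path (for example "/patients/patient-portal" or "/services/financial-assistance"), I append it to the end of the current URL and continue again.

A problem came up that I hadn't anticipated. Some paths refer to other resources on the site (including images). The current URL is "patients-visitors/advance-directives/", while the resource "services/family-medicine" actually refers to "columbiabasinhospital.org/services/family-medicine". The way I set it up produces an incorrect URL (patients-visitors/advance-directives/services/family-medicine). Hovering over the resource shows the full link. Is there a way to retrieve that full link with BeautifulSoup? Thanks!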

【Comments】:

"I append it to the end of the URL I'm currently on": why the current URL? You should add it to the base URL, columbiabasinhospital.org.

Tags: python beautifulsoup web-crawler

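【Solution 1】:

You can use urljoin (from urllib.parse import urljoin). However, you can also write it yourself!

Suppose the current URL is: http://example.com/path1/path2

When the value of the href attribute is /x, it must be resolved against the site root, giving http://example.com/x

However, when the value of the href attribute is ./x or x, it must be resolved against the directory of the current URL (the last path segment, path2, is dropped), giving http://example.com/path1/x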
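The two rules above can be sketched by hand as follows. This is only a minimal illustration, not a replacement for urljoin: the function name resolve_by_hand is hypothetical, and it deliberately ignores cases such as "../x", "//host/x", query strings, and absolute URLs.

```python
from urllib.parse import urljoin, urlsplit


def resolve_by_hand(current_url, href):
    """Simplified resolver covering only the two cases described above."""
    parts = urlsplit(current_url)
    root = f"{parts.scheme}://{parts.netloc}"

    if href.startswith("/"):
        # Root-relative path: attach it directly to scheme + host.
        return root + href

    # "./x" and "x" mean the same thing, so drop the leading "./".
    if href.startswith("./"):
        href = href[2:]

    # Keep everything up to the last "/" of the current path
    # ("/path1/path2" -> "/path1") and append the href to it.
    directory = parts.path.rsplit("/", 1)[0]
    return f"{root}{directory}/{href}"


base = "http://example.com/path1/path2"
for href in ("/x", "x", "./x"):
    # Print the hand-written result next to what urljoin returns.
    print(href, resolve_by_hand(base, href), urljoin(base, href))
```

For these three inputs the hand-written version agrees with urljoin. In real crawling code, urljoin is the safer choice because it also handles the cases this sketch skips.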

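【Discussion】:

urljoin handles both cases.
@Musa I just tested it, and it can't handle the second one!!!
@AlirezaKavian A link with href x on a page whose URL is http://example.com/path1 will link to http://example.com/x. That is the expected behavior. A current URL without a trailing slash affects the join.
@IainShelvington Yes, you're right. /x or x are the same. I've edited it.
@AlirezaKavian Joining http://example.com/path1 with ./x should still give http://example.com/x; returning http://example.com/path1/x would be incorrect.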
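The trailing-slash point raised in these comments can be checked with a short sketch. The variable names are illustrative only.

```python
from urllib.parse import urljoin

# Without a trailing slash, "path1" is treated as a file name and is
# replaced by the relative href.
without_slash = urljoin("http://example.com/path1", "x")

# With a trailing slash, "path1/" is treated as a directory and the
# relative href is placed inside it.
with_slash = urljoin("http://example.com/path1/", "x")

print(without_slash)  # http://example.com/x
print(with_slash)     # http://example.com/path1/x
```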

【Solution 2】:

Use urllib.parse.urljoin to build the correct URL from a base URL and another URL or path that may be relative.

from urllib.parse import urljoin

new_url = urljoin(current_url, href)

For example:

urljoin('http://localhost/foo/bar/', '/baz/')
# Outputs 'http://localhost/baz/'
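Applied to the crawler from the question, this might look like the sketch below. The helper name resolve_links, the https scheme on the page URL, and the sample href values are assumptions made for illustration. The raw href strings would typically come from something like [a["href"] for a in soup.find_all("a", href=True)] in BeautifulSoup.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def resolve_links(page_url, hrefs):
    """Return absolute, same-site http(s) URLs for raw href values found on page_url."""
    site = urlsplit(page_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Resolve against the page URL, then drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        if parts.netloc != site:
            continue  # stay on the same site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Assumed page URL (scheme added for illustration) and sample hrefs.
page_url = "https://columbiabasinhospital.org/patients-visitors/advance-directives/"
hrefs = ["/services/family-medicine", "mailto:info@example.com", "https://example.com/elsewhere"]
links = resolve_links(page_url, hrefs)
print(links)  # ['https://columbiabasinhospital.org/services/family-medicine']
```

Filtering by scheme and host is a design choice for a practice crawler; a different crawler might want to follow external links as well.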

【Discussion】: