How Do I Resolve Links with Python?

【Question title】: How Do I Resolve Links with Python?
【Posted】: 2016-02-12 03:59:07
【Question description】:

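This function takes a string as input. If the string starts with http:// or https://, the function assumes it is an absolute link. If the string starts with /, the function treats it as a root-relative link and converts it to an absolute link by prepending base. Anything else is treated as a relative link and appended to base after a /.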

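Note that base is currently a global variable. My main concern is that this function makes too many assumptions. Is there a way to resolve URLs without making so many assumptions?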

def get_url(item):
    #absolute link
    if item.startswith('http://') or item.startswith('https://'):
        url = item
    #root-relative link
    elif item.startswith('/'):
        url = base + item
    else:
        url = base + "/" + item
    return url

【Question comments】:

Try using the urlparse module.
If you don't mind, could you implement this using the urlparse module? If not, that's fine too.

Tags: python http parsing url hyperlink

【Solution 1】:

Use urljoin from the urlparse module (in Python 3 it lives in urllib.parse).

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base = 'http://myserver.com'

def get_url(item):
    return urljoin(base, item)

urljoin handles both absolute and relative links on its own.

Example

# Parenthesized print works in both Python 2 and Python 3
print(get_url('/paul.html'))
print(get_url('//otherserver.com/paul.html'))
print(get_url('https://paul.com/paul.html'))
print(get_url('dir/paul.html'))

Output

http://myserver.com/paul.html
http://otherserver.com/paul.html
https://paul.com/paul.html
http://myserver.com/dir/paul.html
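
One assumption urljoin still leaves to the caller is the shape of base. The sketch below uses hypothetical base URLs (none of them come from the question) and Python 3's urllib.parse to show how a path in the base, and whether it ends in a slash, changes where a relative link lands.

```python
from urllib.parse import urljoin  # Python 3 location of urljoin

# Hypothetical base URLs, chosen only for illustration.
base_file = 'http://myserver.com/docs/index.html'
base_dir = 'http://myserver.com/docs/'
base_no_slash = 'http://myserver.com/docs'

# A relative reference replaces the last path segment of the base.
from_file = urljoin(base_file, 'paul.html')
from_dir = urljoin(base_dir, 'paul.html')
from_no_slash = urljoin(base_no_slash, 'paul.html')

# A root-relative reference discards the base path entirely.
from_root = urljoin(base_dir, '/paul.html')

print(from_file)      # http://myserver.com/docs/paul.html
print(from_dir)       # http://myserver.com/docs/paul.html
print(from_no_slash)  # http://myserver.com/paul.html
print(from_root)      # http://myserver.com/paul.html
```

Without the trailing slash, docs is treated as a file name and gets replaced, so a base that points at a directory should probably end in /.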

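【Discussion】:

You have to keep in mind that we don't know whether item is a relative or an absolute link, or whether item is even a link at all.
Does my edit clarify the first point? If it isn't a link, what do you want the function to do?
Why use a regular expression instead of the Python string method startswith? I may be missing the point.
Sorry, that comment was meant for the post below.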
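
One possible answer to the "what if it isn't a link" question is to resolve first and validate afterwards. The following is only a sketch under assumptions of my own: the name resolve_link is hypothetical, base is passed as a parameter instead of being a global, and "not a link" is taken to mean "does not resolve to an http or https URL with a host".

```python
from urllib.parse import urljoin, urlparse  # Python 3


def resolve_link(base, item):
    """Resolve item against base; return None if the result is not an http(s) URL.

    resolve_link is a hypothetical helper, not part of any library.
    """
    # Reject non-strings and empty or whitespace-only input.
    if not isinstance(item, str) or not item.strip():
        return None
    url = urljoin(base, item.strip())
    parts = urlparse(url)
    # Assumption: only http/https URLs with a host count as usable links.
    if parts.scheme not in ('http', 'https') or not parts.netloc:
        return None
    return url


print(resolve_link('http://myserver.com', 'dir/paul.html'))
print(resolve_link('http://myserver.com', 'mailto:paul@example.com'))
```

The first call prints http://myserver.com/dir/paul.html and the second prints None. A string without a scheme, such as plain text, is indistinguishable from a relative path, so this check cannot reject it; that decision has to come from wherever item was extracted.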

【Solution 2】:

1- Use a regular expression

2- Add a trailing / to your base URL

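import re

base = 'http://www.example.com/'

def get_url(item):
    # Raw string so that \w and \. reach the regex engine unchanged.
    # Accepts strings that start with http:// or https:// followed by a dotted host name.
    pattern = r"(http|https)://[\w\-]+(\.[\w\-]+)+\S*"
    # absolute link: re.match anchors at the start of the string, whereas
    # re.search would also accept a URL embedded later in item
    if re.match(pattern, item):
        url = item
    # root-relative or relative link: base already ends with /
    else:
        url = base + item.lstrip('/')
    return url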

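【Discussion】:

Why use a regular expression instead of the Python string method startswith? I may be missing the point.
@RickyWilson: When you want to match a string pattern, it is better to use a regular expression than string methods; the code ends up shorter and clearer.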
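
To make the startswith versus regex comparison concrete, here is a minimal sketch of both checks side by side. The helper names and the simplified pattern are assumptions for illustration, not the code from either answer. It also shows why the choice between re.match and re.search matters for this task.

```python
import re

# Simplified, hypothetical pattern: only checks the scheme prefix.
ABSOLUTE_RE = re.compile(r'https?://', re.IGNORECASE)


def is_absolute_startswith(item):
    # startswith accepts a tuple of prefixes; lower() makes the check case-insensitive.
    return item.lower().startswith(('http://', 'https://'))


def is_absolute_regex(item):
    # match() only succeeds at the beginning of the string.
    return ABSOLUTE_RE.match(item) is not None


embedded = '/redirect?to=http://paul.com'

print(is_absolute_startswith('HTTP://paul.com'))  # True
print(is_absolute_regex('HTTP://paul.com'))       # True
print(is_absolute_startswith(embedded))           # False
print(is_absolute_regex(embedded))                # False
# search() scans the whole string, so it finds the embedded URL.
print(ABSOLUTE_RE.search(embedded) is not None)   # True
```

For a plain prefix test the two approaches give the same answers here, so the choice is mostly about readability; a regex earns its place once the check has to validate more than the prefix.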