How can I remove 'www.' from the original URL through [urllib] parse in Python?

【Question title】: How can I remove 'www.' from original URL through [urllib] parse in python?
【Posted】: 2021-10-01 01:21:16
【Question description】:

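Original URL ▶ https://www.exeam.org/index.html

I want to extract exeam.org/ or exeam.org from the original URL.

To do this, I used urllib, the most powerful parser I know of in Python, but unfortunately urllib (url.scheme, url.netloc ...) does not give me the format I want.

【Discussion】: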

'.'.join(urlparse('https://www.exeam.org/index.html').netloc.split('.')[1:]) stackoverflow.com/questions/44113335/…
What does "not only the original URL of the Inquiry but also the majority" mean? Sorry, I don't understand.

Tags: python parsing url urllib

【Solution 1】:

Use urllib to extract the domain name from the URL:

from urllib.parse import urlparse
surl = "https://www.exam.org/index.html"
urlparsed = urlparse(surl)
# network location from parsed url
print(urlparsed.netloc)
# ParseResult Object
print(urlparsed)

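This gives you www.exam.org, but if you are only after the exam.org part, you will want to break it down further into the registered domain. Besides doing a simple split (which may be enough), you can also use a library such as tldextract, which knows how to parse subdomains, suffixes, and so on: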

from tldextract import extract  # third-party package: pip install tldextract

ext = extract(surl)
print(ext.registered_domain)

This produces:

exam.org
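
For the "simple split" case mentioned above, here is a minimal standard-library sketch. The helper name strip_www is hypothetical (it is not part of urllib), and the sketch assumes the only subdomain you need to drop is a leading "www." label; it does not know about other subdomains or multi-part suffixes the way tldextract does.

```python
from urllib.parse import urlparse


def strip_www(url: str) -> str:
    """Return the host of `url` without a leading 'www.' label.

    Hypothetical helper for illustration; assumes `url` includes a scheme
    (e.g. 'https://'), otherwise urlparse finds no host and '' is returned.
    """
    # .hostname is lowercased and excludes any port or user info
    host = urlparse(url).hostname or ""
    # Drop the prefix only when the first label is exactly "www"
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(strip_www("https://www.exam.org/index.html"))  # exam.org
```

Unlike dropping the first dot-separated label unconditionally, this leaves a host such as exam.org untouched instead of reducing it to org.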

【Discussion】: