Python - Requests module, getting the domain name? Answers

【Question title】: Python - Requests module, getting the domain name?
【Posted】: 2016-01-04 21:01:05
【Question】:

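I'm trying to build a web crawler using the requests module. Basically, I want it to go to a web page, grab all the hrefs, and write them to a text file.

So far my code looks like this: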

import requests
from bs4 import BeautifulSoup

def getLinks(url):
    response = requests.get(url).text
    soup = BeautifulSoup(response, "html.parser")
    for link in soup.findAll("a"):
        print("Link:" + str(link.get("href")))

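This works on some websites, but on the one I'm trying to use it on, the href isn't a full domain name like "www.google.com". Instead it's... a path to a directory that redirects to the link?

It looks like this: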

href="/out/101"

If I write these to a file, it looks like this:

 1. /out/101
 2. /out/102
 3. /out/103
 4. /out/104

That's not really what I want.

How would I go about getting the domain name from these links?

【Comments】:

Tags: python python-requests

【Solution 1】:

This means the URLs are relative to the current one. To get the full URLs, use urljoin():

from urllib.parse import urljoin  # Python 3 (in Python 2: from urlparse import urljoin)

for link in soup.findAll("a"): 
    full_url = urljoin(url, link.get("href"))
    print("Link:" + full_url)
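
To make the answer above concrete, here is a minimal sketch of how urljoin() resolves hrefs like the ones in the question, and how urlparse() can then pull out the domain part. The page URL (example.com) and the helper name resolve() are made up for illustration; they are not from the original post.

```python
from urllib.parse import urljoin, urlparse

def resolve(page_url, href):
    """Return (absolute URL, domain) for an href found on page_url."""
    full_url = urljoin(page_url, href)   # resolve relative to the page
    domain = urlparse(full_url).netloc   # host part of the absolute URL
    return full_url, domain

# Hypothetical page URL, used only for illustration
page_url = "http://example.com/links/index.html"

for href in ["/out/101", "/out/102", "http://www.google.com/search"]:
    full_url, domain = resolve(page_url, href)
    print(full_url, domain)
```

Note that for "/out/101" this yields the domain of the page doing the redirecting, not the site it eventually redirects to.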

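【Comments】:

Yes, but that only gives the full URL of the page that redirects to the actual site. How do I get the URL of the site it redirects to? :P
@stav Make a request to it and read response.url. If you need to record the redirect chain, see stackoverflow.com/questions/20475552/….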
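
As a rough illustration of the comment above: a requests response exposes the final address as response.url and any intermediate redirect responses as response.history (oldest first). The helper name redirect_chain() below is hypothetical, and the sketch only assumes an object with those two attributes.

```python
def redirect_chain(response):
    """Return every URL visited, from the first request to the final page.

    `response` is expected to look like a requests.Response:
    .history holds the intermediate redirect responses (oldest first)
    and .url is the address of the final response.
    """
    return [hop.url for hop in response.history] + [response.url]
```

With requests this might be called as redirect_chain(requests.get(full_url)); GET requests follow redirects by default, so the last element is the site the "/out/101" style link ends up on.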

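【Solution 2】:

Try the code below. It will give you all the links on a page. If you know the site's base URL, you can build the full form of every other URL from it. The complete web scraping code is here: WebScrape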

import requests
import lxml.html

def extractLinks(url, base):
    '''
    Return links from the website
    :param url: URL of the page to fetch
    :param base: base URL of the site, used to complete relative links
    :return: list of links
    '''
    links = []  # will contain all the links found on the page
    try:
        r = requests.get(url)
    except requests.RequestException:
        # network failure, invalid URL, etc.
        return []
    obj = lxml.html.fromstring(r.text)
    potential_links = obj.xpath("//a/@href")
    links.append(r.url)  # final URL of the page, after any redirects
    for link in potential_links:
        if base in link or link.startswith("http"):
            # already an absolute URL
            links.append(link)
        elif base.endswith("/"):
            # relative link: prepend the base without doubling the slash
            # (relative links are skipped if base has no trailing slash)
            links.append(base + link.lstrip("/"))

    return links

extractLinks('http://data-interview.enigmalabs.org/companies/',
    'http://data-interview.enigmalabs.org/')
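
As an alternative to the manual slash handling above, the same idea can be sketched with urljoin() from the first answer, which also copes with a base that has no trailing slash. The helper name absolute_links() and the sample hrefs are assumptions made for illustration; only the base URL comes from the answer.

```python
from urllib.parse import urljoin, urlparse

def absolute_links(page_url, hrefs):
    """Resolve hrefs against page_url, keeping unique http(s) links in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip missing hrefs and same-page anchors
        full_url = urljoin(page_url, href)
        if urlparse(full_url).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if full_url not in seen:
            seen.add(full_url)
            result.append(full_url)
    return result

# The page URL is the one used in the answer; the hrefs are made up.
links = absolute_links(
    "http://data-interview.enigmalabs.org/companies/",
    ["/companies/1", "page2.html", "mailto:someone@example.com", "#top", "/companies/1"],
)
print(links)
```

The hrefs passed in could come from obj.xpath("//a/@href") or from link.get("href") in the question's BeautifulSoup loop.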

【Comments】: