Check if a URL is relative to another (i.e. they are on the same host): answers

【Question title】: Check if an URL is relative to another (ie. they are on the same host)
【Posted】: 2015-08-31 15:02:46
【Question】:

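I have a base HTTP URL and a list of other HTTP URLs. As a learning exercise (so there is no need to suggest ready-made tools), I'm writing a simple crawler/link checker that checks the base URL for broken links and recursively crawls all other "internal" pages (i.e. pages linked from the base URL within the same site) with the same intent. In the end, I have to output the list of links with their status (external/internal), along with a warning for each link that is actually internal but is written as an absolute URL.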

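So far I have succeeded in checking all the links and crawling with the requests and BeautifulSoup libraries, but I can't find a ready-made way to check whether two absolute URLs point to the same site (other than splitting the URLs along the slashes, which seems ugly to me). Is there a well-known library for this?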

【Question discussion】:

Do you mean something like urlparse? docs.python.org/2/library/urlparse.html
Exactly! I wonder how I missed it. I'll post my solution tomorrow morning (CEST). Thanks!

Tags: python url

【Solution 1】:

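In the end I went with urlparse (thanks to @padraic-cunningham for pointing me to it). At the beginning of the code, I parse the "base URL" (i.e. the one I start crawling from):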

base_parts = urlparse.urlparse(base_url)

Then, for each link I find (e.g. inside for a in soup.find_all('a'):):

link_parts = urlparse.urlparse(a.get('href'))

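At this point I also have to compare the URL schemes (I treat links to the same site with a different URL scheme, http vs. https, as different; I may make this comparison optional in the future):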

internal = base_parts.scheme == link_parts.scheme \
           and base_parts.netloc == link_parts.netloc

After this, internal is True if the link points to the same server (with the same scheme) as my base URL. You can see the final result here.
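To tie the steps above together, here is a minimal sketch of how the comparison could be wrapped into one function that also reports the "internal but absolute" case from the question. It is not the code from the linked result. It assumes Python 3, where the urlparse module lives in urllib.parse, and the function name classify_link, its compare_scheme flag and the three status strings are hypothetical names chosen for illustration. Note that netloc also contains any port or user info, so this compares more than the bare host name.

```python
from urllib.parse import urljoin, urlparse


def classify_link(base_url, href, compare_scheme=True):
    """Return 'internal-relative', 'internal-absolute' or 'external'.

    Hypothetical helper: the name, flag and return values are assumptions.
    """
    base = urlparse(base_url)
    link = urlparse(href)

    # A plain relative link ("/blog/", "page.html") has neither scheme nor host.
    if not link.scheme and not link.netloc:
        return "internal-relative"

    # Resolve against the base so scheme-relative links ("//host/path") work too.
    resolved = urlparse(urljoin(base_url, href))
    same_host = resolved.netloc.lower() == base.netloc.lower()
    same_scheme = resolved.scheme == base.scheme

    if same_host and (same_scheme or not compare_scheme):
        # Same site, but written with an explicit host: worth a warning.
        return "internal-absolute"
    return "external"


if __name__ == "__main__":
    base = "http://example.com/"
    for href in ["/blog/", "http://example.com/blog", "http://other.example.org/"]:
        print(href, "->", classify_link(base, href))
```

Passing compare_scheme=False would treat http and https links to the same host as the same site, which is the optional behaviour mentioned above.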

【Discussion】:

【Solution 2】:

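I wrote a crawler myself; I hope this helps you. Basically, what I do is prepend the site's address to paths such as /2/2/3/index.php, which turns them into http://www.website.com/2/2/3/index.php. Then I put all the URLs into an array and check whether I have visited each one before; if I have, the crawler doesn't visit it again. Also, if the site contains links to unrelated websites, such as a link to a YouTube video, the crawler won't crawl YouTube or any other site that is not part of the website.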

For your problem, I suggest putting all visited URLs into an array and checking that array with a for loop. If a URL is already in the array, print it.
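The visited-list idea can be sketched on its own, separate from any HTTP code. This is only an illustration under assumptions: it uses a set instead of an array with a for loop (a swapped-in choice, since membership tests on a set do not scan every element), and crawl_order and get_links are hypothetical names, with get_links standing in for whatever fetches a page and returns its links.

```python
from collections import deque


def crawl_order(start_url, get_links):
    """Visit every reachable URL once, breadth first, and return the visit order.

    get_links is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the URLs linked from that page.
    """
    visited = {start_url}        # URLs that were already queued at some point
    queue = deque([start_url])   # URLs still waiting to be crawled
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:  # Skip anything seen before
                visited.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    # A tiny fake site: each key links to the pages in its list.
    fake_site = {"/": ["/a", "/b"], "/a": ["/", "/b"], "/b": []}
    print(crawl_order("/", lambda url: fake_site.get(url, [])))
```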

I'm not sure this is what you want, but at least I tried. I didn't use BeautifulSoup and it still works, so consider putting that module aside.

My script (more like a part of it; I also have exception checks, so don't panic):

__author__ = "Sploit"


# This part imports the standard Python modules and the modules that the user has to download
# If a module does not exist, the script asks the user to install that specific module

import os  # This module provides a portable way of using operating system dependent functionality
import urllib  # The urllib module provides a simple interface for network resource access
import urllib2  # The urllib2 module provides a simple interface for network resource access
import time  # This module provides various time-related functions
import urlparse  # This module defines a standard interface to break URL strings up in components
                 # to combine the components back into a URL string, and to convert a relative URL to an absolute URL given a base URL.
import mechanize

print ("Which website would you like to crawl?")
website_url = raw_input("--> ")

# Adds http:// to the given URL, because a scheme is needed to check for a server response
# Any path that the user adds to the URL is removed
# Example: 'https://moz.com/learn/seo/external-link' will turn into 'https://moz.com'
if website_url.split('//')[0] != 'http:' and website_url.split('//')[0] != 'https:':
    website_url = 'http://' + website_url
website_url = website_url.split('/')[0] + '//' + website_url.split('/')[2]

# The user stays in this loop until a valid website is given; it is checked over HTTP, the application layer of the OSI model
while True:
    try:
        if urllib2.urlopen(website_url).getcode() != 200:
            print ("Invalid URL given. Which website would you like to crawl?")
            website_url = raw_input("--> ")
        else:
            break
    except Exception:  # A bare except would also swallow Ctrl-C and trap the user in the loop
        print ("Invalid URL given. Which website would you like to crawl?")
        website_url = raw_input("--> ")

# This part is the actual web crawler
# It searches every page for links
# All the URLs that do not belong to the website are written to a txt file named "Non website links.txt"


fake_browser = mechanize.Browser()  # Initialize a mechanize browser object for the spider
urls = [website_url]  # Queue of URLs the script still has to go through, starting with the website itself
visited = [website_url]  # List of URLs that were already seen, to avoid duplicates
text_file = open("Non website links.txt", "w")  # Txt file for all the URLs that do not belong to the website
text_file_url = open("Website links.txt", "w")  # Txt file for all the URLs that belong to the website

print ("Crawling : " + website_url)
print ("The crawler started at " + time.asctime(time.localtime()) + ". This may take a couple of minutes")  # To let the user know when the crawler started to work
# Since the number of URLs in the list is dynamic, the spider keeps going until the last URL yields no new ones
while len(urls) > 0:
    current_url = urls.pop(0)  # Remove the URL from the queue first, so it is removed exactly once even if opening it fails
    try:
        fake_browser.open(current_url)
        for link in fake_browser.links():  # Loop over all the links on the page
            new_website_url = urlparse.urljoin(link.base_url, link.url)  # Turn the link into an absolute URL
            if new_website_url not in visited and website_url in new_website_url:  # New URL on this website: queue it
                visited.append(new_website_url)
                urls.append(new_website_url)
                print ("Found: " + new_website_url)  # Print every website link that the crawler found
                text_file_url.write(new_website_url + '\n')  # Write the website URL to the txt file
            elif new_website_url not in visited and website_url not in new_website_url:
                visited.append(new_website_url)
                text_file.write(new_website_url + '\n')  # Write the non-website URL to the txt file
    except Exception:
        print ("Link couldn't be opened")

text_file.close()  # Close the txt file, to prevent anymore writing to it
text_file_url.close()  # Close the txt file, to prevent anymore writing to it
print ("A txt file with all the website links has been created in your folder")
print ("Finished!!")
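One thing to watch in the script above: website_url in new_website_url is a substring test, so a URL on a different host can still pass it if it merely contains the site's address. The sketch below shows that case and a comparison of the parsed host instead. It assumes Python 3 (urllib.parse rather than the Python 2 urlparse module), and same_site and the example.org lookalike address are made up for illustration.

```python
from urllib.parse import urlparse


def same_site(site_url, candidate_url):
    """Compare the parsed network locations instead of searching for a substring."""
    return urlparse(site_url).netloc.lower() == urlparse(candidate_url).netloc.lower()


site = "http://www.website.com"
lookalike = "http://www.website.com.example.org/page"  # A different host that starts with the site's name
internal = "http://www.website.com/2/2/3/index.php"

print(site in lookalike)             # True: the substring test accepts the other host
print(same_site(site, lookalike))    # False: the hosts differ
print(same_site(site, internal))     # True
```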

【Discussion】:

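I have already done this to some extent, except that you use mechanize instead of requests+BeautifulSoup (which I'll also consider). My goal now is to check whether a link is internal but absolute, i.e. an absolute "gergely.polonkai.eu/blog" link rather than a plain "/blog/" link on my site.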