【Question title】: Check if a URL is relative to another (i.e. they are on the same host)
【Posted】: 2015-08-31 15:02:46
【Question】:

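I have a base HTTP URL and a list of other HTTP URLs. I am writing a simple crawler/link checker as a learning exercise (so there is no need to suggest pre-written tools). It checks the base URL for broken links and recursively crawls all other "internal" pages (i.e. pages linked from the base URL within the same site) with the same intent. At the end, I must output the list of links with their status (external/internal), plus a warning for each link that is actually internal but appears as an absolute URL.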

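So far I have succeeded in checking all the links and crawling with the requests and BeautifulSoup libraries, but I cannot find an already written method to check whether two absolute URLs point to the same site (other than splitting the URLs along the slashes, which seems ugly to me). Is there a well-known library for this?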

【Comments】:

Tags: python url


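【Answer 1】:

In the end I went with urlparse (thanks to @padraic-cunningham for pointing me to it). At the beginning of my code I parse the "base URL" (i.e. the one I start crawling from):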

base_parts = urlparse.urlparse(base_url)

Then, for each link I find (e.g. for a in soup.find_all('a')):

link_parts = urlparse.urlparse(a.get('href'))

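At this point I have to compare the URL schemes (I consider links that point to the same site with a different URL scheme, http or https, to be different; I may make this comparison optional in the future):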

internal = base_parts.scheme == link_parts.scheme \
           and base_parts.netloc == link_parts.netloc

At this point, internal will be True if the link points to the same server (with the same scheme) as my base URL. You can see the final result here.

【Comments】:
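
The comparison above can be wrapped into a small helper that also produces the three labels the question asks for. This is a minimal sketch, not the answerer's final code: it assumes Python 3, where the `urlparse` module lives at `urllib.parse`, and the names `same_site`, `classify_link`, and `compare_scheme` are hypothetical. Relative hrefs are first resolved with `urljoin`, because `urlparse` on a bare `/blog/` returns an empty scheme and netloc.

```python
from urllib.parse import urljoin, urlparse


def same_site(base_url, link_url, compare_scheme=True):
    """Return True if two absolute URLs share a host (and, optionally, a scheme)."""
    base = urlparse(base_url)
    link = urlparse(link_url)
    if compare_scheme and base.scheme != link.scheme:
        return False
    # Host names are case-insensitive. Note that netloc also carries any
    # explicit port, so "example.com:80" and "example.com" compare as different.
    return base.netloc.lower() == link.netloc.lower()


def classify_link(base_url, href, compare_scheme=True):
    """Label an href found on a page served from base_url (hypothetical helper)."""
    absolute = urljoin(base_url, href)  # resolves relative hrefs, keeps absolute ones
    if not same_site(base_url, absolute, compare_scheme):
        return "external"
    # An href that names its own host is absolute even though it is internal;
    # this is the case the question wants a warning for.
    if urlparse(href).netloc:
        return "internal-absolute"
    return "internal"


base_url = "http://gergely.polonkai.eu/"
print(classify_link(base_url, "/blog/"))                           # internal
print(classify_link(base_url, "http://gergely.polonkai.eu/blog"))  # internal-absolute
print(classify_link(base_url, "http://www.youtube.com/watch"))     # external
```

Passing `compare_scheme=False` makes the scheme check optional, so an `https` link to the same host would be reported as internal rather than external.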

    【Answer 2】:

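    I wrote a crawler myself; I hope it helps you. Basically, what I do is prepend the website's address to links such as /2/2/3/index.php, which turns them into http://www.website.com/2/2/3/index.php. Then I put all the pages into an array and check whether I have visited a page before; if I have, the crawler does not go there again. Also, if the site links to unrelated websites, such as a YouTube video, it does not crawl YouTube or any other website that is not part of the site.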

    For your problem, I suggest putting all visited pages into an array and checking that array with a for loop. If a URL is already in the array, print it.
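
The visited-list idea described above can be sketched without any network access. This is an illustration under stated assumptions, not the answerer's script: it uses Python 3 (`urllib.parse`), replaces the array plus for-loop scan with a `set`, and compares `netloc` values instead of the substring test the script below uses. The names `crawl`, `get_links`, and `PAGES` are hypothetical, and `PAGES` is a made-up in-memory site standing in for real HTTP requests.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, get_links):
    """Breadth-first crawl; get_links(url) returns the raw hrefs found on that page."""
    site = urlparse(start_url).netloc
    visited = {start_url}        # set lookup avoids scanning a list for every link
    queue = deque([start_url])
    internal, external = [], []
    while queue:
        page = queue.popleft()
        for href in get_links(page):
            # "/2/2/3/index.php" becomes "http://www.website.com/2/2/3/index.php"
            url = urljoin(page, href)
            if url in visited:
                continue
            visited.add(url)
            if urlparse(url).netloc == site:
                internal.append(url)
                queue.append(url)    # only pages on the same host are crawled further
            else:
                external.append(url)
    return internal, external


# Hypothetical in-memory site used instead of real requests
PAGES = {
    "http://www.website.com/": ["/2/2/3/index.php", "http://www.youtube.com/watch"],
    "http://www.website.com/2/2/3/index.php": ["/", "about.html"],
}

internal, external = crawl("http://www.website.com/", lambda url: PAGES.get(url, []))
print(internal)
print(external)
```

In a real crawler, `get_links` would fetch the page and extract its anchors, for example with requests and BeautifulSoup as in the question.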

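    I'm not sure this is what you want, but at least I tried. I did not use BeautifulSoup and it still works, so consider putting that module aside.

    My script (or rather a part of it; I also have exception checks, so don't panic):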

    __author__ = "Sploit"
    
    
    # Import the standard Python modules and the third-party modules that the user has to install
    # If a module does not exist, the script asks the user to install that specific module
    
    import os  # This module provides a portable way of using operating system dependent functionality
    import urllib  # The urllib module provides a simple interface for network resource access
    import urllib2  # The urllib2 module provides a simple interface for network resource access
    import time  # This module provides various time-related functions
    import urlparse  # This module defines a standard interface to break URL strings up in components
                     # to combine the components back into a URL string, and to convert a relative URL to an absolute URL given a base URL.
    import mechanize
    
    print ("Which website would you like to crawl?")
    website_url = raw_input("--> ")
    
    # Add http:// to the given URL if it has no scheme, because the server can only be checked with a full URL
    # Any path the user adds to the URL is stripped
    # Example: 'https://moz.com/learn/seo/external-link' becomes 'https://moz.com'
    if website_url.split('//')[0] != 'http:' and website_url.split('//')[0] != 'https:':
        website_url = 'http://' + website_url
    website_url = website_url.split('/')[0] + '//' + website_url.split('/')[2]
    
    # Keep asking until the given website answers with HTTP status 200
    while True:
        try:
            if urllib2.urlopen(website_url).getcode() == 200:
                break
        except Exception:  # Unreachable host or HTTP error; Ctrl-C still interrupts the script
            pass
        print ("Invalid URL given. Which website would you like to crawl?")
        website_url = raw_input("--> ")
    
    # This part is the actual web crawler
    # It searches every page it opens for links
    # All URLs that do not belong to the website are written to a txt file named "Non website links"
    
    
    fake_browser = mechanize.Browser()  # Initialize the mechanize browser object that the spider uses
    urls = [website_url]  # Queue of URLs that the script still has to go through
    visited = [website_url]  # URLs we have already seen, to avoid duplicates
    text_file = open("Non website links.txt", "w")  # txt file for all the URLs that do not belong to the website
    text_file_url = open("Website links.txt", "w")  # txt file for all the URLs that belong to the website
    
    print ("Crawling : " + website_url)
    print ("The crawler started at " + time.asctime(time.localtime()) + ". This may take a couple of minutes")  # To let the user know when the crawler started to work
    # The number of URLs in the queue is dynamic, so the spider runs until no page yields a new URL on the website
    while len(urls) > 0:
        current_url = urls.pop(0)  # Take the URL off the queue first, so a failure can never pop a second one
        try:
            fake_browser.open(current_url)
            for link in fake_browser.links():  # Loop over all the links on the page
                new_website_url = urlparse.urljoin(link.base_url, link.url)  # Resolve the link against the page URL to get an absolute URL
                if new_website_url not in visited and website_url in new_website_url:  # Internal URL we have not seen yet: queue it
                    visited.append(new_website_url)
                    urls.append(new_website_url)
                    print ("Found: " + new_website_url)  # Print all the links that the crawler found
                    text_file_url.write(new_website_url + '\n')  # Write the website URL to the txt file
                elif new_website_url not in visited and website_url not in new_website_url:  # External URL: record it but do not crawl it
                    visited.append(new_website_url)
                    text_file.write(new_website_url + '\n')  # Write the non-website URL to the txt file
        except Exception:
            print ("Link couldn't be opened")
    
    text_file.close()  # Close the txt file, to prevent anymore writing to it
    text_file_url.close()  # Close the txt file, to prevent anymore writing to it
    print ("A txt file with all the website links has been created in your folder")
    print ("Finished!!")
    

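    【Comments】:

    • To some extent I have already done this, except that you use mechanize instead of requests+BeautifulSoup (which I will also consider). My goal now is to check whether a link is internal yet absolute, i.e. an absolute "gergely.polonkai.eu/blog" link instead of a plain "/blog/" link on my site.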