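I wrote a crawler myself, and I hope it helps you. Basically, I prepend the website's address to relative links such as /2/2/3/index.php, which turns them into http://www.website.com/2/2/3/index.php. Then I put every URL into an array and check whether I have visited it before; if I have, the crawler does not go there again. Also, if the site links to unrelated sites, for example a YouTube video, the crawler does not crawl YouTube or any other site that is not part of the website.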
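A minimal sketch of that idea, assuming Python 3, where `urljoin` and `urlparse` live in `urllib.parse` (the script further down uses the Python 2 `urlparse` module instead). The helper names `resolve_link` and `is_same_site` are made up for illustration and are not part of my script.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(page_url, href):
    """Turn a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


def is_same_site(root_url, candidate_url):
    """Return True when candidate_url is on the same host as root_url."""
    return urlparse(candidate_url).netloc == urlparse(root_url).netloc


root = "http://www.website.com"
link = resolve_link(root, "/2/2/3/index.php")
print(link)
print(is_same_site(root, link))
print(is_same_site(root, "https://www.youtube.com/watch?v=abc"))
```

Comparing hosts is stricter than the substring test `website_url in new_website_url` used in the script, which would also accept an external URL that merely contains the site's address somewhere in its query string.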
For your question, I suggest putting every visited URL into an array and checking that array with a for loop. If a URL is already in the array, print it.
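That check could look roughly like this, assuming Python 3 and a hypothetical helper name `find_revisited`. Note that it swaps the for loop over an array for a set: the membership test does the same job, but it does not have to scan the whole list for every URL.

```python
def find_revisited(urls):
    """Return the URLs that show up again after their first occurrence, in order."""
    visited = set()
    revisited = []
    for url in urls:
        if url in visited:
            revisited.append(url)  # seen before: report it
        else:
            visited.add(url)  # first time: remember it
    return revisited


crawl_order = [
    "http://www.website.com/",
    "http://www.website.com/about",
    "http://www.website.com/",
]
for url in find_revisited(crawl_order):
    print(url)
```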
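I am not sure whether this is what you wanted, but at least I tried. I did not use BeautifulSoup and it still works, so consider putting that module aside.
My script (more like a part of it; I also have exception checks, so don't panic):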
__author__ = "Sploit"
# Import the standard Python modules and the modules that the user has to download
# If a module does not exist, the script asks the user to install that specific module
import os  # Portable way of using operating-system-dependent functionality
import urllib  # Simple interface for network resource access
import urllib2  # Opens URLs and reports the HTTP status code
import time  # Various time-related functions
import urlparse  # Standard interface to break URL strings up into components,
# to combine the components back into a URL string, and to convert a relative URL to an absolute URL given a base URL
import mechanize
print ("Which website would you like to crawl?")
website_url = raw_input("--> ")
# Add http:// to the given URL, because the server response can only be checked when a scheme is present
# If the user adds a path to the URL, it is removed
# Example: 'https://moz.com/learn/seo/external-link' becomes 'https://moz.com'
if website_url.split('//')[0] != 'http:' and website_url.split('//')[0] != 'https:':
    website_url = 'http://' + website_url
website_url = website_url.split('/')[0] + '//' + website_url.split('/')[2]
# The user stays in this loop until a valid website is given; validity is checked over HTTP (the application layer of the OSI model)
while True:
    try:
        if urllib2.urlopen(website_url).getcode() != 200:
            print ("Invalid URL given. Which website would you like to crawl?")
            website_url = raw_input("--> ")
        else:
            break
    except Exception:  # Malformed or unreachable URL; Ctrl-C still interrupts the script
        print ("Invalid URL given. Which website would you like to crawl?")
        website_url = raw_input("--> ")
# This part is the actual web crawler
# It searches every page for links
# All URLs that do not belong to the website are written to a txt file named "Non website links"
fake_browser = mechanize.Browser()  # Initialize the mechanize browser object used by the spider
urls = [website_url]  # URLs the script still has to go through, starting with the given website
visited = [website_url]  # URLs we have already seen, to avoid duplicates
text_file = open("Non website links.txt", "w")  # txt file for all URLs that do not belong to the website
text_file_url = open("Website links.txt", "w")  # txt file for all URLs that belong to the website
print ("Crawling : " + website_url)
print ("The crawler started at " + time.asctime(time.localtime()) + ". This may take a couple of minutes")  # Let the user know when the crawler started
# The number of URLs in the list is dynamic, so the spider keeps going until no page yields new URLs
while len(urls) > 0:
    current_url = urls.pop(0)  # Remove the URL from the queue first, so it is popped exactly once even if an error occurs
    try:
        fake_browser.open(current_url)
        for link in fake_browser.links():  # Loop over all the links on the page
            new_website_url = urlparse.urljoin(link.base_url, link.url)  # Build an absolute URL from the link
            if new_website_url not in visited and website_url in new_website_url:  # New URL on this website: queue it
                visited.append(new_website_url)
                urls.append(new_website_url)
                print ("Found: " + new_website_url)  # Print every link the crawler found
                text_file_url.write(new_website_url + '\n')  # Write the website URL to the txt file
            elif new_website_url not in visited and website_url not in new_website_url:  # New URL outside this website: record it, do not crawl it
                visited.append(new_website_url)
                text_file.write(new_website_url + '\n')  # Write the non-website URL to the txt file
    except Exception:
        print ("Link couldn't be opened")
text_file.close()  # Close the txt file so nothing more is written to it
text_file_url.close()  # Close the txt file so nothing more is written to it
print ("A txt file with all the website links has been created in your folder")
print ("Finished!!")
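
For comparison, the same crawl loop can be sketched without mechanize or network access. This is only an illustration in Python 3, not the script above: `crawl`, `get_links`, and the in-memory `fake_site` dictionary are hypothetical names, and a real crawler would replace `get_links` with a function that downloads the page and extracts its links.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, get_links):
    """Breadth-first crawl; returns (internal, external) URL lists in discovery order."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])  # pages still to open
    visited = {start_url}  # every URL seen so far
    internal, external = [], []
    while queue:
        page = queue.popleft()
        try:
            hrefs = get_links(page)
        except Exception:
            continue  # page could not be opened; move on to the next one
        for href in hrefs:
            url = urljoin(page, href)  # make relative links absolute
            if url in visited:
                continue
            visited.add(url)
            if urlparse(url).netloc == host:
                internal.append(url)
                queue.append(url)  # same site: crawl it later
            else:
                external.append(url)  # other site: record it, do not crawl it
    return internal, external


# Hypothetical three-page site used instead of real HTTP requests
fake_site = {
    "http://www.website.com": ["/a", "https://www.youtube.com/watch?v=abc"],
    "http://www.website.com/a": ["/b", "/a", "http://www.website.com"],
    "http://www.website.com/b": [],
}

internal, external = crawl("http://www.website.com", lambda url: fake_site[url])
print(internal)
print(external)
```

Taking each page off the queue before opening it means a failed page is dropped exactly once, and the set keeps the "have I been here?" test cheap as the number of URLs grows.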