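【Posted】: 2019-03-20 03:48:32
【Question】:
I am building a web crawler that follows only internal links, using just requests and bs4.
I have a rough working version below, but I am not sure how to properly check whether a link has already been crawled.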
import re
import time
import requests
import argparse
from bs4 import BeautifulSoup

internal_links = set()

def crawler(new_link):
    html = requests.get(new_link).text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        if "href" in link.attrs:
            print(link)
            if link.attrs["href"] not in internal_links:
                new_link = link.attrs["href"]
                print(new_link)
                internal_links.add(new_link)
                print("All links found so far, ", internal_links)
                time.sleep(6)
                crawler(new_link)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('url', help='Pass the website url you wish to crawl')
    args = parser.parse_args()
    url = args.url
    # Check full url has been passed otherwise requests will throw error later
    try:
        crawler(url)
    except:
        if url[0:4] != 'http':
            print('Please try again and pass the full url eg http://example.com')

if __name__ == '__main__':
    main()
These are the last few lines of the output:
All links found so far, {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
<a href="http://quotes.toscrape.com/search.aspx">ViewState</a>
http://quotes.toscrape.com/search.aspx
All links found so far, {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/search.aspx', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
<a href="http://quotes.toscrape.com/random">Random</a>
http://quotes.toscrape.com/random
All links found so far, {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/search.aspx', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/random', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
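Note that the set above holds both 'http://quotes.toscrape.com' and 'http://quotes.toscrape.com/', so the same page is counted twice. One possible way to make the "already crawled" check more reliable is to normalize each URL before adding it to the set. The following is only a minimal sketch using the standard library; `normalize` and `visited` are hypothetical names that do not appear in the code above, and the normalization rules (empty path becomes `/`, fragment dropped, scheme and host lower-cased) are assumptions that may need adjusting for a real site.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Return a canonical form of url to use as the 'visited' key (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path or "/"  # assumption: an empty path means the root page
    # Lower-case scheme and host, keep the query, drop the #fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

visited = set()
for url in [
    "http://quotes.toscrape.com",
    "http://quotes.toscrape.com/",
    "http://quotes.toscrape.com/#top",
]:
    visited.add(normalize(url))

print(visited)  # {'http://quotes.toscrape.com/'}
```

With this in place, the membership test would become `normalize(href) not in visited` instead of comparing raw `href` strings.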
So it works, but only up to a point, after which it no longer seems to follow links.
I am fairly sure the cause is this line:
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
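because it only finds links that start with http://, and on many internal pages the links do not. But when I try it like this:
for link in soup.find_all('a'):
the program runs only very briefly and then ends: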
http://books.toscrape.com
{'href': 'http://books.toscrape.com'}
http://books.toscrape.com
All links found so far, {'http://books.toscrape.com'}
index.html
{'href': 'index.html'}
index.html
All links found so far, {'index.html', 'http://books.toscrape.com'}
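The short run above is probably explained by the relative href: `index.html` is passed to `requests.get` as-is, which raises `requests.exceptions.MissingSchema`, and the bare `except` in `main()` swallows it silently because the start URL does begin with `http`. A common approach is to resolve every href against the URL of the page it was found on and then keep only links on the same host. Below is a minimal, standard-library-only sketch of that step; `resolve_internal_links` is a hypothetical helper name, and the sample hrefs other than `index.html` are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit

def resolve_internal_links(page_url, hrefs):
    """Resolve each href against page_url and keep only same-host http(s) links."""
    host = urlsplit(page_url).netloc
    found = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # relative hrefs become absolute URLs
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            found.append(absolute)
    return found

# Sample input: only 'index.html' comes from the output above; the rest is illustrative.
hrefs = [
    "index.html",
    "catalogue/page-2.html",
    "http://quotes.toscrape.com/",
    "mailto:someone@example.com",
]
print(resolve_internal_links("http://books.toscrape.com", hrefs))
# ['http://books.toscrape.com/index.html', 'http://books.toscrape.com/catalogue/page-2.html']
```

Inside `crawler`, the hrefs could be collected with something like `[a["href"] for a in soup.find_all("a", href=True)]` and passed to this helper together with the URL that was just fetched, so that only absolute, same-host URLs are ever handed back to `requests.get`.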
【Discussion】:
Tags: python web-scraping beautifulsoup