My script does not search all links, what to do? (Answers)

【Question title】: My script does not search all links, what to do?
【Posted】: 2020-07-10 18:09:50
【Question description】:

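I am building a script that scans a website, captures its URLs, and tests whether each one works. The problem is that the script only finds the URLs on the site's home page and ignores the rest. How can I capture the links on all pages of the site?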

My code is below:

import urllib
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError


page = urllib.request.urlopen("http://www.google.com/")
soup = BeautifulSoup(page.read(), features='lxml')
links = soup.findAll("a", attrs={'href': re.compile('^(http://)')})
for link in links:

    result = (link["href"])

    req = Request(result)

    try:
        response = urlopen(req)
        pass

    except HTTPError as e:

        if e.code != 200:
            # Stop, Error!
            with open("Document_ERROR.txt", 'a') as archive:
               archive.write(result)
               archive.write('\n')
               archive.write('{} \n'.format(e.reason))
               archive.write('{}'.format(e.code))
               archive.close()
        
        else:
        # Enjoy!
            with open("Document_OK.txt", 'a') as archive:
               archive.write(result)
               archive.write('\n')
               archive.close()

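【Comments】:

What do you mean by not exploring external and internal?
The links that are linked from the page. Internal means within the site.
I still don't understand. What are you trying to do?
The goal is to find broken links on a website. The search only finds the URLs linked from the home page and does not scan the other pages.
I don't see anything in the code that would do that. I may be missing something, though.

Tags: python html web beautifulsoup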

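【Solution 1】:

The main reason this doesn't work is that you put both the OK and the ERROR writes inside the except block.

This means that only URLs that actually raise an exception get stored. Since urlopen raises HTTPError only for error statuses, the else branch that writes Document_OK.txt is effectively never reached.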
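
A minimal sketch of the corrected control flow, keeping urllib from the question: the success case moves into the `else` clause of the `try`, which runs only when no exception was raised. The function name `check_url` and its `opener` parameter are hypothetical, introduced here so the logic can be exercised without network access.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def check_url(url, opener=urlopen):
    """Return (ok, detail) for a single URL.

    `opener` is a hypothetical hook (defaults to urlopen) so the
    function can be tested with a fake instead of a real request.
    """
    try:
        response = opener(Request(url))
    except HTTPError as e:
        # The server answered with an error status such as 404 or 500.
        return False, "{} {}".format(e.code, e.reason)
    except URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return False, str(e.reason)
    else:
        # Reached only when no exception was raised.
        status = response.status
        response.close()
        return True, str(status)
```

HTTPError is caught before URLError because it is a subclass of URLError; the other order would hide the status code.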

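In general, I suggest sprinkling a few print statements at different stages of the script, or using an IDE that lets you step through the code line by line at runtime. That makes this kind of thing much easier to debug.

PyCharm is free and lets you do this. Give it a try.

So: I haven't used urllib, but I use requests a lot (python -m pip install requests). A quick refactor using it looks like this: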

import requests
from bs4 import BeautifulSoup 
import re

url = "http://www.google.com"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "lxml")

links = soup.find_all("a", attrs={'href': re.compile('^(http://)')}) 

for link in links: 
    href = link["href"]
    print("Testing for URL {}".format(href))
    
    try:
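        # since you only want to scan for status code, no need to pull the entire html of the site - use HEAD instead of GET
        # requests.head does not follow redirects by default, so enable that; the timeout keeps a dead host from hanging the scan
        r = requests.head(href, allow_redirects=True, timeout=10)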
        status = r.status_code
        # 404 etc will not yield an error
        error = None
    except Exception as e:
        # these exception will not have a status_code
        status = None
        error = e
    
    # store the finding in your files
    if status is None or status != 200:
        print("URL is broken. Writing to ERROR_Doc")
        # do your storing here of href, status and error
    else:
        print("URL is live. Writing to OK_Doc")
        # do your storing here
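
The two "do your storing here" placeholders could be filled with a small helper like the one below. This is only a sketch: the function name `record_result` is hypothetical, and the default file names are taken from the question's code.

```python
def record_result(href, status, error,
                  ok_path="Document_OK.txt",
                  error_path="Document_ERROR.txt"):
    """Append one checked URL to the OK file or the ERROR file.

    Returns the path that was written to.
    """
    if status == 200:
        with open(ok_path, "a") as archive:
            archive.write("{}\n".format(href))
        return ok_path
    # Anything else counts as broken: a non-200 status or an exception.
    with open(error_path, "a") as archive:
        archive.write("{}\nstatus: {}\nerror: {}\n".format(href, status, error))
    return error_path
```

Inside the loop, a single call to `record_result(href, status, error)` after the try/except would replace both placeholder comments. The `with` statement closes the file on its own, so no explicit `close()` is needed.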

Hope this makes sense.
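
The refactor above still tests only the links found on the start page. To reach the other pages of the site, as the question asks, one option is a breadth-first crawl that follows same-host links and collects every URL it sees. The sketch below is an illustration under stated assumptions, not part of the original answer: `LinkParser`, `fetch_html`, `crawl`, and `max_pages` are hypothetical names, and it uses only the standard library (`html.parser`, `urllib`) instead of BeautifulSoup and requests so that it runs without extra installs.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Default fetcher (assumption): download a page and return its text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl of pages on the same host as start_url.

    Returns the set of all http(s) URLs seen, internal and external.
    Only internal pages are fetched and parsed for further links.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    found = set()

    while queue and len(visited) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        try:
            html = fetch(page)
        except Exception:
            # Unreachable page: skip parsing it, keep crawling the rest.
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            url, _ = urldefrag(urljoin(page, href))
            parts = urlparse(url)
            if parts.scheme not in ("http", "https"):
                continue
            found.add(url)
            if parts.netloc == host and url not in visited:
                queue.append(url)
    return found
```

Each URL in the returned set could then go through the same HEAD check as in the loop above. The `max_pages` cap is there so the crawl stops on large sites; a real scan would probably also want a delay between requests.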

【Comments】: