Python parse webpage links counter: answers

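【Question title】: Python parse webpage links counter
【Posted】: 2019-01-08 21:11:36
【Question description】:

I am using the code below to parse links from a URL. The links are found, but my counter does not work. Any ideas on how to fix my counter? Thanks.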

def parse_all_links(html):
    links = re.findall(r"""a href=(['"].*['"])""", html)  # find links starting with href
    print("found the following links addresses: ".format(len(html)))  # print a message before the output

    if len(links) == 0:
        print("Sorry, no links found")
    else:
        count = 1  # this counts how many links are displayed
        for e in links:
            print(e)
            count += 1

    print('--------------')

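【Question comments】:

Could you clarify what you mean by "my counter does not work"?
Hi, when I run the code the actual links are displayed, but I do not see the total number of links.
Did my answer solve your problem?
Yes, thank you very much mrangry777 :)

Tags: python parsing url hyperlink counter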

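【Solution 1】:

You may want to use the len() function to get the length of the list of links, and a dedicated parsing library such as Beautiful Soup to parse the HTML, since it copes well with malformed or otherwise broken HTML.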

#encoding: utf-8
import re
from bs4 import BeautifulSoup
#example HTML
html = """
  <html>
    <head>
      <title>Link page</title>
    </head>
    <body>
      <a href="https://www.google.com" class="link">Google</a>
      <a href="https://www.yahoo.com" class="link">Yahoo</a>
      <a href="https://www.stackoverflow.com" class="link">Stackoverflow</a>
    </body>
  </html>
"""

parsed_html = BeautifulSoup(html, "lxml")

links = [a["href"] for a in parsed_html.find_all("a", href=True)]  # skip anchors that have no href

if len(links) == 0:
    print("Sorry, no links found")
else:
    count = len(links)
    for e in links:
        print(e)
    #print the total amount of links
    print(count, "links in total")
print('--------------')
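
To make the case for a real parser concrete, here is a minimal sketch comparing the regex from the question with an HTML parser on a line that holds two links. It uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs without extra packages; the class name LinkCollector and the sample HTML are made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Two links on one line (illustrative sample, not taken from a real page).
html = '<a href="a.html">A</a> <a href="b.html">B</a>'

# The greedy ".*" runs to the last quote on the line, so both links
# end up inside a single match and the count comes out too low.
regex_links = re.findall(r"""a href=(['"].*['"])""", html)

collector = LinkCollector()
collector.feed(html)

print(len(regex_links), "match(es) from the regex")
print(len(collector.links), "link(s) from the parser")
```

With this input the regex reports a single match while the parser reports both links, so a count built on the regex can be lower than the number of links actually on the page.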

【Comments】:

@mrangry777 I do not see anything like that. Could you clarify?
The line print count, "links in total" is Python 2 syntax. It may not be wrong, but it should be noted.

【Solution 2】:

I do not fully understand your question, but your code has a few small problems. Let me know if this helps:

import re
import requests


def parse_all_links(html):
    # Find the quoted value that follows "a href=" (pattern kept from the question)
    links = re.findall(r"""a href=(['"].*['"])""", html)
    print("found the following links addresses:")  # print a message before the output

    # Start at 0 so the total is exact, and so count exists even when no links are found
    count = 0
    if len(links) == 0:
        print("Sorry, no links found")
    else:
        for e in links:
            print(e)
            count += 1

    print('--------------\nCount:{}'.format(count))


parse_all_links(requests.get("http://www.onet.pl").text)

I tested this solution and it works. Sample output:

...
"https://zapytaj.onet.pl/Zadania/testy/index.html"
"https://zapytaj.onet.pl/quizy/index.html"
"https://zapytaj.onet.pl/Category/005/1,Biznes_i_Finanse.html"
"https://zapytaj.onet.pl/Category/029/1,Gry.html"
"https://zapytaj.onet.pl/Category/028/1,Hobby.html"
"https://zapytaj.onet.pl/Category/021/1,Dla_Doroslych.html"
"https://zapytaj.onet.pl/Category/009/1,Dom_i_Ogrod.html"
"https://zapytaj.onet.pl/Category/016/1,Jedzenie_i_Napoje.html"
"http://zapytaj.onet.pl"
"https://polityka-prywatnosci.onet.pl/"
"http://reklama.onet.pl/"
"http://ofirmie.onet.pl/0,0,0,PL,aktualne_ogloszenia,oferta.html"
"http://onettechnologie.pl/"
"http://www.dreamlab.pl/"
--------------
Count:319

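【Comments】:

In your code, initialise the counter with count = 0 so that it gives you the exact number of links on the page.
Thanks, I do not know why I overlooked that.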
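
The off-by-one from the starting value can be seen in a minimal sketch. The helper name count_links and the sample list are made up for illustration; the loop mirrors the one in the question.

```python
def count_links(links, start=0):
    """Count items with a manual counter that begins at `start`."""
    count = start
    for _ in links:
        count += 1
    return count


# Illustrative stand-ins for the strings that re.findall would return.
sample_links = ['"https://a.example/"', '"https://b.example/"', '"https://c.example/"']

print(count_links(sample_links, start=1))  # counter initialised to 1, as in the question
print(count_links(sample_links, start=0))  # counter initialised to 0, as suggested above
print(len(sample_links))                   # len() gives the total without a loop
```

Starting at 1 reports one link too many, while starting at 0 agrees with len(), which is why calling len() on the list is usually the simpler choice. Either way, the total only appears if it is actually printed after the loop.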