How to get the title and URL from an HTML page with Python: answers

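【Question title】: How to get the title and url in html page with python
【Posted】: 2021-06-18 13:59:51
【Question】:

I want to go into department and select/print only the name and url. I tried the following, but I can't figure out how to reach department and pick out those two specific fields. How can I get the "name" and "url" for all of the links?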

import json
import urllib.request
from bs4 import BeautifulSoup


def getContent():
    # target site url
    url = "http://www.xyz.com"  # placeholder host; urllib.request needs a scheme such as http://
    # requesting the url for data
    request = urllib.request.Request(url)
    # get the html, whole page
    htmlpage = urllib.request.urlopen(request).read()
    bsoup = BeautifulSoup(htmlpage, "html.parser")
    # print(bsoup.prettify())

    # main_table = bsoup.find("div",attrs)
    # print(main_table)
    # print(bsoup.find_all('name'))
    # nav = bsoup.nav
    # print(bsoup.title.department.url)
    # for url in find_all('a'):
    # print(url.get('href'))

    for link in bsoup.find_all("a"):
        print("Title: {}".format(link.get("name")))
        print("href: {}".format(link.get("href")))

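【Comments】:

Tags: python web-scraping beautifulsoup urllib

【Solution 1】:

The page embeds its data as JSON inside a <script type="application/ld+json"> tag, so you can parse that tag's text with the json module and read the name / url of each entry in department, as follows: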

import json
import urllib.request
from bs4 import BeautifulSoup


def get_content():
    url = "http://www.ucdenver.edu/pages/ucdwelcomepage.aspx"
    request = urllib.request.Request(url)
    html_page = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html_page, 'html.parser')

    json_data = json.loads(soup.find("script", type="application/ld+json").string)
    for data in json_data["department"]:
        print("{:<60} {}".format(data["name"], data["url"]))

get_content()

Output:

Center for Undergraduate Exploration and Advising            https://www.ucdenver.edu/center-for-undergraduate-exploration-and-advising
Commencement                                                 https://www.ucdenver.edu/commencement
Counseling Center                                            https://www.ucdenver.edu/counseling-center
First Year Experiences                                       https://www.ucdenver.edu/first-year-experiences
Health Programs                                              https://www.ucdenver.edu/programs/health-programs
Housing and Dining                                           https://www.ucdenver.edu/housing-and-dining
...
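
The same lookup can be separated from the network request so it can be tried offline. The following is a minimal sketch, not part of the original answer: the helper name extract_departments and the sample HTML are made up for illustration, and it assumes the JSON in the script tag is a single object with a department list whose items carry name and url, as the output above suggests.

```python
import json
from bs4 import BeautifulSoup

# Made-up sample page; the real page is assumed to have the same shape.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"department": [
   {"name": "Commencement", "url": "https://www.ucdenver.edu/commencement"},
   {"name": "Counseling Center", "url": "https://www.ucdenver.edu/counseling-center"}
 ]}
</script>
</head><body></body></html>
"""


def extract_departments(html):
    """Return (name, url) pairs from the first JSON-LD block, or [] if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", type="application/ld+json")
    # find() returns None when no tag matches, so guard before reading .string
    if script is None or not script.string:
        return []
    data = json.loads(script.string)
    # Assumption: the top-level JSON value is an object (dict), not a list
    return [(d["name"], d["url"]) for d in data.get("department", [])]


for name, url in extract_departments(SAMPLE_HTML):
    print("{:<60} {}".format(name, url))
```

Returning a list instead of printing inside the function keeps the parsing step reusable, and the None check avoids an AttributeError on pages that have no such script tag.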

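【Comments】:

What does soup.find("script", type="application/ld+json").string do? Why can't I just access the department names directly?
@Poala It finds the JSON data embedded in the page. For the find() method, see the Beautiful Soup docs.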
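
To make the reply above concrete, here is a small offline sketch (the HTML snippet is invented for illustration, not copied from the site). find() returns the first matching script tag, .string is the text inside it, and json.loads turns that text into a Python dict. department is a key inside that JSON text, not an HTML tag, which is why attribute-style access such as soup.department finds nothing.

```python
import json
from bs4 import BeautifulSoup

# Invented snippet standing in for the real page
html = """
<script type="application/ld+json">
{"department": [{"name": "Commencement", "url": "https://www.ucdenver.edu/commencement"}]}
</script>
"""
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("script", type="application/ld+json")  # first matching <script> tag
raw = tag.string        # the text between <script> and </script>
data = json.loads(raw)  # parse that text into a dict

print(isinstance(raw, str))           # the tag's content is plain text until parsed
print(soup.department)                # looks for a <department> tag; none exists here
print(data["department"][0]["name"])  # the name lives inside the parsed JSON
```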