【Question title】: How to get the title and URL in an HTML page with Python
【Posted】: 2021-06-18 13:59:51
【Question】:

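I want to go into "department" and select/print only the "name" and "url" fields. I tried the following, but I can't work out how to get into "department" and pick out those two specific items. How do I get the "name" and "url" for all of the links?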

import json
import urllib.request
from bs4 import BeautifulSoup


def getContent():
    # target site url
    url = "http://www.xyz.com"
    # requesting the url for data
    request = urllib.request.Request(url)
    # get the html, whole page
    htmlpage = urllib.request.urlopen(request).read()
    bsoup = BeautifulSoup(htmlpage, "html.parser")
    # print(bsoup.prettify())

    # main_table = bsoup.find("div",attrs)
    # print(main_table)
    # print(bsoup.find_all('name'))
    # nav = bsoup.nav
    # print(bsoup.title.department.url)
    # for url in find_all('a'):
    # print(url.get('href'))

    for link in bsoup.find_all("a"):
        print("Title: {}".format(link.get("name")))
        print("href: {}".format(link.get("href")))

【Question comments】:

    Tags: python web-scraping beautifulsoup urllib


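    【Solution 1】:

    The page embeds its department list as JSON-LD inside a script tag of type application/ld+json, so you can find that tag with BeautifulSoup and use the json module to get each name / url, like this: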

    import json
    import urllib.request
    from bs4 import BeautifulSoup
    
    
    def get_content():
        url = "http://www.ucdenver.edu/pages/ucdwelcomepage.aspx"
        request = urllib.request.Request(url)
        html_page = urllib.request.urlopen(request).read()
        soup = BeautifulSoup(html_page, 'html.parser')
    
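        # The structured data sits as JSON text inside a <script type="application/ld+json"> tag.
        json_data = json.loads(soup.find("script", type="application/ld+json").string)
        # Each entry under "department" is a dict with "name" and "url" keys.
        for data in json_data["department"]:
            print("{:<60} {}".format(data["name"], data["url"]))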
    
    get_content()
    

    Output:

    Center for Undergraduate Exploration and Advising            https://www.ucdenver.edu/center-for-undergraduate-exploration-and-advising
    Commencement                                                 https://www.ucdenver.edu/commencement
    Counseling Center                                            https://www.ucdenver.edu/counseling-center
    First Year Experiences                                       https://www.ucdenver.edu/first-year-experiences
    Health Programs                                              https://www.ucdenver.edu/programs/health-programs
    Housing and Dining                                           https://www.ucdenver.edu/housing-and-dining
    ...
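
The snippet above assumes the page has exactly one JSON-LD script tag and that it contains a "department" list. A slightly more defensive variant might look like the sketch below. The function name extract_departments, the sample HTML string, and the example.edu URL are made up for illustration and are not part of the original answer.

```python
import json
from bs4 import BeautifulSoup


def extract_departments(html):
    """Return (name, url) pairs found in any JSON-LD block of `html`."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all("script", type="application/ld+json"):
        if not tag.string:
            continue  # empty <script> tag, nothing to parse
        try:
            data = json.loads(tag.string)
        except json.JSONDecodeError:
            continue  # malformed JSON, skip this block
        # A JSON-LD block may hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            departments = item.get("department", [])
            if isinstance(departments, dict):
                departments = [departments]  # a single department object
            for dept in departments:
                if isinstance(dept, dict) and "name" in dept and "url" in dept:
                    pairs.append((dept["name"], dept["url"]))
    return pairs


# Hypothetical input: one valid JSON-LD block and one broken one.
sample = (
    '<script type="application/ld+json">'
    '{"department": [{"name": "Housing and Dining",'
    ' "url": "https://example.edu/housing-and-dining"}]}'
    '</script>'
    '<script type="application/ld+json">not json</script>'
)
print(extract_departments(sample))
```

Taking the HTML as a string keeps the parsing separate from the download, so the same function can be fed the bytes returned by urllib.request.urlopen(request).read() or a saved copy of the page.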
    

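    【Discussion】:

    • What does soup.find("script", type="application/ld+json").string do? Why can't I just access the department name and so on directly?
    • @Poala It finds the JSON data embedded in the page. For the find() method, see the docs.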
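
To make that reply a bit more concrete, here is a rough step-by-step sketch on a small inline page. The HTML, the "Example University" name, and the example.edu URLs are invented for illustration; only the overall shape (a "department" list inside a JSON-LD script tag) follows the answer above. It shows why a tag lookup for "department" finds nothing: the departments are keys inside JSON text, not HTML elements.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page: the department list lives inside a JSON-LD <script> tag,
# not in HTML tags of its own.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "CollegeOrUniversity",
 "name": "Example University",
 "department": [
   {"name": "Counseling Center", "url": "https://example.edu/counseling-center"},
   {"name": "Commencement", "url": "https://example.edu/commencement"}
 ]}
</script>
</head><body><a href="/about">About</a></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# There is no <department> element, so a tag-based lookup returns None.
missing = soup.find("department")

# Step 1: locate the <script> tag whose type attribute is application/ld+json.
script_tag = soup.find("script", type="application/ld+json")

# Step 2: .string is the raw text inside that tag (still just a string).
raw_json = script_tag.string

# Step 3: json.loads turns that text into Python dicts and lists.
data = json.loads(raw_json)

departments = [(d["name"], d["url"]) for d in data["department"]]
for name, url in departments:
    print("{:<20} {}".format(name, url))
```

Once the text has been parsed, "department" is an ordinary Python list, so the usual indexing and looping work on it.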