【Question title】:Downloading and finding a particular string within HTML code
【Posted】:2021-01-19 09:35:53
【Question】:

I have the following code, which tries to download the HTML of a web page and print the second song in that chart to the shell window.

from urllib.request import urlopen

#-----

url1 = 'http://www.itunescharts.net/aus/charts/songs/2020/10/03'


#-----
# Get a link to the web page from the server, using one
# of the URLs above
itunes_page = urlopen(url1)

#-----
# Extract the web page's content as a Unicode string
html_code = itunes_page.read().decode('UTF-8')

#----
# close the connection to the web server
itunes_page.close()

#-----
#finding second song on the chart 
start_marker = '<span class="no">2</span> <span class="artist">'
end_marker = '</span>'
start_position = html_code.find(start_marker)
end_position = html_code.find(end_marker)
if start_position == -1 or end_position == -1:
    print('Error: Unable to find second artist')
else:
    print('\n' + html_code[start_position + len(start_marker) : end_position].upper()) 

The HTML around the start and end markers:

<li id="chart_aus_songs_2" class="no-move">
<span class="no">2</span>
<span class="artist">Jawsh 685, Jason Derulo & BTS</span> - <span class="entry">

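I would like to know how to change my markers so that the result in the shell window is "Jawsh 685, Jason Derulo & BTS". When I run the code, I get a blank response. Any help is much appreciated!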

【Discussion】:

    Tags: python beautifulsoup html-parsing


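    【Solution 1】:

    Instead of searching for markers yourself, you can parse the HTML document easily with the BeautifulSoup library.

    (Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#).

    To get the artist's name from the HTML document, you can do this: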

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    #-----
    
    url1 = 'http://www.itunescharts.net/aus/charts/songs/2020/10/03'
    
    
    #-----
    # Get a link to the web page from the server, using one
    # of the URLs above
    itunes_page = urlopen(url1)
    
    #-----
    # Extract the web page's content as a Unicode string
    html_code = itunes_page.read().decode('UTF-8')
    
    #----
    # close the connection to the web server
    itunes_page.close()
    
    # Pass your HTML doc to BeautifulSoup and parse it using 'html.parser'
    soup = BeautifulSoup(html_code, 'html.parser')
    
    # Find the HTML element with id = "chart". This is the list of your songs.
    chart = soup.find(id="chart")
    
    # The index of the song you want to find. So if you want the 10th song in the list, set song_index = 9
    song_index = 1
    
    # Get a list of all <li> elements with class "no-move" in the chart, and get the song_index item from the list
    song = chart.find_all("li",class_="no-move")[song_index]
    
    # Find the element containing artist's name in the selected song
    artist = song.find("span",class_="artist")
    
    # Get the text of the found artist name element
    print(artist.get_text())
    

    You can of course simplify the search above with CSS selectors, but this should be enough to get you started.

    【Comments】:

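    • How can I do this without Beautiful Soup, since I'm not able to use that library? I think my markers are almost right. Is it just the line breaks that are tripping me up? Any ideas?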
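
For the comment above, here is a minimal sketch that uses only `str.find` from the standard library, in the spirit of the original marker approach. One likely cause of the blank output is that `html_code.find(end_marker)` returns the first `</span>` anywhere in the page, which can lie before the start marker and so produce an empty slice. The sketch searches in three steps, each starting where the previous one ended, so whitespace or line breaks between the two `<span>` tags do not matter. The function name `find_artist`, the `sample` string (including its first line and the `&amp;` entity) and the use of `html.unescape` are assumptions for illustration, not taken from the real page.

```python
import html


def find_artist(html_code, position):
    """Return the artist text at the given chart position, or None if not found."""
    number_marker = '<span class="no">' + str(position) + '</span>'
    artist_marker = '<span class="artist">'
    end_marker = '</span>'

    # Step 1: locate the chart number.
    number_at = html_code.find(number_marker)
    if number_at == -1:
        return None

    # Step 2: locate the next artist span, searching only after the number.
    artist_at = html_code.find(artist_marker, number_at + len(number_marker))
    if artist_at == -1:
        return None
    text_start = artist_at + len(artist_marker)

    # Step 3: locate the closing tag, searching only after the artist text begins.
    text_end = html_code.find(end_marker, text_start)
    if text_end == -1:
        return None

    # Decode entities such as &amp; (assumed to occur in the real page).
    return html.unescape(html_code[text_start:text_end]).strip()


# Hypothetical sample; the first line stands in for any earlier </span> in the page.
sample = '''<span class="title">Chart</span>
<li id="chart_aus_songs_2" class="no-move">
<span class="no">2</span>
<span class="artist">Jawsh 685, Jason Derulo &amp; BTS</span> - <span class="entry">'''

print(find_artist(sample, 2))
```

With the downloaded page, the call would be `find_artist(html_code, 2)`. Passing a start offset as the second argument to `find` is what keeps each search from matching earlier tags in the document.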