Downloading and finding a particular string within HTML code: answers

【Question title】: Downloading and finding a particular string within HTML code
【Posted】: 2021-01-19 09:35:53
【Question description】:

I have the following code, which tries to download the HTML of a web page and print the second song on that list to the shell window.

from urllib.request import urlopen

#-----

url1 = 'http://www.itunescharts.net/aus/charts/songs/2020/10/03'


#-----
# Get a link to the web page from the server, using one
# of the URLs above
itunes_page = urlopen(url1)

#-----
# Extract the web page's content as a Unicode string
html_code = itunes_page.read().decode('UTF-8')

#----
# close the connection to the web server
itunes_page.close()

#-----
#finding second song on the chart 
start_marker = '<span class="no">2</span> <span class="artist">'
end_marker = '</span>'
start_position = html_code.find(start_marker)
end_position = html_code.find(end_marker)
if start_position == -1 or end_position == -1:
    print('Error: Unable to find second artist')
else:
    print('\n' + html_code[start_position + len(start_marker) : end_position].upper())

The HTML around the start and end markers:

<li id="chart_aus_songs_2" class="no-move">
<span class="no">2</span>
<span class="artist">Jawsh 685, Jason Derulo & BTS</span> - <span class="entry">

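I would like to know how to change my markers so that the result in the shell window is "Jawsh 685, Jason Derulo & BTS". When I run the code, I get a blank response. Any help is greatly appreciated!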

【Question comments】:

Tags: python beautifulsoup html-parsing

【Solution 1】:

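Instead of searching for markers yourself, you can parse the HTML document easily with the BeautifulSoup library.

(Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#)

To get the artist's name from the HTML document, you can do this: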

from urllib.request import urlopen
from bs4 import BeautifulSoup

#-----

url1 = 'http://www.itunescharts.net/aus/charts/songs/2020/10/03'


#-----
# Get a link to the web page from the server, using one
# of the URLs above
itunes_page = urlopen(url1)

#-----
# Extract the web page's content as a Unicode string
html_code = itunes_page.read().decode('UTF-8')

#----
# close the connection to the web server
itunes_page.close()

# Pass your HTML doc to BeautifulSoup and parse it using 'html.parser'
soup = BeautifulSoup(html_code, 'html.parser')

# Find the HTML element with id = "chart". This is the list of your songs.
chart = soup.find(id="chart")

# The index of the song you want to find. So if you want the 10th song in the list, set song_index = 9
song_index = 1

# Get a list of all <li> elements with class "no-move" in the chart, and get the song_index item from the list
song = chart.find_all("li",class_="no-move")[song_index]

# Find the element containing artist's name in the selected song
artist = song.find("span",class_="artist")

# Get the text of the found artist name element
print(artist.get_text())

You can of course simplify the search above with CSS selectors, but this should do for a start.
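
For illustration, here is a minimal sketch of what the CSS-selector version might look like, using BeautifulSoup's `select()` method. It runs against a small inline sample rather than the live page. The sample's `<ul id="chart">` wrapper, the first entry and the song titles are placeholders made up for the example; only the second entry's markup follows the fragment quoted in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page. Only the second <li>
# mirrors the fragment from the question; everything else is assumed.
sample_html = '''
<ul id="chart">
<li id="chart_aus_songs_1" class="no-move">
<span class="no">1</span>
<span class="artist">Placeholder Artist</span> - <span class="entry">Placeholder Song</span>
</li>
<li id="chart_aus_songs_2" class="no-move">
<span class="no">2</span>
<span class="artist">Jawsh 685, Jason Derulo &amp; BTS</span> - <span class="entry">Another Placeholder</span>
</li>
</ul>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# One selector replaces the find / find_all / find chain:
# every <span class="artist"> inside an <li class="no-move"> inside #chart.
artists = soup.select('#chart li.no-move span.artist')

# Zero-based index, as in the answer above: 1 is the second song.
song_index = 1
artist_name = artists[song_index].get_text(strip=True)
print(artist_name)
```

If the `<li>` ids on the real page follow the `chart_aus_songs_2` pattern shown in the question, `soup.select_one('#chart_aus_songs_2 span.artist')` would be another way to reach the same element directly.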

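【Comments】:

How could I do this without Beautiful Soup, since I am not able to use that add-on? I think my markers are almost right. Is it just the line break that is tripping me up? Any ideas?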
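
One possible marker-only approach, sketched under the assumption that the page's markup matches the fragment quoted in the question. The original attempt appears to have two problems. First, the start marker expects a single space between the two `<span>` elements, while the quoted HTML has a line break there. Second, `html_code.find(end_marker)` searches from the beginning of the page, so it returns the first `</span>` anywhere in the document. That position lies before the text being extracted, which makes the slice empty and would explain the blank output. The sketch below finds the rank marker first, then searches for the artist tag and its closing tag from that point onward. The helper name `artist_at_rank` and the sample HTML (apart from the second entry) are made up for illustration.

```python
from html import unescape

def artist_at_rank(html_code, rank):
    """Return the artist text for the given chart rank, or None if not found."""
    rank_marker = f'<span class="no">{rank}</span>'
    artist_open = '<span class="artist">'
    artist_close = '</span>'

    # Step 1: locate the rank number.
    rank_pos = html_code.find(rank_marker)
    if rank_pos == -1:
        return None

    # Step 2: find the artist tag AFTER the rank marker. Searching for it
    # separately means any whitespace or line break in between is irrelevant.
    start = html_code.find(artist_open, rank_pos + len(rank_marker))
    if start == -1:
        return None
    start += len(artist_open)

    # Step 3: find the closing tag AFTER the start of the artist text,
    # not from the beginning of the document.
    end = html_code.find(artist_close, start)
    if end == -1:
        return None

    # unescape() turns entities such as &amp; back into plain characters.
    return unescape(html_code[start:end].strip())

# Hypothetical stand-in for the downloaded page. Only the second <li>
# mirrors the fragment from the question; everything else is assumed.
sample_html = '''
<li id="chart_aus_songs_1" class="no-move">
<span class="no">1</span>
<span class="artist">Placeholder Artist</span> - <span class="entry">Placeholder Song</span>
</li>
<li id="chart_aus_songs_2" class="no-move">
<span class="no">2</span>
<span class="artist">Jawsh 685, Jason Derulo &amp; BTS</span> - <span class="entry">Another Placeholder</span>
</li>
'''

print(artist_at_rank(sample_html, 2))
```

With the real page, the same function would be called on the `html_code` string from the question's download code, for example `artist_at_rank(html_code, 2).upper()` once a `None` result has been ruled out.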