Find specific text on a webpage using BeautifulSoup? Answers

[Question title]: Find specific text on a webpage using BeautifulSoup?
[Posted]: 2023-03-16 17:21:01
[Question description]:

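I need to get the last page number of a manga from this webpage. A dropdown list on the page contains the string 'Last Page(57)'. I want to use Beautiful Soup to find that last page number.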

import bs4 as bs
import requests

ref = requests.get('https://readms.net/r/onepunch_man/083/4685/3')
soup = bs.BeautifulSoup(ref.text, 'lxml')

#FIND OUT THE LAST PAGE NUMBER FROM THE SOURCE CODE!!!

print(soup.find_all(string='Last Page'))

[Question comments]:

What error are you getting?
There is no error. It just doesn't print anything.
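
The empty result is consistent with how the string argument behaves: a plain string only matches a text node that equals it exactly, while a compiled regular expression is applied as a search and so accepts a partial match. The sketch below shows the difference on a small hand-written snippet. The markup is an assumption standing in for the real page, and html.parser is used only to avoid the lxml dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page's dropdown.
html = """
<ul class="dropdown-menu">
  <li><a href="/r/onepunch_man/083/4685/1">First Page</a></li>
  <li><a href="/r/onepunch_man/083/4685/57">Last Page (57)</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so nothing is found here.
exact = soup.find_all(string="Last Page")

# A compiled pattern is applied as a search, so a partial match is enough.
partial = soup.find_all(string=re.compile(r"Last Page"))

print(exact)
print(partial)
```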

Tags: python regex python-3.x web-scraping beautifulsoup

[Solution 1]:

Use this code:

res = soup.find_all("ul",{"class":"dropdown-menu"})[-1].find_all("li")[-1].text
print(res)

Output:

'Last Page (57)'

To extract the number, use:

import re
last_page_number = re.findall(r"\d+", res)  # raw string avoids an invalid-escape warning
print(last_page_number)

Output:

['57']
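
Chained [-1] indexing raises IndexError if the page has no matching dropdown, and re.findall returns a list of strings rather than a number. A minimal sketch of a more defensive variant follows. The helper name last_page_from_dropdown and the sample markup are hypothetical, and the real page may be structured differently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may be structured differently.
sample_html = """
<ul class="dropdown-menu"><li><a href="#">Chapter 083</a></li></ul>
<ul class="dropdown-menu">
  <li><a href="/r/onepunch_man/083/4685/1">First Page</a></li>
  <li><a href="/r/onepunch_man/083/4685/57">Last Page (57)</a></li>
</ul>
"""

def last_page_from_dropdown(html):
    """Return the page count from the last dropdown entry, or None."""
    soup = BeautifulSoup(html, "html.parser")
    menus = soup.find_all("ul", class_="dropdown-menu")
    if not menus:
        return None
    items = menus[-1].find_all("li")
    if not items:
        return None
    # Take the first run of digits in the entry's text and convert it to int.
    match = re.search(r"\d+", items[-1].get_text())
    return int(match.group()) if match else None

print(last_page_from_dropdown(sample_html))
```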

[Comments]:

[Solution 2]:

With bs4 4.7.1, you can use :contains to get the a tag whose innerText contains Last Page:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://readms.net/r/onepunch_man/083/4685/3')
soup = BeautifulSoup(r.content, 'lxml')
last_page = int(soup.select_one('a:contains("Last Page")')['href'].split('/')[-1])
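
Recent releases of Soup Sieve, the selector library behind select and select_one, deprecate :contains in favor of :-soup-contains. The sketch below assumes such a release is installed and uses hand-written markup in place of the live page, with a guard for the case where no link matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
sample_html = """
<ul class="dropdown-menu">
  <li><a href="/r/onepunch_man/083/4685/1">First Page</a></li>
  <li><a href="/r/onepunch_man/083/4685/57">Last Page (57)</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# :-soup-contains matches elements whose text contains the given substring.
link = soup.select_one('a:-soup-contains("Last Page")')

# The page number is taken from the last path segment of the link target.
last_page = int(link["href"].rstrip("/").split("/")[-1]) if link else None
print(last_page)
```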

Less robust:

You can match by position using the selector

.btn-reader-page li:last-child a
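
As a rough illustration of the positional selector, the sketch below runs it against hand-written markup. Only the class name btn-reader-page comes from the selector itself; the surrounding structure is an assumption. It is less robust because it depends on the last list item always being the Last Page entry.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the btn-reader-page class is taken from the selector.
sample_html = """
<div class="btn-reader-page">
  <ul class="dropdown-menu">
    <li><a href="/r/onepunch_man/083/4685/1">First Page</a></li>
    <li><a href="/r/onepunch_man/083/4685/2">2</a></li>
    <li><a href="/r/onepunch_man/083/4685/57">Last Page (57)</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# li:last-child picks the final <li> of its parent list, whatever its text is.
link = soup.select_one(".btn-reader-page li:last-child a")
last_page = int(link["href"].split("/")[-1]) if link else None
print(last_page)
```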

[Comments]:

[Solution 3]:

You don't need BeautifulSoup. Just search the page source for the Last Page entry:

import re
import requests

r = requests.get('https://readms.net/r/onepunch_man/083/4685/3').text
last_page = re.findall(r'Last Page \((\d+)\)', r)[0]  # raw string keeps \( and \d literal

Output:

'57'
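
Indexing the findall result with [0] raises IndexError when the text is absent. A minimal standard-library sketch that returns None instead is shown below. The helper name last_page_from_source and the sample strings are hypothetical; in practice the argument would be the response text.

```python
import re

# Matches the literal label and captures the digits inside the parentheses.
PATTERN = re.compile(r"Last Page \((\d+)\)")

def last_page_from_source(page_source):
    """Return the last page number as an int, or None if the label is absent."""
    match = PATTERN.search(page_source)
    return int(match.group(1)) if match else None

# Hypothetical fragments standing in for the downloaded page source.
sample = '<li><a href="/r/onepunch_man/083/4685/57">Last Page (57)</a></li>'
print(last_page_from_source(sample))
print(last_page_from_source("<p>no pager here</p>"))
```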

[Comments]: