Beautiful Soup cannot scrape after the first div tag

【Question title】: Beautiful Soup cannot scrape after the first div tag
【Posted】: 2021-01-23 17:07:46
【Question description】:

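Please see below. I want to scrape the restaurant name, Popeyes, which is located in a tag on the page.

Please see the image below for this site's HTML.

Can anyone tell me how to scrape the restaurant name "Popeyes" in Python, using Beautiful Soup or any other web scraping package?

Thanks in advance!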

Below is the code I used to scrape the data; however, it stops there and I can't get any further.

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

# URL to web scrape from.
# In this example we scrape a DoorDash store page.
page_url = "https://www.doordash.com/store/popeyes-toronto-254846/en-CA"

# Opens the connection and downloads the HTML page from the URL.
uClient = uReq(page_url)

# Parses HTML into a soup data structure to traverse HTML
# as if it were a JSON data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# Returns only the first <div> tag in the document.
page_soup.div

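【Question comments】:

Please provide the code as text rather than as an image.
Hi M Z, thanks for your reply. When I copied it, the HTML seemed very long, so I thought sending you the link directly might be more efficient. Below is the link I'm trying to scrape; for now I only want the restaurant name, and eventually the restaurant's rating and type. doordash.com/store/popeyes-toronto-254846/en-CA
SO is not a coding service. You have to make an attempt first; if you get stuck, post your code and tell us where the problem is. See How do I ask a good question?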

Tags: python web-scraping beautifulsoup web-crawler

【Solution 1】:

You can try this (I may have gotten the class name wrong):

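import urllib.request
from bs4 import BeautifulSoup

url_1 = 'https://www.doordash.com/store/popeyes-toronto-254846/en-CA'

# Download the raw HTML of the store page.
sauce_1 = urllib.request.urlopen(url_1).read()

# Parse it with the lxml parser (requires the lxml package to be installed).
soup_1 = BeautifulSoup(sauce_1, 'lxml')

# Print every <h1> whose class attribute equals this exact string.
for x in soup_1.find_all('h1', class_='sc-AnqlK keKZVr sc-jFpLkX bsGprJ'):
    print(x)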

Let me know if this helps!
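
A note on the class_ argument used above: when it is given a string containing several class names, Beautiful Soup compares that string with the exact value of the class attribute, so a different order or an extra class produces no match. The sketch below illustrates this on a made-up HTML snippet (the markup and class names are hypothetical, not DoorDash's actual page) and compares it with matching a single class or using a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not taken from the real page.
html = """
<div class="page">
  <div class="header">
    <h1 class="title big store-name">Popeyes</h1>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A multi-class string must equal the attribute value exactly, order included.
exact = soup.find_all("h1", class_="title big store-name")
reordered = soup.find_all("h1", class_="big title store-name")

# A single class name matches any tag that carries that class.
single = soup.find_all("h1", class_="store-name")

# A CSS selector matches the listed classes in any order.
css = soup.select("h1.store-name.title")

print(len(exact), len(reordered), len(single), len(css))  # 1 0 1 1
print(single[0].get_text())  # Popeyes
```

Matching on one class, or on a CSS selector, tends to be less brittle than copying the whole class string, although it still only works if the tag is present in the downloaded HTML.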

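【Comments】:

Hi Matteo, thanks for answering my question. However, I'm not sure what is happening; the code doesn't seem to produce any output. I did some research, and I think this may be some kind of dynamic content on the site. Selenium can apparently scrape it, but I'm really new to web scraping, so that's a bit too advanced for me.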
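
One way to test the dynamic-content suspicion before reaching for Selenium is to check whether the name is present at all in the HTML the server returns. The sketch below is only an illustration under stated assumptions: locate_text is a hypothetical helper, and the three sample strings are invented stand-ins, not the real page.

```python
from bs4 import BeautifulSoup


def locate_text(html: str, needle: str) -> str:
    """Report where `needle` occurs in `html` (hypothetical helper).

    Returns "markup" if it appears in a text node outside <script>/<style>,
    "raw source only" if it appears somewhere else in the source
    (for example in embedded JSON), and "absent" otherwise.
    """
    soup = BeautifulSoup(html, "html.parser")
    # string=True yields every text node, including script contents.
    for node in soup.find_all(string=True):
        if needle in node and node.parent.name not in ("script", "style"):
            return "markup"
    return "raw source only" if needle in html else "absent"


# Invented samples: static markup, data embedded in a script, and an empty shell.
static_page = "<html><body><div><h1>Popeyes</h1></div></body></html>"
embedded_page = (
    '<html><body><div id="root"></div>'
    '<script>window.__DATA__ = {"name": "Popeyes"};</script></body></html>'
)
shell_page = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

for page in (static_page, embedded_page, shell_page):
    print(locate_text(page, "Popeyes"))
```

If the downloaded page behaves like the last two samples, no find_all call on tags will return the name, which would be consistent with the empty output described above.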

【Solution 2】:

You can get the name by specifying the class of the element.

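from bs4 import BeautifulSoup
import requests

# Store page from the question.
url = "https://www.doordash.com/store/popeyes-toronto-254846/en-CA"

# Send a browser-like User-Agent, since some sites may reject the default one.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# find() returns None when nothing matches, so check before calling get_text().
element = soup.find(class_='sc-AnqlK keKZVr sc-jFpLkX bsGprJ')
title = element.get_text() if element is not None else None

print(title)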

I don't know whether I wrote the class name correctly, but you can copy and paste it from the page.
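
As for why the question's code appears to stop at the first div: attribute access such as page_soup.div is shorthand for find('div'), which returns only the first matching tag. A minimal sketch on a made-up snippet (the markup is hypothetical) showing how to reach a tag nested further down:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div id="first"><p>banner</p></div>
<div id="second">
  <div id="inner"><h1>Popeyes</h1></div>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# Attribute access is shorthand for find(): only the first <div> comes back.
first_div = page_soup.div

# find_all() returns every <div> in the document, nested ones included.
all_divs = page_soup.find_all("div")

# find() searches all descendants, so there is no need to walk div by div.
heading = page_soup.find("h1")

print(first_div["id"])               # first
print([d["id"] for d in all_divs])   # ['first', 'second', 'inner']
print(heading.get_text())            # Popeyes
```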

【Comments】: