Unable to extract content from DOM element with $0 thru BeautifulSoup: answers

【Question title】: Unable to extract content from DOM element with $0 thru BeautifulSoup
【Posted】: 2020-03-05 08:39:29
【Question description】:

Here is the website from which I am trying to scrape the number of reviews.

I want to extract the number 272, but it returns None every time. I have to use BeautifulSoup. This is what I tried:

import requests
from bs4 import BeautifulSoup

sources = requests.get('https://www.thebodyshop.com/en-us/body/body-butter/olive-body-butter/p/p000016')
soup = BeautifulSoup(sources.content, 'lxml')
# Look up the product-info container, then its first nested div
x = soup.find('div', {'class': 'columns five product-info'}).find('div')
print(x)

Output: an empty tag.

I want to go further into that tag.

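【Question discussion】:

Why do so many people think the word is "scrap" rather than "scrape"?
Where in the HTML is class="column five product-info"?
What does $0 have to do with this?
@Barmar I corrected the spelling, thank you. Please check my link again.
The image should show the relevant part of the HTML.

Tags: python html dom web-scraping beautifulsoup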

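【Solution 1】:

The number of reviews is retrieved dynamically from a URL that you can find in the browser's network tab, which is why it is missing from the HTML that requests downloads. You can simply use a regex to extract it from response.text. The endpoint is part of a defined ajax handler.

You can find many of the API instructions in one of the js files: https://thebodyshop-usa.ugc.bazaarvoice.com/static/6097redes-en_us/bvapi.js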

If you really want to, you can trace it back through a whole lot of jQuery.

tl;dr: I think you only need to add the product_id to a constant string.

import re
import requests

# Captures the digits that follow "numReviews": in the response text
p = re.compile(r'"numReviews":(\d+),')
ids = ['p000627']  # product IDs to look up

with requests.Session() as s:  # one session reuses the connection
    for product_id in ids:
        r = s.get(f'https://thebodyshop-usa.ugc.bazaarvoice.com/6097redes-en_us/{product_id}/reviews.djs?format=embeddedhtml')
        print(int(p.findall(r.text)[0]))
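
For a longer list of product IDs, the same idea can be wrapped in small helpers. This is only a minimal sketch: the URL template is copied from the code above, while `build_url`, `extract_num_reviews`, and `fetch_counts` are hypothetical names, and it assumes every response contains a `"numReviews":<digits>,` fragment in the same shape.

```python
import re

# URL template taken from the answer above; only the product ID varies.
URL_TEMPLATE = (
    "https://thebodyshop-usa.ugc.bazaarvoice.com/6097redes-en_us/"
    "{product_id}/reviews.djs?format=embeddedhtml"
)
NUM_REVIEWS = re.compile(r'"numReviews":(\d+),')


def build_url(product_id):
    """Insert a product ID into the constant URL string."""
    return URL_TEMPLATE.format(product_id=product_id)


def extract_num_reviews(text):
    """Return the first review count found in text, or None if absent."""
    match = NUM_REVIEWS.search(text)
    return int(match.group(1)) if match else None


def fetch_counts(session, product_ids):
    """Map each product ID to its review count (None when not found).

    `session` is assumed to behave like requests.Session: it needs a
    get(url) method whose result has a .text attribute.
    """
    return {
        pid: extract_num_reviews(session.get(build_url(pid)).text)
        for pid in product_ids
    }
```

With requests installed, calling `fetch_counts(requests.Session(), ids)` would return a dict keyed by product ID. An ID whose response lacks the field maps to None instead of raising an IndexError.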

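【Discussion】:

Hi QHarr, thanks for providing this solution. It works, but how did you get the bazaarvoice URL? Can I extract it dynamically using BeautifulSoup? Also, how do I iterate over 100+ URLs to extract the review counts?
Thanks QHarr, this is what I did: I made a for loop over the product IDs ... in a constant string.
Hi, I am trying similar code to get the rating, but it returns an empty list. q = re.compile(r'"avgRating":(\d+),') rating.append((q.findall(r.text)[0]))
Have you tried debugging the script? Have you checked whether the regex works against the source, and whether that source contains the value? It seems there may be a different handler/endpoint. Fetching the js file, de-minifying it, and then looking for the appropriate endpoint is part of a possible solution.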
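
One thing worth checking for the avgRating attempt above: `\d+` matches digits only, so if the rating is serialized as a decimal (an assumption, since the actual response is not shown in this thread), the pattern `"avgRating":(\d+),` cannot match and findall returns an empty list. A minimal sketch on made-up sample text, where `AVG_RATING` and `extract_avg_rating` are hypothetical names:

```python
import re

# Integer-only pattern from the comment above.
INT_ONLY = re.compile(r'"avgRating":(\d+),')
# Variant that also accepts an optional fractional part.
AVG_RATING = re.compile(r'"avgRating":(\d+(?:\.\d+)?)')

# Made-up sample; the real response may be shaped differently.
sample = '{"numReviews":272,"avgRating":4.5,"id":"p000627"}'


def extract_avg_rating(text):
    """Return the first average rating as a float, or None if absent."""
    match = AVG_RATING.search(text)
    return float(match.group(1)) if match else None


print(INT_ONLY.findall(sample))    # [] because "." follows the digit, not ","
print(extract_avg_rating(sample))  # 4.5
```

If the value is still missing with the relaxed pattern, the earlier advice applies: the rating may come from a different handler or endpoint.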