How to parse the data in a <pre> tag using BeautifulSoup? (Answers)

【Question title】: How to parse the data in <pre> tag using beautifulsoup?
【Posted】: 2018-04-20 21:17:53
【Question】:

I am trying to scrape data from the following address:

url = https://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/1061083288/reviews.djs?format=embeddedhtml&page=4&scrollToTop=true

It comes from the bedbathbeyond website. When I fetch it with requests and BeautifulSoup, I get nothing back. Why is that?

Code:

import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # url is the address given above
soup = BeautifulSoup(r.content, 'lxml')
soup.find_all('span', class_='BVRRReviewAbbreviatedText')

The result is an empty list: []

【Comments】:

That's because the HTML is loaded through an AJAX call, so BeautifulSoup cannot parse the content directly.
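
One way to see what this comment means is to look at the raw response text instead of the parsed soup. The sketch below uses a short made-up string in place of r.text. The real response is not reproduced in this thread, so its exact shape, including the backslash-escaped quotes, is an assumption. The helper markup_forms is a hypothetical name.

```python
# Hypothetical stand-in for r.text: the markup sits inside a JavaScript
# string literal, so its quotes and slashes are backslash-escaped.
sample_body = (
    r'var materials={"BVRRSourceID":'
    r'"<span class=\"BVRRReviewAbbreviatedText\">Great towels<\/span>"};'
)


def markup_forms(body, class_name):
    """Report in which form class_name shows up in the raw response text."""
    return {
        'name_present': class_name in body,
        'plain_attribute': f'class="{class_name}"' in body,
        'escaped_attribute': f'class=\\"{class_name}\\"' in body,
    }


print(markup_forms(sample_body, 'BVRRReviewAbbreviatedText'))
```

If the real response behaves like the sample, the span exists only as escaped text inside a script, so the string has to be decoded first before BeautifulSoup can search it.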

Tags: python beautifulsoup

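【Solution 1】:

I used js2py because the materials object contains several keys (BVRRRatingSummarySourceID, BVRRSecondaryRatingSummarySourceID and BVRRSourceID). If you need all of them, pulling the HTML out of their values with regular expressions alone would be harder.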

import re

from bs4 import BeautifulSoup
import js2py
import requests

r = requests.get('https://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/1061083288/reviews.djs?format=embeddedhtml')

# Match the JavaScript statement "var materials = {...}" in the response
pattern = (r'var'
           r'\s+'
           r'materials'
           r'\s*=\s*'
           r'{"BVRRRatingSummarySourceID".*}')

# Evaluate the statement with js2py and convert the result to a Python dict
js_materials = re.search(pattern, r.text).group()
obj = js2py.eval_js(js_materials).to_dict()

# The review markup is stored under the BVRRSourceID key
html = obj['BVRRSourceID']
soup = BeautifulSoup(html, 'lxml')
spans = soup.select('span.BVRRReviewAbbreviatedText')

>>> len(spans)
5

In the example above I only used the HTML under the BVRRSourceID key, but you can use the entire HTML by joining the values together:

html = ''.join(obj.values())

Don't forget to install js2py (pip install js2py) and, if you want to use the lxml parser, lxml as well (pip install lxml).
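
As a possible alternative to js2py: if the object assigned to materials happens to be valid JSON (an assumption, not something this answer states), the standard library can decode it. This is a sketch of a different technique, not the answer's method. The function extract_materials and the sample string are made up for illustration.

```python
import json
import re


def extract_materials(js_text):
    """Decode the object literal assigned to `materials`, assuming it is valid JSON."""
    # Locate the opening brace that follows "var materials ="
    match = re.search(r'var\s+materials\s*=\s*(?=\{)', js_text)
    if match is None:
        raise ValueError('no "var materials = {...}" assignment found')
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever JavaScript follows it
    obj, _end = json.JSONDecoder().raw_decode(js_text, match.end())
    return obj


# Hypothetical stand-in for r.text
sample_js = (
    r'var materials={"BVRRRatingSummarySourceID":"<div>4.5<\/div>",'
    r'"BVRRSourceID":"<span class=\"BVRRReviewAbbreviatedText\">Soft<\/span>"};'
    r' doSomething({});'
)

materials = extract_materials(sample_js)
html = ''.join(materials.values())
print(html)
```

The resulting html string can be passed to BeautifulSoup in the same way as in the answer. If the literal uses JavaScript-only syntax such as unquoted keys or single-quoted strings, json will raise an error, and a JavaScript evaluator like js2py is the safer choice.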

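【Comments】:

Even though I don't fully understand some parts of the answer, it works! Thank you very much!
You can read about regular expressions here.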

【Solution 2】:

You can use the selenium webdriver to get the HTML content you are interested in. For example:

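import time

from selenium import webdriver


def get_html(url):
    driver = webdriver.Chrome()
    try:
        driver.maximize_window()
        driver.get(url)

        # Give the page time to finish loading before reading the source
        time.sleep(5)
        html_content = driver.page_source.strip()
    finally:
        # Close the browser even if loading fails
        driver.quit()
    return html_content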
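
A fixed five-second pause either wastes time or turns out too short on a slow connection. Selenium provides WebDriverWait for this purpose. The library-independent sketch below only illustrates the polling idea behind such a wait. The name wait_until and its parameters are hypothetical and not part of Selenium.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.25):
    """Poll predicate() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third check
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), timeout=2.0, interval=0.01))
```

Inside get_html, a call such as wait_until(lambda: 'BVRRReviewText' in driver.page_source) could take the place of time.sleep(5), assuming that class name appears in the source once the content is ready.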

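【Comments】:

Hi, thanks for your answer. After saving the result to a variable, say "a = get_html(url)", I tried to parse it with BeautifulSoup: soup = BeautifulSoup(a, 'lxml'), followed by soup.find_all('span', class_='BVRRReviewText'), but I still cannot retrieve anything. Why is that?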
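
A possible explanation, offered as an assumption rather than something confirmed in this thread: the address returns JavaScript, not a web page, and a browser typically displays such a response as plain text wrapped in a <pre> element, which would also match the question title. In that case page_source contains no span elements at all, only escaped text. The sketch below uses a made-up page_source and a hypothetical helper pre_text to recover that text with the standard library. With BeautifulSoup, soup.pre.get_text() would play the same role.

```python
import html
import re


def pre_text(page_source):
    """Return the unescaped text of the first <pre> element, or None if there is none."""
    match = re.search(r'<pre[^>]*>(.*?)</pre>', page_source, re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    # The browser serialises "<" and ">" as entities; turn them back into characters
    return html.unescape(match.group(1))


# Hypothetical stand-in for driver.page_source
sample_source = (
    r'<html><head></head><body><pre>'
    r'var materials={"BVRRSourceID":'
    r'"&lt;span class=\"BVRRReviewText\"&gt;Nice&lt;\/span&gt;"};'
    r'</pre></body></html>'
)

print(pre_text(sample_source))
```

The recovered text is the same kind of JavaScript that requests returns, so the regular expression and js2py steps from the first answer would still be needed before searching for spans.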