How to use BeautifulSoup to get all IMDB user reviews of a movie

【Question title】: How can I use BeautifulSoup to get all IMDB user reviews of a movie
【Posted】: 2019-10-09 04:59:15
【Question description】:

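I am working on a school project and want to collect all user reviews of superhero movies on IMDB.

As a first step, I am trying to get all user reviews for a single movie.

The user reviews page consists of 25 user reviews and a "Load More" button. I have managed to write code that clicks the "Load More" button until everything is loaded, but I am stuck on the second part: putting all user reviews into a list.

I tried using BeautifulSoup to find all the "content" sections on the page. However, my list stays empty.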

from bs4 import BeautifulSoup
testurl = "https://www.imdb.com/title/tt0357277/reviews?ref_=tt_urv"
patience_time1 = 60
XPATH_loadmore = "//*[@id='load-more-trigger']"
XPATH_grade = "//*[@class='review-container']/div[1]"
list_grades = []

driver = webdriver.Firefox()
driver.get(testurl)

# This is the part in which I open all 'load more' buttons.
while True:
    try:
        loadmore = driver.find_element_by_id("load-more-trigger")
        time.sleep(2)
        loadmore.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
    print("Complete")
    time.sleep(10)

    # When the whole page is loaded, I want to get all 'content' parts.
    soup = BeautifulSoup(driver.page_source)
    content = soup.findAll("content")
    list_content = [c.text_content() for c in content]

driver.quit()

I expect to get a list with the contents of every review container on the page. However, my list stays empty.

【Question comments】:

Have you looked at which request is sent when you click "Load More"? It may be easier to replicate that request instead.
When I run your code locally I get name 'webdriver' is not defined. Could you provide a requirements.txt?
@Jeff Xiao I imported the following modules: from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.by import By from selenium.common.exceptions import NoSuchElementException import time
@Marieke I have added an answer. One more note: you may want to adjust the sleep times; at the moment they are far too long on my machine.
stackoverflow.com/search?q=imdb+review

Tags: python selenium web-scraping beautifulsoup findall

【Solution 1】:

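You are using BeautifulSoup4, right?

The method names changed between version 3 and version 4 (see the documentation).

In addition, find_all takes a tag name, plus an optional class_ argument for the CSS class (see this SO answer).

So your code should use the new name: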

    # content = soup.findAll("content")
    content = soup.find_all('div', class_=['text','show-more__control'])
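
To make the difference between a tag-name search and a class search concrete, here is a minimal sketch. The HTML string is hypothetical markup, loosely modeled on the class names used in this thread; the real IMDB page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real IMDB HTML may differ.
SAMPLE_HTML = """
<div class="review-container">
  <div class="content">
    <div class="text show-more__control">Great movie.</div>
  </div>
</div>
<div class="review-container">
  <div class="content">
    <div class="text show-more__control">Too long.</div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")

# A bare string is treated as a tag name, so this looks for <content> elements.
by_tag_name = soup.find_all("content")

# class_ filters on the CSS class instead.
by_class = soup.find_all("div", class_="content")

# A list matches elements that carry any of the listed classes.
review_divs = soup.find_all("div", class_=["text", "show-more__control"])

print(len(by_tag_name), len(by_class), len(review_divs))
```

With this sample, the tag-name search finds nothing because no element is named content, which is consistent with the empty list described in the question.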

Also use get_text() in your list comprehension:

# list_content = [c.text_content() for c in content]
list_content = [tag.get_text() for tag in content]
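
As a small illustration of get_text(), the sketch below extracts text from a made-up review snippet (the HTML is an assumption, not taken from IMDB). It also shows the optional separator and strip arguments, which can tidy up whitespace around line breaks; text_content() is an lxml method rather than a BeautifulSoup one.

```python
from bs4 import BeautifulSoup

# Hypothetical review markup, for illustration only.
html = (
    '<div class="text show-more__control">\n'
    "  Great movie.<br/>Would watch again.\n"
    "</div>"
)

soup = BeautifulSoup(html, features="html.parser")
tags = soup.find_all("div", class_="text")

# Plain get_text() concatenates the text nodes exactly as they appear.
raw = [tag.get_text() for tag in tags]

# strip=True trims each text node; separator is placed between the nodes.
clean = [tag.get_text(separator=" ", strip=True) for tag in tags]

print(clean)
```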

Finally, specify a parser when creating the soup (see the documentation):

    soup = BeautifulSoup(driver.page_source, features="html.parser")

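Otherwise you will run into this UserWarning:

SO56261323.py:36: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.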
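
One way to check the parser advice locally is to count the warnings raised while building the soup. This is only a sketch: count_user_warnings is a hypothetical helper written for this example, and the exact warning class may vary between BeautifulSoup versions.

```python
import warnings

from bs4 import BeautifulSoup

HTML = "<p>hello</p>"


def count_user_warnings(**kwargs):
    """Build a soup and return how many UserWarning-type warnings were raised."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones already shown earlier in the process.
        warnings.simplefilter("always")
        BeautifulSoup(HTML, **kwargs)
    return sum(issubclass(w.category, UserWarning) for w in caught)


# Parser named explicitly, as recommended in the answer.
explicit = count_user_warnings(features="html.parser")

# No parser named; BeautifulSoup has to pick one itself.
implicit = count_user_warnings()

print("explicit parser:", explicit, "| no parser:", implicit)
```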

【Comments】: