How to crawl IMDB? [Read more] button not being pressed

【Question title】: How to crawl IMDB? [Read more] button not being pressed
【Posted】: 2019-06-03 05:02:09
【Question description】:

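I am extracting IMDB movie reviews.

The problem is that, to bring up more reviews, the [Read more] button has to be pressed.

But I don't know how to stop once the reviews run out.

At the moment I handle this by polling. How can this be handled in a smarter way?

When there is more to read:

When there is nothing left to read:

Thanks!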

【Question discussion】:

Use Selenium for these tasks
stackoverflow.com/questions/1966503/…

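Tags: python beautifulsoup web-crawler

【Solution 1】:

If you are doing this in Python, you can use XPath to extract elements from the HTML page. An example of retrieving reviews with XPath is shown below. Wrap each lookup in try/except so that a missing element does not raise an error, and so that the loop can end when the page has no more information. Take a look at the following example; it may help you: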

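from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# `driver` (a Selenium WebDriver) and `page` (the fallback name for the
# current page) are assumed to be defined earlier in the script.
reviews = driver.find_elements(By.XPATH, '//article[@itemprop = "review"]')
for review in reviews:

    # Initialize an empty dictionary for each review
    review_dict = {}

    # Find xpaths of the fields desired as columns in future data frame
    # We use the try/except statements to account for the fact that the reviews are not required to have
    # all the fields listed below, and if a review does not have a certain field we wish to make the
    # corresponding field blank in that particular row, rather than quit upon receiving an error.
    try:
        # No leading "." here, so this XPath searches the whole page, not just this review
        airline = review.find_element(
            By.XPATH, '//div[@class = "review-heading"]//h1[@itemprop = "name"]').text
    except NoSuchElementException:
        airline = page
    try:
        # The leading "." restricts the search to the current review element
        overall = review.find_element(By.XPATH, './/span[@itemprop = "ratingValue"]').text
    except NoSuchElementException:
        overall = ""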

Similarly, in the IMDB case you can look up elements by XPath inside try/except, so that no error is raised when there is nothing left to read.
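For the [Read more] button itself, the same idea can be sketched as a loop that keeps clicking until the button can no longer be found. This is only a minimal sketch, not the answerer's code: `click_until_gone` and its parameters are hypothetical names, and `find_button` is assumed to be a small function you write yourself that returns the button element, or `None` when the lookup fails (for example by catching `NoSuchElementException` around `driver.find_element(...)`).

```python
import time


def click_until_gone(find_button, pause=1.0, max_clicks=1000):
    """Click the button returned by find_button() until it returns None.

    find_button: hypothetical callable that returns an object with a
                 .click() method, or None when there is nothing left to load.
    pause:       seconds to wait after each click so new content can load.
    max_clicks:  safety cap so the loop cannot run forever.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            # No button left: everything has been loaded.
            break
        button.click()
        clicks += 1
        time.sleep(pause)
    return clicks
```

Because the loop stops as soon as the lookup fails, there is no need to guess in advance how many times the button has to be pressed; `max_clicks` is only there as a guard in case the button never disappears.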

【Discussion】:

Thanks! However, as a constraint, I think I should add a short [time.sleep]. Thanks anyway.
Yes, you should. Otherwise you may miss information.
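A fixed `time.sleep` works, but it either waits longer than needed or, if too short, still misses content. One possible alternative is to poll for a condition with an upper time limit. The helper below is a plain-Python sketch under that assumption; `wait_until` and its parameters are hypothetical names, and `condition` stands for any check you supply, such as "the number of reviews on the page has grown". Selenium's own `WebDriverWait` offers the same kind of bounded wait.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it is truthy or `timeout` seconds have passed.

    Returns the truthy value from condition(), or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            # Gave up: the condition never became true in time.
            return None
        time.sleep(poll)
```

A `None` result can then be treated as "nothing more was loaded", which gives the scraping loop a second, time-based way to stop.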