【Question title】: Data Mining IMDB Reviews - Only extracting the first 25 reviews
【Posted】: 2021-07-26 08:07:07
【Question】:

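I am currently trying to extract all the reviews for the movie Spider-Man: Homecoming, but I can only get the first 25. The page initially shows only the first 25 reviews, and I was able to click "Load More" on IMDB until all of them were loaded, but for some reason I cannot scrape all the reviews once they have loaded. Does anyone know what I am doing wrong?

Below is the code I am running: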

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


#Set the web browser
driver = webdriver.Chrome(executable_path=r"C:\Users\Kent_\Desktop\WorkStudy\chromedriver.exe")

#Go to the IMDB reviews page
driver.get("https://www.imdb.com/title/tt6320628/reviews?ref_=tt_urv")

#Loop load more button
wait = WebDriverWait(driver,10)
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:break


#Scrape IMDB review
ans = driver.current_url
page = requests.get(ans)
soup = BeautifulSoup(page.content, "html.parser")
all = soup.find(id="main")

#Get the title of the movie
all = soup.find(id="main")
parent = all.find(class_ ="parent")
name = parent.find(itemprop = "name")
url = name.find(itemprop = 'url')
film_title = url.get_text()
print('Pass finding phase.....')

#Get the title of the review
title_rev = all.select(".title")
title = [t.get_text().replace("\n", "") for t in title_rev]
print('getting title of reviews and saving into a list')

#Get the review
review_rev = all.select(".content .text")
review = [r.get_text() for r in review_rev]
print('getting content of reviews and saving into a list')

#Make it into dataframe
table_review = pd.DataFrame({
    "Title" : title,
    "Review" : review
})
table_review.to_csv('Spiderman_Reviews.csv')

print(title)
print(review)

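【Comments】:

  • Are only the ones in the viewport present at any given time? If you run the selector in the find box of the browser's Elements tab, after making sure all the results are present, what match count is given?
  • The match count given is 2130, which seems correct. What could be the problem?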

Tags: python beautifulsoup web-crawler imdb


【Solution 1】:

Actually, there is no need to use Selenium. The data is available by sending a GET request to the website's API, in the following format:

https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey=MY-KEY

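You must supply a key for the paginationKey parameter in the URL (...&paginationKey=MY-KEY).

The key can be found in the data-key attribute of the element with the class load-more-data: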

<div class="load-more-data" data-key="g4wp7crmqizdeyyf72ux5nrurdsmqhjjtzpwzouokkd2gbzgpnt6uc23o4zvtmzlb4d46f2swblzkwbgicjmquogo5tx2">
            </div>
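
To make the key lookup concrete, here is a minimal sketch of reading data-key out of such a div with BeautifulSoup. The function name extract_pagination_key and the shortened key "abc123" are made up for illustration; real keys are much longer, as in the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened key for illustration only.
SAMPLE_HTML = '<div class="load-more-data" data-key="abc123"></div>'


def extract_pagination_key(html):
    """Return the data-key of the load-more-data div, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="load-more-data")
    return div.get("data-key") if div else None


print(extract_pagination_key(SAMPLE_HTML))
print(extract_pagination_key("<div>no more pages</div>"))
```

Returning None when the div is missing gives the caller a simple signal that there are no further pages to request.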

So, to scrape all the reviews into a DataFrame, try:

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)
key = ""
data = {"title": [], "review": []}

while True:
    response = requests.get(url.format(key))
    soup = BeautifulSoup(response.content, "html.parser")

    # Collect the reviews on this page first, so the last page is not skipped
    for title, review in zip(
        soup.find_all(class_="title"), soup.find_all(class_="text show-more__control")
    ):
        data["title"].append(title.get_text(strip=True))
        data["review"].append(review.get_text())

    # Find the pagination key; a page without one is the last page
    pagination_key = soup.find("div", class_="load-more-data")
    if not pagination_key:
        break

    # Update the `key` variable in order to request the next page
    key = pagination_key["data-key"]

df = pd.DataFrame(data)
print(df)

Output (truncated):

                                                title                                             review
0                              Terrific entertainment  Spiderman: Far from Home is not intended to be...
1         THe illusion of the identity of Spider man.  Great story in continuation of spider man home...
2                       What Happened to the Bad Guys  I believe that Quinten Beck/Mysterio got what ...
3                                         Spectacular  One of the best if not the best Spider-Man mov...

...
...
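
As a side note on the Selenium approach from the question: a likely reason it returns only 25 reviews is that requests.get(driver.current_url) downloads a fresh copy of the page, which does not contain the reviews that were loaded by clicking the button in the browser. Below is a minimal sketch of parsing HTML that is passed in directly, assuming the selectors from the question (.title and .content .text) match the page. The names parse_reviews and SAMPLE_HTML are hypothetical, and the sample markup is a simplified stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the loaded reviews page (not real IMDB markup).
SAMPLE_HTML = """
<div id="main">
  <div class="review"><a class="title">Great fun</a>
    <div class="content"><div class="text">Loved it.</div></div></div>
  <div class="review"><a class="title">Meh</a>
    <div class="content"><div class="text">Too long.</div></div></div>
</div>
"""


def parse_reviews(html):
    """Return a list of (title, review text) pairs found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    titles = [t.get_text(strip=True) for t in soup.select(".title")]
    texts = [r.get_text(strip=True) for r in soup.select(".content .text")]
    return list(zip(titles, texts))


print(parse_reviews(SAMPLE_HTML))
```

In the question's script, this would be called as parse_reviews(driver.page_source) after the load-more loop finishes, instead of fetching the URL again with requests.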

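【Comments】:

  • This is helpful, but what would need to change to fetch only a limited number of reviews? E.g. if there are 2000 reviews but I only want to extract 500, how would the code change?
  • @MalviPatel You can use range() instead of while True.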
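
The range() suggestion could look roughly like the sketch below, which stops once a requested number of reviews has been collected or a page cap is reached. The names scrape_reviews, fetch_page, max_reviews, and max_pages are hypothetical, and the fetch parameter exists only so the loop can be tried without hitting the network. The number of reviews per response is not assumed; the loop counts what it actually receives.

```python
import requests
from bs4 import BeautifulSoup

URL = (
    "https://www.imdb.com/title/tt6320628/reviews/_ajax?ref_=undefined&paginationKey={}"
)


def fetch_page(key):
    """Download one page of reviews for the given pagination key."""
    return requests.get(URL.format(key)).content


def scrape_reviews(max_reviews, max_pages=100, fetch=fetch_page):
    """Collect at most max_reviews reviews, requesting at most max_pages pages."""
    data = {"title": [], "review": []}
    key = ""
    for _ in range(max_pages):
        soup = BeautifulSoup(fetch(key), "html.parser")
        for title, review in zip(
            soup.find_all(class_="title"),
            soup.find_all(class_="text show-more__control"),
        ):
            if len(data["title"]) >= max_reviews:
                return data
            data["title"].append(title.get_text(strip=True))
            data["review"].append(review.get_text())
        # Stop when the limit is reached or there is no next page
        pagination_key = soup.find("div", class_="load-more-data")
        if pagination_key is None or len(data["title"]) >= max_reviews:
            break
        key = pagination_key["data-key"]
    return data
```

For the 500-review case, the call would be data = scrape_reviews(500), followed by pd.DataFrame(data) as in the answer.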