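[Posted]: 2021-07-26 08:07:07
[Question]:
I am currently trying to scrape all the reviews for the movie Spiderman Homecoming, but I can only get the first 25. IMDb initially shows only the first 25 reviews, and I am able to click "load more" until all of them are displayed, but for some reason I cannot extract all the reviews once they have been loaded. Does anyone know what I am doing wrong?
Below is the code I am running: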
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
#Set the web browser
driver = webdriver.Chrome(executable_path=r"C:\Users\Kent_\Desktop\WorkStudy\chromedriver.exe")
#Go to the IMDb reviews page
driver.get("https://www.imdb.com/title/tt6320628/reviews?ref_=tt_urv")
#Keep clicking the "load more" button until it can no longer be clicked
wait = WebDriverWait(driver,10)
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break
#Scrape IMDb reviews
ans = driver.current_url
page = requests.get(ans)
soup = BeautifulSoup(page.content, "html.parser")
all = soup.find(id="main")
#Get the title of the movie
all = soup.find(id="main")
parent = all.find(class_ ="parent")
name = parent.find(itemprop = "name")
url = name.find(itemprop = 'url')
film_title = url.get_text()
print('Pass finding phase.....')
#Get the title of the review
title_rev = all.select(".title")
title = [t.get_text().replace("\n", "") for t in title_rev]
print('getting title of reviews and saving into a list')
#Get the review
review_rev = all.select(".content .text")
review = [r.get_text() for r in review_rev]
print('getting content of reviews and saving into a list')
#Make it into dataframe
table_review = pd.DataFrame({
    "Title" : title,
    "Review" : review
})
table_review.to_csv('Spiderman_Reviews.csv')
print(title)
print(review)
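[Comments]:
-
Are only the ones in the viewport present at any given time? If you run the selector in the find box of the browser's Elements tab, after making sure all the results are present, what match count is given?
-
The match count given is 2130, which seems correct. What could the problem be?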
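One possible explanation, consistent with the 2130 matches seen in the browser: `requests.get(driver.current_url)` downloads a fresh copy of the page, and that copy contains none of the reviews that the "load more" clicks added to the Selenium-controlled browser. Below is a minimal sketch of parsing HTML that is already rendered instead. The helper name `parse_reviews` and the sample markup are hypothetical. The `.title` and `.content .text` selectors are taken from the question's code. In the real script the input would presumably be `driver.page_source` after the loop has finished.

```python
from bs4 import BeautifulSoup


def parse_reviews(html):
    """Return (titles, reviews) extracted from an HTML string.

    Hypothetical helper: it parses whatever markup it is given, so it can be
    fed driver.page_source (the DOM after all clicks) rather than the
    response of a second HTTP request.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Fall back to the whole document if there is no element with id="main".
    main = soup.find(id="main") or soup
    titles = [t.get_text().replace("\n", "").strip() for t in main.select(".title")]
    reviews = [r.get_text() for r in main.select(".content .text")]
    return titles, reviews


# Made-up stand-in for driver.page_source, NOT real IMDb markup.
sample_html = """
<div id="main">
  <div class="review"><a class="title">Great fun
</a><div class="content"><div class="text">Loved it.</div></div></div>
  <div class="review"><a class="title">Too long
</a><div class="content"><div class="text">Dragged a bit.</div></div></div>
</div>
"""

titles, reviews = parse_reviews(sample_html)
print(len(titles), len(reviews))
```

If this assumption holds, calling something like `title, review = parse_reviews(driver.page_source)` once the loop exits would take the place of the `requests.get(ans)` step, so the lists are built from the same DOM in which the browser reports all the matches.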
Tags: python beautifulsoup web-crawler imdb