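【Posted】: 2019-08-25 04:11:28
【Question】:
I want to extract at least 20 user reviews for each movie, but I don't know how to loop into each IMDb title page and then get the user reviews with BeautifulSoup.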
title_link(1) = "https://www.imdb.com/title/tt7131622/?ref_=adv_li_tt";
user_reviews_link_movie1 = "https://www.imdb.com/title/tt7131622/reviews?ref_=tt_ov_rt" ;
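The two links above differ only in the path after the title ID, so the reviews URL can be derived from the title link. Here is a minimal sketch, assuming the `ref_` query parameter is only a tracking tag that can be dropped; `reviews_url` and `BASE_URL` are hypothetical names, not part of the original code.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"  # assumption: all links live on this host


def reviews_url(title_href):
    """Build the user-reviews URL from a title link (absolute or relative)."""
    # The title ID is the "tt" + digits segment that follows /title/
    match = re.search(r"/title/(tt\d+)", title_href)
    if match is None:
        raise ValueError(f"no IMDb title ID in {title_href!r}")
    # Assumption: the ref_ query parameter is optional, so it is left out
    return urljoin(BASE_URL, f"/title/{match.group(1)}/reviews")


print(reviews_url("https://www.imdb.com/title/tt7131622/?ref_=adv_li_tt"))
```

The same helper accepts a relative link such as `/title/tt7131622/?ref_=adv_li_tt`, which is what the `href` attribute of a search-result anchor would typically hold.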
I was able to extract the title, year, rating, and Metascore of each movie in the list from the static page.
# Import packages and set the search URL
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.imdb.com/search/title/?title_type=feature,tv_movie&release_date=2018-01-01,2019-12-31&count=250'
response = get(url)
print(response.text[:500])

# Parse the page and collect one container per movie
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from each movie container
for container in movie_containers:
    # Only keep movies that have a Metascore
    if container.find('div', class_='ratings-metascore') is not None:
        # The name
        name = container.h3.a.text
        names.append(name)
        # The year
        year = container.h3.find('span', class_='lister-item-year').text
        years.append(year)
        # The IMDb rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        # The Metascore
        m_score = container.find('span', class_='metascore').text
        metascores.append(int(m_score))

test_df = pd.DataFrame({'movie': names, 'year': years, 'imdb': imdb_ratings, 'metascore': metascores})
test_df
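For the missing step (visiting each title's reviews page and pulling the review text), a rough sketch could look like the following. The CSS selector `div.text.show-more__control` is an assumption about the reviews-page markup and may not match the live site; `extract_reviews`, `collect_reviews`, and the `fetch` argument are hypothetical names. The demo parses a small inline HTML sample, so it runs without network access.

```python
from bs4 import BeautifulSoup


def extract_reviews(reviews_html, limit=20):
    """Return up to `limit` review texts from one reviews page's HTML."""
    soup = BeautifulSoup(reviews_html, "html.parser")
    # ASSUMPTION: each review body is a <div class="text show-more__control">
    blocks = soup.select("div.text.show-more__control")
    return [block.get_text(" ", strip=True) for block in blocks[:limit]]


def collect_reviews(title_ids, fetch, limit=20):
    """Return one {'movie_id', 'review'} row per review.

    `fetch` is any callable that maps a URL to the page HTML.
    """
    rows = []
    for title_id in title_ids:
        html = fetch(f"https://www.imdb.com/title/{title_id}/reviews")
        for review in extract_reviews(html, limit):
            rows.append({"movie_id": title_id, "review": review})
    return rows


# Offline demo with made-up markup that follows the assumed structure
sample_html = """
<div class="lister-item-content">
  <div class="text show-more__control">Great film.</div>
</div>
<div class="lister-item-content">
  <div class="text show-more__control">Too long,<br/>but fun.</div>
</div>
"""
print(extract_reviews(sample_html))
print(collect_reviews(["tt7131622"], lambda url: sample_html))
```

Against the real site, `fetch` could be `lambda url: get(url).text` with the `get` already imported from `requests`. If the first reviews page holds fewer than 20 reviews, this sketch simply returns fewer rows; it does not follow any "load more" pagination.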
-
Actual results:
movie                                year    imdb   metascore
Once Upon a Time... in Hollywood     (2019)  (8.1)  (83)
Scary Stories to Tell in the Dark    (2019)  (6.5)  (61)
Fast & Furious: Hobbs & Shaw         (2019)  (6.8)  (60)
Avengers: Endgame                    (2019)  (8.6)  (78)
-
Expected:
movie1 year1 imdb1 metascore1 review1
movie1 year1 imdb1 metascore1 review2
...
movie1 year1 imdb1 metascore1 review20
movie2 year2 imdb2 metascore2 review1
...
movie2 year2 imdb2 metascore2 review20
...
movie250 year250 imdb250 metascore250 review20
【Discussion】:
-
Why repeat movie1 year1 imdb1 metascore1 20 times?
-
To get 20 reviews for each movie.
-
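Yes, I get that, but it doesn't mean you have to repeat the same fields 20 times for each of 250 movies. I'm not a database expert, but you should probably consider using two DataFrames, one for movies only and one for reviews only, related through a common key such as the movie name (if the names are all unique) or a movie ID that you assign and include in both DataFrames.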
-
So, given the comment above, are you still fine with repeating each movie name and its other fields 20 times in the resulting DataFrame?
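To make the two-DataFrame suggestion concrete, here is a small sketch with placeholder data (the IDs, names, and values below are invented for illustration). It keeps movies and reviews in separate frames keyed by a hypothetical `movie_id` column, and only produces the repeated "long" layout from the question when it is actually needed, via `DataFrame.merge`.

```python
import pandas as pd

# One row per movie (placeholder values, not scraped data)
movies_df = pd.DataFrame({
    "movie_id": ["tt0000001", "tt0000002"],
    "movie": ["movie1", "movie2"],
    "year": ["(2019)", "(2019)"],
    "imdb": [8.1, 6.5],
    "metascore": [83, 61],
})

# One row per review, linked to its movie through movie_id
reviews_df = pd.DataFrame({
    "movie_id": ["tt0000001", "tt0000001", "tt0000002"],
    "review": ["review1", "review2", "review1"],
})

# Join on the shared key; movie fields repeat once per review
long_df = movies_df.merge(reviews_df, on="movie_id", how="left")
print(long_df)
```

With `how="left"`, a movie that has no reviews still appears once, with a missing value in the `review` column.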
Tags: python web-scraping beautifulsoup python-requests imdb