【Question title】: Why is the 'Genre' data not written to the .csv file in my code?
【Posted】: 2019-09-02 01:55:27
【Question】:

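I am trying to learn web scraping with BeautifulSoup and have written the code below. However, only the movie titles are written to the CSV file, not the genres, even though both are retrieved.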

URL: http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012

f = csv.writer(open('movie-names.csv', 'w'))
f.writerow(['Title', 'Genre'])

pages = []
genre;


for i in range(1,2):
    url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
    pages.append(url)


for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    movie_titles = soup.find_all(class_ = 'lister-item-content')

    for movie_title in movie_titles:
        title = movie_title.find('a').contents[0]
        genre = movie_title.find_all(class_ = 'genre')[0].get_text()
        print(genre)
        f.writerow([title, genre])

【Comments on the question】:

  • Is the genre; line at the top of your code a typo?

Tags: python python-3.x csv web-scraping beautifulsoup


【Solution 1】:

Exporting the data to CSV is much easier with pandas.

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Collect the URLs of the result pages to scrape (a single page here).
pages = []

for i in range(1,2):
    url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
    pages.append(url)

Movie_title=[]
Movie_genre=[]
for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Select every element with the class 'lister-item-content'.
    movie_titles = soup.select('.lister-item-content')

    for movie_title in movie_titles:
        # Take the text of the first link in the block as the title.
        title = movie_title.select_one('a').text
        Movie_title.append(title)
        # Drop the newline characters embedded in the genre text.
        genre = movie_title.select_one('.genre').text.replace('\n','')
        Movie_genre.append(genre)


# Build a two-column table and write it out without the index column.
df = pd.DataFrame({"Movie_title":Movie_title,"Movie_genre":Movie_genre})
df.to_csv("movie-names.csv",index=False)
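
The .replace('\n','') call in the code above suggests that the genre text on the page contains newline characters. The thread does not confirm this as the cause of the original problem, but it is easy to see what such a value does to a CSV row. The sketch below uses only the standard library; the raw_genre value is a hypothetical sample (a leading newline plus trailing padding), and rows_to_csv is a helper name made up for this illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text in memory (no file needed for the demo)."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample of what get_text() might return for a genre element.
raw_genre = "\nDrama            "

raw_csv = rows_to_csv([["The Shawshank Redemption", raw_genre]])
clean_csv = rows_to_csv([["The Shawshank Redemption", raw_genre.strip()]])

# The raw value is quoted and the record spans two physical lines,
# so a viewer that shows one line per row may appear to lack the genre.
print(repr(raw_csv))

# After strip() the record fits on a single physical line.
print(repr(clean_csv))
```

If this is what happens on the real page, calling .strip() on the genre text (or .replace('\n','') as above) before writing the row keeps each record on one line.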

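【Discussion】:

  • Thank you very much!! This works, but I still can't understand why the other implementation doesn't work for me. Maybe a version difference?

【Solution 2】: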

This should work:

import requests
from bs4 import BeautifulSoup
import csv

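# newline="" is what the csv module documentation recommends for file objects;
# without it, extra blank rows can appear on Windows.
with open("movie-names.csv", "w", newline="") as f: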
    writer = csv.writer(f)
    writer.writerow(['Title', 'Genre'])

    pages = []
    genre = []


    for i in range(1,2):
        url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
        pages.append(url)


    for item in pages:
        page = requests.get(item)
        soup = BeautifulSoup(page.text, 'html.parser')

        movie_titles = soup.find_all(class_ = 'lister-item-content')

        for movie_title in movie_titles:
            title = movie_title.find('a').contents[0]
            genre = movie_title.find_all(class_ = 'genre')[0].get_text()
            print(title, genre)
            writer.writerow([title, genre])

Here is an excerpt of the .csv contents I get when I run the code:

Title                     Genre
The Shawshank Redemption  Drama
The Dark Knight           Action, Crime, Drama
Inception                 Action, Adventure, Sci-Fi
Fight Club                Drama
Pulp Fiction              Crime, Drama
Forrest Gump              Drama, Romance
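
One difference from the code in the question is that the file is opened in a with block, so it is closed (and its buffer flushed) as soon as the block ends. The thread does not establish whether this explains the missing column, so the following is only a sketch of the pattern itself. It writes a few rows taken from the excerpt above to a temporary file and reads them back; the temporary directory is an assumption made to keep the example free of side effects.

```python
import csv
import os
import tempfile

rows = [
    ['Title', 'Genre'],
    ['The Shawshank Redemption', 'Drama'],
    ['The Dark Knight', 'Action, Crime, Drama'],
]

# Work in a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'movie-names.csv')

    # Open with newline='' as the csv module documentation recommends.
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(rows)
    # The file is closed here, so every buffered row has reached the disk.
    file_was_closed = f.closed

    # Read the file back to confirm that both columns were stored.
    with open(csv_path, newline='', encoding='utf-8') as f:
        rows_read = list(csv.reader(f))

print(rows_read)
```

A genre such as "Action, Crime, Drama" contains commas, so the writer quotes it, and csv.reader restores it as a single field.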

Note that this for loop:

for i in range(1,2):
    url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
    pages.append(url)

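is pointless, because in this case only one URL is appended. In the general case, range(1, n) with n > 2 appends the same URL n-1 times. Is that what you intended?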
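
If the intent was to fetch several result pages, the URL has to change on each iteration. A minimal sketch follows, assuming that the start query parameter is the 1-based index of the first result and that each page holds 50 results. The page size is a guess based on the 51-row CSV (header plus 50 movies) mentioned in the discussion, not something verified against the site. BASE_URL, RESULTS_PER_PAGE and build_page_urls are hypothetical names introduced for this example.

```python
# URL template taken from the question, with the start value made variable.
BASE_URL = ('http://www.imdb.com/search/title?sort=num_votes,desc'
            '&start={start}&title_type=feature&year=1950,2012')

# Assumption: 50 results per page (not verified against the site).
RESULTS_PER_PAGE = 50


def build_page_urls(page_count):
    """Return one distinct URL per result page."""
    return [
        BASE_URL.format(start=1 + page_index * RESULTS_PER_PAGE)
        for page_index in range(page_count)
    ]


pages = build_page_urls(3)
for url in pages:
    print(url)
```

The first URL produced is identical to the one hard-coded in the question; the later ones differ only in the start value.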

【Discussion】:

  • Unfortunately not. The problem still persists :(
  • I get a CSV with two columns and 51 rows; take a look at my answer.