Scraping web data using Python

【Question title】: Scraping web data using Python
【Posted】: 2014-11-10 04:14:30
【Question】:

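I am trying to write code that scrapes data from the IMDb Top 250 page. The code I wrote is below. It runs and gives me the kind of output I expect, but the problem is the number of results it returns. When I run it on my laptop it produces 23 results, which are the first 23 movies listed by IMDb. When a friend runs it on his machine, it produces the correct 250 results. Why does this happen, and what should I do to avoid it?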

from bs4 import BeautifulSoup
import requests
import sys
from StringIO import StringIO

try:
    import cPickle as pickle
except:
    import pickle

url = 'http://www.imdb.com/chart/top'

response = requests.get(url)
soup = BeautifulSoup(response.text)

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.titleColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

print(len(movies))

for index in range(0, len(movies)):
    data = {"movie": movies[index].get_text(),
            "link": links[index],
            "starCast": crew[index],
            "rating": ratings[index],
            "vote": votes[index]}
    imdb.append(data)

print(imdb)


Test run result from my laptop:
['9.21', '9.176', '9.015', '8.935', '8.914', '8.903', '8.892', '8.889', '8.877', '8.817', '8.786', '8.76', '8.737', '8.733', '8.716', '8.703', '8.7', '8.69', '8.69', '8.678', '8.658', '8.629', '8.619']
23

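【Comments】:

Interesting - I get 250 results when I run it. Maybe check which versions of Python and BeautifulSoup you are using? Then break the code down: inspect response.text to see exactly what it contains in each case, and work out whether the problem lies in the request or in the parser.
@trvrm I tried several parsers - same result - 250.
Thanks. I am using exactly the same software and package versions on both systems. The Python version is 2.7.6, and I installed BeautifulSoup with pip install. And yes, the response.text stored in soup contains only the first 23 entries in the problem case.
I also get all 250 results. Are you receiving a complete HTML document that contains 23 entries? Does the document end with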
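
The check suggested in the comments (look at response.text itself before blaming the parser) could be sketched roughly as below. This is only a minimal sketch: `TitleCellCounter` and `diagnose` are hypothetical names made up for illustration, and the `titleColumn` class is taken from the selectors in the question. It uses only the standard library, so it does not depend on which BeautifulSoup parser is installed.

```python
from html.parser import HTMLParser


class TitleCellCounter(HTMLParser):
    """Counts <td class="titleColumn"> cells and notes whether </html> was seen."""

    def __init__(self):
        super().__init__()
        self.title_cells = 0
        self.saw_closing_html = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "titleColumn" in classes:
            self.title_cells += 1

    def handle_endtag(self, tag):
        if tag == "html":
            self.saw_closing_html = True


def diagnose(html_text):
    """Summarise a downloaded page so the output of two machines can be compared."""
    counter = TitleCellCounter()
    counter.feed(html_text)
    counter.close()
    return {
        "characters": len(html_text),
        "title_cells": counter.title_cells,
        "complete": counter.saw_closing_html,
    }
```

Running `diagnose(requests.get(url).text)` on both machines and comparing the dictionaries would show whether the short result is already present in the downloaded text (few title cells, possibly no closing `</html>`) or only appears after parsing.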

Tags: python web-scraping beautifulsoup

【Solution 1】:

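I realize this is a very old question, but I liked the idea and got the code working better. It now exposes more individual fields through separate variables. I fixed it for my own use, but I thought I would share it here in the hope that it helps someone else.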

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re

# Download IMDB's Top 250 data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

# Store each item in a dictionary (data), then collect those in a list (imdb)
for index in range(len(movies)):
    # Separate the cell text into 'place', 'title', 'year'
    # instead of "2.       The Godfather        (1972)"
    movie_string = movies[index].get_text()
    movie = ' '.join(movie_string.split())
    # One regex handles ranks of any width (1, 10, 100) and keeps dots in titles
    match = re.match(r'(\d+)\.\s*(.*?)\s*\((\d{4})\)$', movie)
    if match is None:
        continue  # skip cells that do not look like "<rank>. <title> (<year>)"
    place, movie_title, year = match.groups()
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)

# Print out some info
for item in imdb:
    print(item['place'], '-', item['movie_title'], '('+item['year']+') -', 'Starring:', item['star_cast'])
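
The step that splits a cell such as "2.       The Godfather        (1972)" into place, title and year can also be pulled out into a small helper and tried without downloading anything. The sketch below is one possible way to do it, not the code from this answer: `CELL_PATTERN` and `parse_title_cell` are hypothetical names, and it assumes every cell has the shape "<rank>. <title> (<four-digit year>)". Unlike the answer, it returns the place as an integer.

```python
import re

# Rank, title and four-digit year, e.g. "2. The Godfather (1972)"
CELL_PATTERN = re.compile(r"^(\d+)\.\s*(.*?)\s*\((\d{4})\)$")


def parse_title_cell(cell_text):
    """Split the text of one title cell into place, title and year.

    Returns None when the text does not look like "<rank>. <title> (<year>)".
    """
    # Squeeze newlines and runs of spaces into single spaces
    collapsed = " ".join(cell_text.split())
    match = CELL_PATTERN.match(collapsed)
    if match is None:
        return None
    place, title, year = match.groups()
    return {"place": int(place), "movie_title": title, "year": year}


if __name__ == "__main__":
    print(parse_title_cell("2.\n      The Godfather\n        (1972)"))
```

Matching the rank with `\d+` rather than slicing by the length of the loop index keeps two- and three-digit places intact, and dots that belong to a title (for example "Mr. Smith Goes to Washington") are preserved.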

【Comments】: