Scraping multiple webpages with Python: answers

【Question title】: Scraping multiple webpages with Python
【Posted】: 2017-12-04 09:27:50
【Question description】:

from bs4 import BeautifulSoup
import urllib.request, time
class scrape(object):
    def __init__(self):
        self.urls = ['https://www.onthemarket.com/for-sale/property/wigan/', 'https://www.onthemarket.com/for-sale/property/wigan/?page=1', 'https://www.onthemarket.com/for-sale/property/wigan/?page=2', 'https://www.onthemarket.com/for-sale/property/wigan/?page=3', 'https://www.onthemarket.com/for-sale/property/wigan/?page=4', 'https://www.onthemarket.com/for-sale/property/wigan/?page=6']
        self.telephones = []
    def extract_info(self):
        for link in self.urls:
            data = urllib.request.urlopen(link).read()
            soup = BeautifulSoup(data, "lxml")
            for tel in soup.findAll("span", {"class":"call"}):
                self.telephones.append(tel.text.strip())
            time.sleep(1)
        return self.telephones

to = scrape()
print(to.extract_info())

What is wrong here? This code hangs after the second page. It should extract the telephone numbers from every webpage in the list self.urls.

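【Question comments】:

If you are getting any errors, please post them as well.
I tried your code and everything works fine. [Finished in 9.3s]
There are no errors. The Python shell is running but returns nothing. I am using Spyder with Python 3.6. I waited more than 5 minutes, but nothing happened.
Are you sure it is not a network problem? Is the URL being processed reachable at the moment it hangs?
ventik, yes, it could be a network problem, but in my case the first two pages were scraped correctly and after that it hung for no apparent reason. ventik, which Python IDE are you using?

Tags: python python-3.x web-scraping beautifulsoup web-crawler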

【Solution 1】:

All you need to do is pass a headers argument with your request and give it a go. Try this:

from bs4 import BeautifulSoup
import requests, time

class scrape(object):

    def __init__(self):
        self.urls = ['https://www.onthemarket.com/for-sale/property/wigan/', 'https://www.onthemarket.com/for-sale/property/wigan/?page=1', 'https://www.onthemarket.com/for-sale/property/wigan/?page=2', 'https://www.onthemarket.com/for-sale/property/wigan/?page=3', 'https://www.onthemarket.com/for-sale/property/wigan/?page=4', 'https://www.onthemarket.com/for-sale/property/wigan/?page=6']
        self.telephones = []

    def extract_info(self):
        for link in self.urls:
            data = requests.get(link,headers={"User-Agent":"Mozilla/5.0"}) #it should do the trick
            soup = BeautifulSoup(data.text, "lxml")
            for tel in soup.find_all("span",{"class":"call"}):
                self.telephones.append(tel.text.strip())
            time.sleep(1)
        return self.telephones

crawl = scrape()
print(crawl.extract_info())
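
For anyone who prefers to stay with urllib as in the question, the same idea can probably be sketched with the standard library alone. This is only a minimal sketch, not a tested fix for this particular site: the helper names build_request and fetch, and the 10-second timeout, are assumptions made for illustration. The timeout is there so that a stalled connection raises an error instead of waiting indefinitely.

```python
import urllib.request

# Assumed values for illustration; the answer above only specifies the User-Agent string.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}
DEFAULT_TIMEOUT = 10  # seconds; hypothetical choice, tune as needed


def build_request(url, headers=None):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch(url, timeout=DEFAULT_TIMEOUT):
    """Return the page body as bytes, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        print(f"Skipping {url}: {exc}")
        return None
```

In extract_info, data = fetch(link) could then stand in for the urlopen call, skipping the page when the result is None.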

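【Comments】:

By the way, in your case the first two pages worked and the rest did not, but in my case all I got was an empty list. After putting the headers into the request, though, I got it working perfectly, @FootAdministration.
Thanks Shahin, it worked for me! Great answer! Have a nice day!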