【Question title】: Scraping multiple webpages with Python
【Posted】: 2017-12-04 09:27:50
【Question】:
from bs4 import BeautifulSoup
import urllib, time
class scrape(object):
    def __init__(self):
        self.urls = ['https://www.onthemarket.com/for-sale/property/wigan/', 'https://www.onthemarket.com/for-sale/property/wigan/?page=1', 'https://www.onthemarket.com/for-sale/property/wigan/?page=2', 'https://www.onthemarket.com/for-sale/property/wigan/?page=3', 'https://www.onthemarket.com/for-sale/property/wigan/?page=4', 'https://www.onthemarket.com/for-sale/property/wigan/?page=6']
        self.telephones = []
    def extract_info(self):
        for link in self.urls:
            data = urllib.request.urlopen(link).read()
            soup = BeautifulSoup(data, "lxml")
            for tel in soup.findAll("span", {"class":"call"}):
                self.telephones.append(tel.text.strip())
            time.sleep(1)
        return self.telephones

to = scrape()
print(to.extract_info())

What is going wrong? This code hangs after the second page. It should extract the telephone numbers from every webpage in the list self.urls.

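【Comments】:

  • If you get any errors, please post them as well.
  • I tried your code and everything works fine. [Finished in 9.3s]
  • There are no errors. The Python shell is running but returns nothing. I'm using Spyder with Python 3.6. I waited more than 5 minutes, but nothing happened.
  • Are you sure it isn't a network problem? Are the URLs being processed reachable while it hangs?
  • ventik, yes, it could be a network problem, but in my case the first two pages are scraped correctly and then it hangs for no apparent reason. ventik, which Python IDE are you using?

Tags: python python-3.x web-scraping beautifulsoup web-crawler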


【Solution 1】:

All you need to do is add a headers argument to your request and try again. Try this:

from bs4 import BeautifulSoup
import requests, time

class scrape(object):

    def __init__(self):
        self.urls = ['https://www.onthemarket.com/for-sale/property/wigan/', 'https://www.onthemarket.com/for-sale/property/wigan/?page=1', 'https://www.onthemarket.com/for-sale/property/wigan/?page=2', 'https://www.onthemarket.com/for-sale/property/wigan/?page=3', 'https://www.onthemarket.com/for-sale/property/wigan/?page=4', 'https://www.onthemarket.com/for-sale/property/wigan/?page=6']
        self.telephones = []

    def extract_info(self):
        for link in self.urls:
            data = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})  # a browser-like User-Agent header should do the trick
            soup = BeautifulSoup(data.text, "lxml")
            for tel in soup.find_all("span",{"class":"call"}):
                self.telephones.append(tel.text.strip())
            time.sleep(1)
        return self.telephones

crawl = scrape()
print(crawl.extract_info())
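
For anyone who prefers to stay with urllib as in the question, the same idea can be sketched with the standard library alone. This is only an illustration: the helper names build_request and fetch and the 10-second timeout are assumptions, not part of the original answer, and the thread does not confirm what caused the hang. The timeout is simply a precaution so that a stalled connection raises an error instead of blocking forever.

```python
import urllib.request

# Assumed values for illustration; the timeout is not part of the original answer.
USER_AGENT = "Mozilla/5.0"
TIMEOUT_SECONDS = 10


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=TIMEOUT_SECONDS):
    """Return the page body as bytes, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"skipping {url}: {exc}")
        return None
```

Inside extract_info, a call such as fetch(link) could stand in for requests.get(...), with the loop skipping any page for which it returns None before handing the bytes to BeautifulSoup.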

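【Comments】:

  • By the way, in your case two pages worked and the rest did not, whereas in my case all I got was an empty list. After putting the headers into the request parameters, though, I got it working perfectly, @FootAdministration.
  • Thanks Shahin, it worked for me! Great answer! Have a nice day!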