【Question title】: Asynchronous scraping with Python: grequests and BeautifulSoup4
【Posted】: 2017-05-02 11:30:16
【Question description】:

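I'm trying to scrape this website. I managed to do it with urllib and BeautifulSoup, but urllib is too slow. There are thousands of URLs, so I want to send the requests asynchronously. One nice package I found for this is grequests.

Example: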

import grequests
from bs4 import BeautifulSoup

# Base search URL; later result pages are reached by appending /offset_<n>.
base_url = "https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"

# The first page has no offset; each following page advances by 10 results.
pages = [base_url]
for i in range(1, 999):
    pages.append(base_url + "/offset_{}".format(i * 10))

# Build the unsent requests lazily, then send them all concurrently.
rs = (grequests.get(item) for item in pages)
a = grequests.map(rs)

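The problem is that I don't know how to continue from here with BeautifulSoup, that is, how to get the HTML of each page. I'd be glad to hear your ideas. Thanks!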

【Comments】:

  • I suggest you try Scrapy. The framework is built on the Twisted asynchronous networking library and is faster than bs4 + urllib.

Tags: python-3.x asynchronous web-scraping beautifulsoup grequests


【Solution 1】:

Refer to the script below, and also check the linked source. It should help.

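import grequests
from bs4 import BeautifulSoup

# `links` is the list of page URLs to fetch (for example, `pages` from the question).
reqs = (grequests.get(link) for link in links)
# size=10 caps concurrency at 10 requests; imap yields responses as they complete.
resp = grequests.imap(reqs, size=10)

for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    # These class names come from the site used in the source article;
    # replace them with selectors for your own target page.
    # Indexing with [0] raises IndexError if nothing matches.
    results = soup.find_all('a', attrs={"class": 'product__list-name'})
    print(results[0].text)
    prices = soup.find_all('span', attrs={'class': "pdpPriceMrp"})
    print(prices[0].text)
    discount = soup.find_all("div", attrs={"class": "listingDiscnt"})
    print(discount[0].text)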

Source: https://blog.datahut.co/asynchronous-web-scraping-using-python/
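
Applied to the question's own code, where `a` is the list returned by `grequests.map`, the step from responses to parsed HTML might look like the sketch below. `grequests.map` puts `None` in the list for any request that failed, so those entries are skipped. The helper name `parse_responses`, the status-code check, and the use of the built-in `html.parser` are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup


def parse_responses(responses):
    """Yield (url, soup) for every successful response.

    `responses` is the list returned by grequests.map(); an entry is None
    when its request raised (timeout, connection error, and so on).
    """
    for r in responses:
        if r is None or r.status_code != 200:
            continue  # skip failed or non-OK requests
        # html.parser ships with Python; pass 'lxml' instead if it is installed.
        yield r.url, BeautifulSoup(r.text, "html.parser")


# Usage with the question's code (needs network access):
#   a = grequests.map(rs, size=10)
#   for url, soup in parse_responses(a):
#       print(url, soup.title.string if soup.title else None)
```

Each `soup` object is the parsed HTML of one page, so calls such as `soup.find_all(...)` from the script above can be applied to it directly. Keeping the parsing separate from the fetching also makes it easy to try the selectors on a saved page before sending thousands of requests.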

【Discussion】:
