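We start from the shop's home page URL: https://goodbaby.tmall.com/shop/view_shop.htm?spm=a230r.7195193.1997079397.2.3RayhH
We want to collect every item in the shop, sorted by sales volume, as shown here:
Clicking through, we can see the link to the list page:
Viewing the page source shows that Taobao keeps the item data inside JavaScript rather than in the HTML itself: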
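So we look for the endpoint the page calls and request it directly. The request URL can be read from the Headers panel in the browser's developer tools. In that URL, p is the current page number, so we change p and loop over however many pages the shop has.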
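As a minimal sketch of the paging idea, the list URL can be assembled from a fixed set of query parameters plus a changing `p`. The endpoint and parameter values below are copied from the request URL used in this post; the names `BASE_URL`, `FIXED_PARAMS` and `build_page_url` are only illustrative, and the parameter order differs from the original URL, which is assumed not to matter to the server.

```python
from urllib.parse import urlencode

# Endpoint and fixed parameters copied from the request URL used in this post.
BASE_URL = "https://goodbaby.m.tmall.com/shop/shop_auction_search.do"
FIXED_PARAMS = {
    "spm": "a1z60.7754813.0.0.301755f0pZ1GjU",
    "suid": "379833581",
    "sort": "s",          # sort by sales
    "page_size": "12",
    "from": "h5",
    "shop_id": "60650834",
    "ajson": "1",         # ask for a JSON response
    "_tm_source": "tmallsearch",
}


def build_page_url(page):
    """Return the list URL for one page; only `p` changes between pages."""
    params = dict(FIXED_PARAMS, p=str(page))
    return BASE_URL + "?" + urlencode(params)


# Pages 1..35, the same range the script in this post iterates over.
page_urls = [build_page_url(page) for page in range(1, 36)]
print(page_urls[0])
```

Building the query with `urlencode` keeps the page number separate from the rest of the URL, so changing the shop or the page size later means touching one dictionary entry instead of a long string.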
Finally, we write the scraped data to an Excel file and we are done. Here is the code:
import json

import requests
import xlsxwriter

# Output workbook; adjust the path for your own machine.
workbook = xlsxwriter.Workbook("e:\\data.xlsx")
worksheet = workbook.add_worksheet()

# Header row (row 1 of the sheet).
columns = ['item_id', 'title', 'img', 'sold', 'quantity',
           'totalSoldQuantity', 'url', 'price']
worksheet.write_row('A1', columns)

# Request headers copied from the browser; they are the same for every page.
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    'Referer': r'https://goodbaby.m.tmall.com/shop/shop_auction_search.htm?spm=a1z60.7754813.0.0.301755f0pZ1GjU&suid=379833581&sort=default',
}

row = 1  # last row written so far; data starts on row 2
for page in range(1, 36):  # the shop had 35 pages when this was written
    page_url = ('https://goodbaby.m.tmall.com/shop/shop_auction_search.do'
                '?spm=a1z60.7754813.0.0.301755f0pZ1GjU&suid=379833581&sort=s&p='
                + str(page) +
                '&page_size=12&from=h5&shop_id=60650834&ajson=1&_tm_source=tmallsearch')
    response = requests.get(page_url, headers=headers)
    data = json.loads(response.text)
    # Guard against a response without an 'items' key.
    for item in data.get('items') or []:
        print(item)
        row += 1
        # One item per row, in the same order as the header columns.
        worksheet.write_row('A%d' % row, [item.get(name) for name in columns])

workbook.close()
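The script above hard-codes 35 pages. One possible alternative, sketched here under the unverified assumption that the endpoint returns an empty `items` list once `p` goes past the last page, is to keep requesting pages until one comes back empty. `scrape_all_pages` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the real network call and returns made-up data so the sketch can run offline.

```python
def scrape_all_pages(fetch_page, max_pages=1000):
    """Collect items page by page until a page comes back without items.

    fetch_page is a hypothetical callable: page number -> parsed JSON dict.
    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    all_items = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        items = data.get("items") or []
        if not items:
            break  # assumed signal that we are past the last page
        all_items.extend(items)
    return all_items


def fake_fetch(page):
    """Offline stand-in for the request: three pages of two fake items each."""
    if page > 3:
        return {"items": []}
    return {"items": [{"item_id": page * 10 + n} for n in range(2)]}


items = scrape_all_pages(fake_fetch)
print(len(items))  # 3 pages x 2 items = 6
```

In a real run, `fetch_page` would wrap the `requests.get` and `json.loads` calls from the script above.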
This is the final scraped result: