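【Question title】: Cannot scrape top selling products from shopee.com
【Posted】: 2020-07-19 18:18:25
【Question】:

I am trying to scrape the names, categories, and sales volumes of the top products from the Indonesian e-commerce site https://shopee.co.id/top_products, using Python with the requests and BeautifulSoup packages, but I am running into a lot of trouble. This is my first attempt: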

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    'cookie': '_gcl_au=1.1.961206468.1594951946; _med=refer; _fbp=fb.2.1594951949275.1940955365; SPC_IA=-1; SPC_F=y1evilme0ImdfEmNWEc08bul3d8toc33; REC_T_ID=fab983c8-c7d2-11ea-a977-ccbbfe23657a; SPC_SI=uv1y64sfvhx3w6dir503ixw89ve2ixt4; _gid=GA1.3.413262278.1594951963; SPC_U=286107140; SPC_EC=GwoQmu7TiknULYXKODlEi5vEgjawyqNcpIWQjoxjQEW2yJ3H/jsB1Pw9iCgGRGYFfAkT/Ej00ruDcf7DHjg4eNGWbCG+0uXcKb7bqLDcn+A2hEl1XMtj1FCCIES7k17xoVdYW1tGg0qaXnSz0/Uf3iaEIIk7Q9rqsnT+COWVg8Y=; csrftoken=5MdKKnZH5boQXpaAza1kOVLRFBjx1eij; welcomePkgShown=true; _ga=GA1.1.1693450966.1594951955; _dc_gtm_UA-61904553-8=1; REC_MD_30_2002454304=1595153616; _ga_SW6D8G0HXK=GS1.1.1595152099.14.1.1595153019.0; REC_MD_41_1000044=1595153318_0_50_0_49; SPC_R_T_ID="Am9bCo3cc3Jno2mV5RDkLJIVsbIWEDTC6ezJknXdVVRfxlQRoGDcya57fIQsioFKZWhP8/9PAGhldR0L/efzcrKONe62GAzvsztkZHfAl0I="; SPC_T_IV="IETR5YkWloW3OcKf80c6RQ=="; SPC_R_T_IV="IETR5YkWloW3OcKf80c6RQ=="; SPC_T_ID="Am9bCo3cc3Jno2mV5RDkLJIVsbIWEDTC6ezJknXdVVRfxlQRoGDcya57fIQsioFKZWhP8/9PAGhldR0L/efzcrKONe62GAzvsztkZHfAl0I="'
}

shopee_url = 'https://shopee.co.id/top_products'

response = requests.get(shopee_url, headers=headers)
response.json()

But this raises a "JSONDecodeError". I think that is because the content I fetched looks like this: view-source:https://shopee.co.id/top_products. This is my second attempt:

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    'cookie': '_gcl_au=1.1.961206468.1594951946; _med=refer; _fbp=fb.2.1594951949275.1940955365; SPC_IA=-1; SPC_F=y1evilme0ImdfEmNWEc08bul3d8toc33; REC_T_ID=fab983c8-c7d2-11ea-a977-ccbbfe23657a; SPC_SI=uv1y64sfvhx3w6dir503ixw89ve2ixt4; _gid=GA1.3.413262278.1594951963; SPC_U=286107140; SPC_EC=GwoQmu7TiknULYXKODlEi5vEgjawyqNcpIWQjoxjQEW2yJ3H/jsB1Pw9iCgGRGYFfAkT/Ej00ruDcf7DHjg4eNGWbCG+0uXcKb7bqLDcn+A2hEl1XMtj1FCCIES7k17xoVdYW1tGg0qaXnSz0/Uf3iaEIIk7Q9rqsnT+COWVg8Y=; csrftoken=5MdKKnZH5boQXpaAza1kOVLRFBjx1eij; welcomePkgShown=true; _ga=GA1.1.1693450966.1594951955; _dc_gtm_UA-61904553-8=1; REC_MD_30_2002454304=1595153616; _ga_SW6D8G0HXK=GS1.1.1595152099.14.1.1595153019.0; REC_MD_41_1000044=1595153318_0_50_0_49; SPC_R_T_ID="Am9bCo3cc3Jno2mV5RDkLJIVsbIWEDTC6ezJknXdVVRfxlQRoGDcya57fIQsioFKZWhP8/9PAGhldR0L/efzcrKONe62GAzvsztkZHfAl0I="; SPC_T_IV="IETR5YkWloW3OcKf80c6RQ=="; SPC_R_T_IV="IETR5YkWloW3OcKf80c6RQ=="; SPC_T_ID="Am9bCo3cc3Jno2mV5RDkLJIVsbIWEDTC6ezJknXdVVRfxlQRoGDcya57fIQsioFKZWhP8/9PAGhldR0L/efzcrKONe62GAzvsztkZHfAl0I="'
}

shopee_url = 'https://shopee.co.id/top_products'
response = requests.get(shopee_url, headers=headers)
soup = bs(response.text, "html.parser")

products = soup.select("._3S8sOC _2QfAXF")
print(type(products))
print(products)

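But this returns an empty list, and I don't know why. Thanks for reading this far! I did not run into these problems in my earlier web-scraping exercises.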
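
One detail of the selector can be checked offline. In CSS, `._3S8sOC _2QfAXF` (no dot before the second name) asks for an element whose tag name is `_2QfAXF` nested inside an element with class `_3S8sOC`, which is probably not what was intended. The sketch below uses made-up markup that reuses the two class names. The nesting and the text are assumptions, not the real Shopee page. It also shows that even a corrected selector finds nothing when the fetched HTML is only an empty shell that JavaScript fills in later.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the two class names come from the question;
# the nesting and the text are invented for illustration.
rendered_html = (
    '<div class="_3S8sOC">'
    '<div class="_2QfAXF">Example product</div>'
    '</div>'
)
soup = BeautifulSoup(rendered_html, "html.parser")

# As written in the question: the second token is parsed as a tag name.
as_written = soup.select("._3S8sOC _2QfAXF")

# With a leading dot, the second token is a class selector.
with_dot = soup.select("._3S8sOC ._2QfAXF")

print(len(as_written), len(with_dot))

# Hypothetical "empty shell" page: nothing to select until JavaScript runs,
# so even the corrected selector returns an empty list.
empty_shell = BeautifulSoup('<div id="main"></div>', "html.parser")
shell_matches = empty_shell.select("._3S8sOC ._2QfAXF")
print(shell_matches)
```

If both classes actually sit on the same element, the selector would instead be `._3S8sOC._2QfAXF` (no space); which of the two applies depends on the real markup.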

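【Discussion】:

    Tags: python beautifulsoup python-requests web-crawler


    【Solution 1】:

    If you watch the network calls the site makes to load its content, you will see that the content is loaded by JavaScript calls. The following script returns the data for all the different tabs on the site, e.g. Kuota Data Internet, Hijab Instan, etc.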

    import requests, json
    
    res = requests.get("https://shopee.co.id/api/v4/recommend/recommend?bundle=top_sold_product_microsite&limit=20&offset=0")
    
    data_json = res.json()
    
    with open("data.json","w") as f:
        json.dump(data_json,f)
    

    The script above saves the data to a JSON file. Sample output of the data:

    {"data": {"update_time": 1595183508, "version": "1595183688", "sections": [{"total": 20, "key": "tspmicrosite_sec", "index": [{"data_type": "top_product", "key": "ID_V2L0_65"}, {"data_type": "top_product", "key": "ID_V2L0_3693"}, {"data_type": "top_product", "key": "ID_V2L0_2"}, {"data_type": "top_product", "key": "ID_V2L0_19"}, {"data_type": "top_product", "key": "ID_V2L0_75"}, {"data_type": "top_product", "key": "ID_V2L0_4040"}, {"data_type": "top_product", "key": "ID_V2L0_877"}, {"data_type": "top_product", "key": "ID_V2L0_15"}, {"data_type": "top_product", "key": "ID_V2L0_10"}, {"data_type": "top_product", "key": "ID_V2L0_7"}, {"data_type": "top_product", "key": "ID_V2L0_722"}, {"data_type": "top_product", "key": "ID_V2L0_285"}, {"data_type": "top_product", "key": "ID_V2L0_20"}, {"data_type": "top_product", "key": "ID_V2L0_66"}, {"data_type": "top_product", "key": "ID_V2L0_5831"}, {"data_type": "top_product", "key": "ID_V2L0_18"}, {"data_type": "top_product", "key": "ID_V2L0_16"}, {"data_type": "top_product", "key": "ID_V2L0_13"}, {"data_type": "top_product", "key": "ID_V2L0_34"}, {"data_type": "top_product", "key": "ID_V2L0_1493"}], "data": {"item": null, "keyword": null, "ads": null, "top_product": [{"info": "QUE:PTCPB,SLT:tspmicrosite_slot_00,TFS:tspmicrosite_slot_00_ID,SEC:tspmicrosite_sec_00,BND:top_sold_product_microsite,EPT:top_sold_product_microsite", "count": 2296973, "data_type": "top_product", "name": "Kuota Data Internet", "label": "ID_V2L0_65", "key": "ID_V2L0_65", "images": ["5c2b241a45c93374c154f0ef47feeb32", "8c650894988ea89258dc57604938ba9b", "b0b48c0e010c0626f9cdcecef7ba33d5"], "list": {"total": 40, "key": "ID_V2L0_65", "index": [{"data_type": "item_lite", "key": "item::122997341:2405999610"}, {"data_type": "item_lite", "key": "item::157202162:2813092958"}, {"data_type": "item_lite", "key": "item::172223406:7301485432"}, {"data_type": "item_lite", "key": "item::172223406:5801486070"}, {"data_type": "item_lite", "key": 
"item::57561999:1771712886"}, {"data_type": "item_lite", "key": "item::12216119:2020087641"}, {"data_type": "item_lite", "key": "item::172223406:5101486536"}, {"data_type": "item_lite", "key": "item::172223406:3810851792"}, {"data_type": "item_lite", "key": "item::6343942:61264134"}, {"data_type": "item_lite", "key": 
    ...
    ...
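
To get from that JSON to the fields the question asks about, here is a minimal sketch that walks only the keys visible in the sample output (`data` → `sections` → `data` → `top_product`, then `name`, `count`, `key`). The function name `extract_top_products` is made up for illustration, `SAMPLE` is a hand-trimmed copy of the output above, and reading `count` as the sales figure is an assumption; the response itself does not say what it counts.

```python
import json

# Hand-trimmed stand-in for the response shown above. Only keys that are
# visible in the sample output are kept.
SAMPLE = json.loads("""
{"data": {"sections": [{"total": 20, "key": "tspmicrosite_sec",
  "data": {"item": null, "keyword": null, "ads": null,
    "top_product": [
      {"count": 2296973, "data_type": "top_product",
       "name": "Kuota Data Internet", "key": "ID_V2L0_65",
       "list": {"total": 40, "key": "ID_V2L0_65"}}
    ]}}]}}
""")


def extract_top_products(payload):
    """Collect name/count/key from every top_product entry in the payload."""
    rows = []
    sections = (payload.get("data") or {}).get("sections") or []
    for section in sections:
        # Several values in the sample are null, so guard against None.
        section_data = section.get("data") or {}
        for product in section_data.get("top_product") or []:
            rows.append({
                "name": product.get("name"),
                "count": product.get("count"),  # assumed to be the sales figure
                "key": product.get("key"),
            })
    return rows


rows = extract_top_products(SAMPLE)
for row in rows:
    print(row["name"], row["count"], row["key"])
```

With the real response, `extract_top_products(res.json())` would be the corresponding call, provided the live payload still has the shape shown in the sample.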
    

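    【Comments】:

    • Where did you get "shopee.co.id/api/v4/recommend/…" from? I searched the Network area of the Inspect window but could not find it.
    • I have upvoted! It cannot be shown, though, because this is my first question and I don't have enough reputation. But thank you very much! I will accept the answer. I would appreciate it if you could explain why you chose the "recommend" section.
    • @LeonardLee I upvoted the question. Now you can vote.
    • Upvoted! But I am still confused about why you chose the "recommend" part in the Network tab.
    • That is the part where the website requests the information. Think of it as a backend API that is used only to serve data.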
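
The difference between the page URL and the backend API can be illustrated without touching the network. This is only a sketch: `html_body` and `api_body` are invented stand-ins for the two kinds of response, and it uses the standard-library `json` module rather than `response.json()`. Parsing an HTML document as JSON fails, while a JSON body from an API-style endpoint parses cleanly.

```python
import json


def parse_json_or_none(body):
    """Return the decoded JSON value, or None if the body is not JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None


# Invented stand-ins: an HTML page versus a JSON API response.
html_body = '<!DOCTYPE html><html><body><div id="main"></div></body></html>'
api_body = '{"data": {"sections": []}}'

page_result = parse_json_or_none(html_body)
api_result = parse_json_or_none(api_body)

print(page_result)
print(api_result)
```

A quick check of the `Content-Type` response header (for example via `response.headers.get("content-type")` in requests) is another way to tell which kind of body came back before trying to decode it.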