【Question title】: Is there a way to scrape ad URLs from SeLoger?
【Posted】: 2019-12-11 05:42:05
【Question description】:

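I'm trying to scrape the French website SeLoger. I can find and scrape all the ads and put them into JSON. The problem is that I can't find the final URL of each ad this way. The URL is in a div called "cartouche", with the class c-pa-link link_AB.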


import requests
from bs4 import BeautifulSoup
import json


url = 'https://www.seloger.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&idtt=2,5&naturebien=1,2,4&ci=440109'
headers = {
    'User-Agent': '*',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
    }


s = requests.Session()
s.headers.update(headers)

r = s.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

for script_item in soup.find_all('script'):
    if 'var ava_data' in script_item.text:
        raw_json = script_item.text.split('var ava_data = ')[1].split('};')[0] + "}"


data = json.loads(raw_json)

print(data)

I'd like to add a field to the JSON like this:


{
            "url":"https://www.seloger.com/annonces/achat/appartement/nantes-44/centre-ville/144279775.htm?enterprise=0&natures=1,4&places=%5b%7bci%3a440109%7d%5d&projects=2,5&qsversion=1.0&types=1,2&bd=ListToDetail",
            "idannonce": "149546457",
            "idagence": "294918",
            "idtiers": "323172",
            "typedebien": "Appartement",
            "typedetransaction": [
                "viager"
            ],
            "idtypepublicationsourcecouplage": "SL",
            "position": "2",
            "codepostal": "44100",
            "ville": "Nantes",
            "departement": "Loire-Atlantique",
            "codeinsee": "440109",
            "produitsvisibilite": "AD:AC:BX:AW",
            "affichagetype": [
                {
                    "name": "liste",
                    "value": "True"
                }
            ],
            "cp": "44100",
            "etage": "0",
            "idtypechauffage": "0",
            "idtypecommerce": "0",
            "idtypecuisine": "séparée équipée",
            "naturebien": "1",
            "si_balcon": "1",
            "nb_chambres": "1",
            "nb_pieces": "2",
            "si_sdbain": "0",
            "si_sdEau": "0",
            "nb_photos": "15",
            "prix": "32180",
            "surface": "41"
        }

Thanks for your help.

【Question discussion】:

    Tags: python json web-scraping beautifulsoup


    【Solution 1】:

    You can use the zip() function to "tie" the products from the JSON data to the URLs in the web page:

    import requests
    from bs4 import BeautifulSoup
    import json
    
    url = 'https://www.seloger.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&idtt=2,5&naturebien=1,2,4&ci=440109'
    headers = {
        'User-Agent': '*',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
        }
    
    s = requests.Session()
    s.headers.update(headers)
    
    r = s.get(url)
    
    soup = BeautifulSoup(r.text, 'html.parser')
    
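    # Find the inline script that defines `ava_data` and cut out the object literal.
    raw_json = None
    for script_item in soup.find_all('script'):
        if 'var ava_data' in script_item.text:
            raw_json = script_item.text.split('var ava_data = ')[1].split('};')[0] + "}"
            break
    
    if raw_json is None:
        raise RuntimeError("'var ava_data' was not found in the page")
    
    data = json.loads(raw_json)
    
    # Pair each ad link with the product at the same position in the list.
    for a, p in zip(soup.select('.c-pa-info > a'), data['products']):
        p['url'] = a['href']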
    
    print(json.dumps(data, indent=4))
    

    Prints:

    ...
    
    {
        "idannonce": "139994713",
        "idagence": "48074",
        "idtiers": "24082",
        "typedebien": "Appartement",
        "typedetransaction": [
            "vente"
        ],
        "idtypepublicationsourcecouplage": "SL9",
        "position": "16",
        "codepostal": "44000",
        "ville": "Nantes",
        "departement": "Loire-Atlantique",
        "codeinsee": "440109",
        "produitsvisibilite": "AM:AC:BB:BX:AW",
        "affichagetype": [
            {
                "name": "liste",
                "value": true
            }
        ],
        "cp": "44000",
        "etage": "0",
        "idtypechauffage": "0",
        "idtypecommerce": "0",
        "idtypecuisine": "0",
        "naturebien": "2",
        "si_balcon": "0",
        "nb_chambres": "0",
        "nb_pieces": "3",
        "si_sdbain": "0",
        "si_sdEau": "0",
        "nb_photos": "4",
        "prix": "147900",
        "surface": "63",
        "url": "https://www.selogerneuf.com/annonces/achat/appartement/nantes-44/139994713/#?cmp=INTSL_ListToDetail"
    },
    {
        "idannonce": "146486955",
        "idagence": "334754",
    
    ...
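
The split('};') step in the code above cuts the script at the first "};" it finds, which could truncate the data if that sequence ever appears inside a string value. A minimal alternative sketch, assuming the value assigned to ava_data is strict JSON (the json.loads call above already relies on that): json.JSONDecoder().raw_decode parses one JSON value from a given offset and ignores whatever follows it. The helper name extract_js_object and the sample script text are made up for illustration.

```python
import json


def extract_js_object(script_text, marker='var ava_data = '):
    """Return the JSON object assigned right after `marker` in a script body.

    Hypothetical helper, not part of the original answer.
    Raises ValueError if the marker is missing or the value is not valid JSON.
    """
    start = script_text.index(marker) + len(marker)
    # raw_decode parses a single JSON value beginning at `start` and stops
    # there, so the trailing ';' and the rest of the script are ignored.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Made-up script body; note the '};' hidden inside a string value.
sample = 'var ava_data = {"products": [{"idannonce": "1", "note": "a};b"}]}; var x = 1;'
data = extract_js_object(sample)
print(data['products'][0]['idannonce'])
```

In the scraper this would replace the split() line, for example data = extract_js_object(script_item.text) inside the loop.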
    

    Note: some URLs have a different structure from

    https://www.seloger.com/annonces/achat/appartement/nantes-44/centre-ville/{idannonce}.htm?ci=440109&enterprise=0&idtt=2,5&idtypebien=2,1&naturebien=1,2,4&tri=initial&bd=ListToDetail
    

    for example

    https://www.selogerneuf.com/annonces/investissement/appartement/nantes-44/146486955/#?cmp=INTSL_ListToDetail
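
Since zip() pairs items purely by position, it silently misaligns if the page ever contains a link without a matching product (or the reverse). Both URL shapes shown above contain the idannonce, so a possible alternative is to match on that ID instead. The sketch below is only an illustration under that assumption: the function attach_urls, the regular expression, and the sample data are hypothetical, and the hrefs would come from something like [a['href'] for a in soup.select('.c-pa-info > a')].

```python
import re

# Assumption: the listing ID is a path segment of 6+ digits followed by
# '.htm' or '/', which holds for the two URL shapes shown above.
ID_RE = re.compile(r'/(\d{6,})(?:\.htm|/)')


def attach_urls(products, hrefs):
    """Set p['url'] on every product whose idannonce appears in some href."""
    by_id = {}
    for href in hrefs:
        # Drop the query string and fragment before looking for the ID.
        path = href.split('?', 1)[0].split('#', 1)[0]
        match = ID_RE.search(path)
        if match:
            by_id.setdefault(match.group(1), href)
    for p in products:
        url = by_id.get(p['idannonce'])
        if url is not None:
            p['url'] = url
    return products


# Made-up sample data using the two URL shapes from the note above.
products = [
    {'idannonce': '146486955'},
    {'idannonce': '144279775'},
    {'idannonce': '999999999'},  # no matching link: left without a 'url'
]
hrefs = [
    'https://www.seloger.com/annonces/achat/appartement/nantes-44/centre-ville/144279775.htm?bd=ListToDetail',
    'https://www.selogerneuf.com/annonces/investissement/appartement/nantes-44/146486955/#?cmp=INTSL_ListToDetail',
]
attach_urls(products, hrefs)
print(products)
```

Products with no matching link are left untouched, which makes gaps visible instead of shifting every later URL by one position.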
    

    【Discussion】:

    • Thank you, Andrej, I'll work on it when I get home :)