【Question title】: Scraping URLs from a JavaScript-loaded webpage
【Posted】: 2021-04-08 16:24:27
【Question】:

I am trying to scrape the href of every ad posted on IMMOWEB under this link. The URLs are loaded by JavaScript. I am using HTMLSession but cannot get the results I need. Here is my code:


from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = 'https://www.immoweb.be/en/search/apartment/for-sale?countries=BE&isNewlyBuilt=false&maxBedroomCount=3&maxPrice=200000&maxSurface=130&minBedroomCount=1&minPrice=100000&minSurface=65&postalCodes=2000,2018,2060,2140,2170,2600,2610,2627,2640,2650,2660,2845,2850,2900,2980&page=1&orderBy=newest&card=9267356'

sessions = HTMLSession()
r = sessions.get(url)
r.html.render()
soup = BeautifulSoup(r.content, "html.parser")
print(soup)

Desired output:

https://www.immoweb.be/en/classified/apartment/for-sale/antwerpen-merksem/2170/9268787?searchId=606f2c6d4c669  
https://www.immoweb.be/en/classified/apartment/for-sale/merksem/2170/9268390?searchId=606f2c6d4c669
'And other hrefs'

【Question discussion】:

    Tags: python python-3.x beautifulsoup python-requests


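    【Solution 1】:

    The URLs are built dynamically by JavaScript. However, you can load the ID of each property listing and construct the URL manually (following that URL redirects to the correct full URL):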

    import re
    import json
    import requests
    from html import unescape
    
    
    url = "https://www.immoweb.be/en/search/apartment/for-sale?countries=BE&isNewlyBuilt=false&maxBedroomCount=3&maxPrice=200000&maxSurface=130&minBedroomCount=1&minPrice=100000&minSurface=65&postalCodes=2000,2018,2060,2140,2170,2600,2610,2627,2640,2650,2660,2845,2850,2900,2980&page=1&orderBy=newest&card=9267356"
    
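    # Fetch the raw HTML; no JavaScript rendering is needed.
    html_doc = requests.get(url).text
    # The listings are embedded as HTML-escaped JSON in a :results='...' attribute.
    data = json.loads(unescape(re.search(r":results='(.*?)'", html_doc).group(1)))
    
    # uncomment to print all data:
    # print(json.dumps(data, indent=4))
    
    # Each entry carries the listing ID, which is enough to build a working link.
    for p in data:
        print("https://www.immoweb.be/en/classified/{}".format(p["id"]))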
    

    Prints:

    https://www.immoweb.be/en/classified/9268787
    https://www.immoweb.be/en/classified/9268390
    https://www.immoweb.be/en/classified/9268389
    https://www.immoweb.be/en/classified/9268360
    https://www.immoweb.be/en/classified/9267356
    https://www.immoweb.be/en/classified/9266168
    https://www.immoweb.be/en/classified/9264424
    https://www.immoweb.be/en/classified/9264140
    https://www.immoweb.be/en/classified/9264032
    https://www.immoweb.be/en/classified/9263981
    https://www.immoweb.be/en/classified/9263142
    https://www.immoweb.be/en/classified/9261903
    https://www.immoweb.be/en/classified/9261838
    https://www.immoweb.be/en/classified/9261546
    https://www.immoweb.be/en/classified/9261343
    https://www.immoweb.be/en/classified/9261328
    https://www.immoweb.be/en/classified/9261133
    https://www.immoweb.be/en/classified/9260764
    https://www.immoweb.be/en/classified/9260370
    https://www.immoweb.be/en/classified/9214008
    https://www.immoweb.be/en/classified/9259711
    https://www.immoweb.be/en/classified/9258900
    https://www.immoweb.be/en/classified/9258810
    https://www.immoweb.be/en/classified/9258199
    https://www.immoweb.be/en/classified/9258195
    https://www.immoweb.be/en/classified/9258183
    https://www.immoweb.be/en/classified/9258179
    https://www.immoweb.be/en/classified/9215058
    https://www.immoweb.be/en/classified/9256793
    https://www.immoweb.be/en/classified/9256422
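
The extraction step in the answer's code (regex, then `unescape`, then `json.loads`) can be tried offline. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML`, its tag name, and the helper names `extract_ids` and `build_urls` are hypothetical and only imitate the `:results='...'` attribute that the regex targets. The real page may be structured differently or may have changed since the answer was written.

```python
import json
import re
from html import unescape

# Hypothetical page fragment: only the :results='...' attribute mirrors the
# answer's regex; the tag name and the sample IDs are for illustration.
SAMPLE_HTML = (
    "<search-results "
    ":results='[{&quot;id&quot;:9268787},{&quot;id&quot;:9268390}]'"
    "></search-results>"
)


def extract_ids(html_doc):
    """Return the listing IDs found in the :results attribute, or [] if absent."""
    match = re.search(r":results='(.*?)'", html_doc, flags=re.S)
    if match is None:
        # Attribute not present (layout changed, request blocked, ...).
        return []
    # The attribute value is JSON with HTML entities, so unescape it first.
    listings = json.loads(unescape(match.group(1)))
    return [item["id"] for item in listings]


def build_urls(ids, template="https://www.immoweb.be/en/classified/{}"):
    """Build one short listing URL per ID."""
    return [template.format(listing_id) for listing_id in ids]


if __name__ == "__main__":
    for link in build_urls(extract_ids(SAMPLE_HTML)):
        print(link)
```

Unlike the one-liner, this version returns an empty list when the attribute is missing instead of raising `AttributeError` on `None`, which makes a changed page layout easier to diagnose.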
    

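The answer notes that following a short URL redirects to the full one, which is the form the question asks for. A minimal sketch of that step is below. It uses the standard library's `urllib.request` rather than `requests` so that it has no third-party dependency; the helper name `resolve_final_url`, the `opener` parameter, and the `User-Agent` value are assumptions made for illustration, and the site may reject automated requests.

```python
import urllib.request


def resolve_final_url(url, opener=urllib.request.urlopen, timeout=10):
    """Follow redirects starting at `url` and return the final address.

    `opener` is injectable (hypothetical convenience) so the function can be
    exercised without network access.
    """
    # Assumption: some sites refuse the default urllib User-Agent.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # urlopen follows HTTP redirects by default; geturl() reports where we ended up.
    with opener(request, timeout=timeout) as response:
        return response.geturl()
```

Calling `resolve_final_url("https://www.immoweb.be/en/classified/9268787")` should then return the long form of the link, provided the redirect described in the answer is still in place. With `requests`, the equivalent value is the `url` attribute of the response returned by `requests.get`.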
    【Discussion】:
