【Question title】: How to scrape multiple pages with an unchanging URL - Python & BeautifulSoup
【Posted】: 2017-08-15 04:08:45
【Question description】:

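I am trying to scrape this website: https://www.99acres.com

So far I have used BeautifulSoup to extract data from the site, but my code currently only gets the first page. I would like to know whether there is a way to reach the other pages: when I click through to the next page the URL does not change, so I cannot simply iterate over a different URL each time.

Here is my code so far: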

import csv
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.99acres.com/search/property/buy/residential-all/hyderabad?search_type=QS&search_location=CP1&lstAcn=CP_R&lstAcnId=1&src=CLUSTER&preference=S&selected_tab=1&city=269&res_com=R&property_type=R&isvoicesearch=N&keyword_suggest=hyderabad%3B&bedroom_num=3&fullSelectedSuggestions=hyderabad&strEntityMap=W3sidHlwZSI6ImNpdHkifSx7IjEiOlsiaHlkZXJhYmFkIiwiQ0lUWV8yNjksIFBSRUZFUkVOQ0VfUywgUkVTQ09NX1IiXX1d&texttypedtillsuggestion=hy&refine_results=Y&Refine_Localities=Refine%20Localities&action=%2Fdo%2Fquicksearch%2Fsearch&suggestion=CITY_269%2C%20PREFERENCE_S%2C%20RESCOM_R&searchform=1&price_min=null&price_max=null')
html = response.text
soup = BeautifulSoup(html, 'html.parser')
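rows = []

# Each search result sits in a div with class "srpWrap".
dealer = soup.find_all('div', {'class': 'srpWrap'})

for item in dealer:
    # Fall back to an empty string when a result lacks the expected element.
    try:
        p = item.contents[1].find_all("div", {"class": "_srpttl srpttl fwn wdthFix480 lf"})[0].text
    except (IndexError, AttributeError):
        p = ''
    try:
        d = item.contents[1].find_all("div", {"class": "lf f13 hm10 mb5"})[0].text
    except (IndexError, AttributeError):
        d = ''

    rows.append([p, d])

# newline='' keeps the csv module from writing blank lines on Windows.
# The with block closes the file, so no explicit close() is needed.
with open('project.txt', 'w', encoding="utf-8", newline='') as file:
    writer = csv.writer(file)
    writer.writerows(rows)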
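
A side note on the CSV step, since the two csv.writer methods are easy to mix up: writerow writes a single row, while writerows expects an iterable of rows. Passing one [title, description] pair to writerows treats each string as a row and each character as a field. The sketch below is only an illustration: the sample values are made up, and it writes to an in-memory buffer rather than project.txt.

```python
import csv
import io

# Made-up sample data standing in for the scraped [title, description] pairs.
rows = [["3 BHK Flat", "Gachibowli"], ["3 BHK Villa", "Kondapur"]]


def to_csv(write):
    """Run write(writer) against an in-memory file and return the CSV text."""
    buffer = io.StringIO(newline="")
    write(csv.writer(buffer))
    return buffer.getvalue()


# Intended output: one CSV line per scraped pair.
correct = to_csv(lambda w: w.writerows(rows))

# Common slip: writerows on a single pair splits each string into characters.
split_up = to_csv(lambda w: w.writerows(rows[0]))

print(correct)
print(split_up)
```

Calling writer.writerows(rows) once on the whole list, or writer.writerow(row) inside a loop, both give the intended one-line-per-listing output.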

【Discussion】:

    Tags: python python-3.x web-scraping beautifulsoup


    【Solution 1】:

    Try this. It will give you the property names from pages 1 through 3.

    import requests
    from bs4 import BeautifulSoup
    
    base_url = "https://www.99acres.com/3-bhk-property-in-hyderabad-ffid-page-{0}" 
    for url in [base_url.format(i) for i in range(1,4)]:
        response = requests.get(url)
        soup = BeautifulSoup(response.text,"html.parser")
        for title in soup.select("a[id^=desc_]"):
            print(title.text.strip())
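
If the number of pages is not known in advance, the same URL pattern can be walked until a page yields no titles. The following is a minimal sketch under that assumption. It reuses the a[id^=desc_] selector and the URL pattern from the snippet above, while the helper names extract_titles and scrape_until_empty, the max_pages cap, and the injected fetch callable are hypothetical and exist only to keep the parsing logic testable without network access.

```python
from bs4 import BeautifulSoup

# URL pattern taken from the answer above.
BASE_URL = "https://www.99acres.com/3-bhk-property-in-hyderabad-ffid-page-{0}"


def extract_titles(html):
    """Return the stripped text of every link whose id starts with 'desc_'."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.text.strip() for a in soup.select('a[id^="desc_"]')]


def scrape_until_empty(fetch, max_pages=100):
    """Call fetch(url) for page 1, 2, ... and stop at the first page without titles.

    fetch is any callable that takes a URL and returns the page HTML as a string.
    """
    titles = []
    for page in range(1, max_pages + 1):
        found = extract_titles(fetch(BASE_URL.format(page)))
        if not found:
            break
        titles.extend(found)
    return titles
```

With the requests library this could be called as scrape_until_empty(lambda url: requests.get(url).text). Whether the site actually returns an empty result page past the last one is an assumption that would need checking.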
    

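    【Discussion】:

      【Solution 2】:

      I have never used BeautifulSoup, but here is a general approach: take the JSON-formatted response from the AJAX request that the page makes when it loads results. Here is an example using curl: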

      curl 'https://www.99acres.com/do/quicksearch/getresults_ajax' -H 'pragma: no-cache' -H 'origin: https://www.99acres.com' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: en-US,en;q=0.8,de;q=0.6,da;q=0.4' -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36' -H 'content-type: application/x-www-form-urlencoded' -H 'accept: */*' -H 'cache-control: no-cache' -H 'authority: www.99acres.com' -H 'cookie: 99_ab=37; NEW_VISITOR=1; 99_FP_VISITOR_OFFSET=87; 99_suggestor=37; 99NRI=2; PROP_SOURCE=IP; src_city=-1; 99_citypage=-1; sl_prop=0; 99_defsrch=n; RES_COM=RES; kwp_last_action_id_type=2784981911907674%2CSEARCH%2C402278484965075610; 99_city=38; spd=%7B%22P%22%3A%7B%22a%22%3A%22R%22%2C%22b%22%3A%22S%22%2C%22c%22%3A%22R%22%2C%22d%22%3A%22269%22%2C%22j%22%3A%223%22%7D%7D; lsp=P; 99zedoParameters=%7B%22city%22%3A%22269%22%2C%22locality%22%3Anull%2C%22budgetBucket%22%3Anull%2C%22activity%22%3A%22SRP%22%2C%22rescom%22%3A%22RES%22%2C%22preference%22%3A%22BUY%22%2C%22nri%22%3A%22YES%22%7D; GOOGLE_SEARCH_ID=402278484965075610; _sess_id=1oFlv%2B%2FPAnDwWEEZiIGqNUTFrkARButJKqqEYu%2Fcv5WKMZCNYvpc89tievPnYatE28uBWbcd0PTpvCp9k3O20w%3D%3D; newRequirementsByUser=0' -H 'referer: https://www.99acres.com/3-bhk-property-in-hyderabad-ffid?orig_property_type=R&search_type=QS&search_location=CP1&pageid=QS' --data 'src=PAGING&static_search=1&nextbutton=Next%20%BB&page=2&button_next=2&lstAcnId=2784981911907674&encrypted_input=UiB8IFFTIHwgUyB8IzcjICB8IENQMSB8IzQjICB8IDMgIzE1I3wgIHwgMzExODQzMzMsMzExODM5NTUgfCAgfCAyNjkgfCM1IyAgfCBSICM0MCN8ICA%3D&lstAcn=SEARCH&sortby=&is_ajax=1' --compressed
      

      This way you can adjust the page parameter (page=2 in the request data above) to fetch other pages.
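
For readers who prefer Python over curl, here is a rough sketch of how the same POST request might be built with the standard library. The endpoint and field names are copied from the curl command; the function name build_paging_request is hypothetical. It is assumed that encrypted_input and lstAcnId are session-specific and have to be copied from a request captured in your own browser, and that button_next always equals page (it did in the captured request). The cookies and referer header are left out and may turn out to be required.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint copied from the curl command above.
AJAX_URL = "https://www.99acres.com/do/quicksearch/getresults_ajax"


def build_paging_request(page, encrypted_input, lst_acn_id):
    """Build (but do not send) the POST request for one results page."""
    fields = {
        "src": "PAGING",
        "static_search": "1",
        "nextbutton": "Next \u00bb",
        "page": str(page),
        # Assumption: button_next moves together with page.
        "button_next": str(page),
        "lstAcnId": lst_acn_id,
        "encrypted_input": encrypted_input,
        "lstAcn": "SEARCH",
        "sortby": "",
        "is_ajax": "1",
    }
    # The captured request encoded the "»" character as %BB, i.e. Latin-1.
    body = urlencode(fields, encoding="latin-1").encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0",
    }
    return Request(AJAX_URL, data=body, headers=headers, method="POST")


# Placeholder values; real ones come from a captured browser request.
request = build_paging_request(3, encrypted_input="PLACEHOLDER", lst_acn_id="0")
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

The request can then be sent with urllib.request.urlopen(request). The answer above describes the response as JSON; that has not been verified here.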

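      【Discussion】:

      • Sorry to bother you, but I don't understand what you are saying. Could you modify my code so that I can extract data from the next pages?
      【Solution 3】:

      Yes, the URL gets rewritten once you go to a subsequent page. The links do exist, though; here is the third page: https://www.99acres.com/3-bhk-property-in-hyderabad-ffid-page-3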

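      【Discussion】:

      • Yes, the site opens through that link; but when I run my code against it, it receives no data. It gives me a blank document.
      【Solution 4】:

      This is the modified code, which does not receive any data.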

      import time
      import csv
      import requests
      from bs4 import BeautifulSoup

      rows = []
      for i in range(1, 101):
          time.sleep(2)  # pause between requests
          url = "https://www.99acres.com/3-bhk-property-in-hyderabad-ffid-page-{0}".format(i)
          response = requests.get(url)
          html = response.text
          soup = BeautifulSoup(html, 'html.parser')

          # Each search result sits in a div with class "srpWrap".
          dealer = soup.find_all('div', {'class': 'srpWrap'})

          for item in dealer:
              # Fall back to an empty string when a result lacks the expected element.
              try:
                  p = item.contents[1].find_all("div", {"class": "_srpttl srpttl fwn wdthFix480 lf"})[0].text
              except (IndexError, AttributeError):
                  p = ''
              try:
                  d = item.contents[1].find_all("div", {"class": "lf f13 hm10 mb5"})[0].text
              except (IndexError, AttributeError):
                  d = ''

              rows.append([p, d])

      # Write once, after all pages are collected; newline='' avoids blank lines on Windows.
      with open('project.txt', 'w', encoding="utf-8", newline='') as file:
          writer = csv.writer(file)
          writer.writerows(rows)
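
The thread does not establish why this loop comes back empty. One way to narrow it down is to check, per page, whether the server answered at all and whether the HTML contains the div.srpWrap containers the code looks for. The helper below is a hypothetical diagnostic sketch, not a fix; the name diagnose and its messages are made up for illustration, and it is shown on inline HTML instead of live responses.

```python
from bs4 import BeautifulSoup


def diagnose(status_code, html):
    """Describe what a response contains, judged by the selectors used above."""
    if status_code != 200:
        return "HTTP {}: the server did not return the page".format(status_code)
    soup = BeautifulSoup(html, "html.parser")
    count = len(soup.find_all("div", {"class": "srpWrap"}))
    if count == 0:
        return "HTTP 200 but no div.srpWrap: the markup differs from what the code expects"
    return "found {} result containers".format(count)


# Inline examples; with requests, pass response.status_code and response.text.
print(diagnose(403, ""))
print(diagnose(200, "<html><body></body></html>"))
print(diagnose(200, '<div class="srpWrap"></div>'))
```

A non-200 status would point at the request itself (for example, the server rejecting requests that lack browser-like headers), while a 200 with no containers would suggest the results are rendered by script or the class names differ. Both are possibilities to check, not confirmed causes.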
      

      【Discussion】:
