【Question title】:Scraping multiple pages on this site, help needed
【Posted】:2021-07-02 18:24:41
【Question description】:

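Hello, I would like to scrape multiple pages of this website. Can someone help me scrape all of the pages? So far I can only get information from a single page.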

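import re
from requests import get
from bs4 import BeautifulSoup as bs

headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

for i in range(2000):
  # The URL contains no '{}' placeholder, so .format(i) returns the same URL every time
  Centris = 'https://www.centris.ca/en/commercial-units~for-rent~montreal-ville-marie/26349148?view=Summary'.format(i)

# The request runs once, after the loop, so only a single page is fetched
r = get(Centris, headers=headers)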

soup = bs(r.text, 'html.parser')

results = soup.find_all('div', attrs={'id':'divMainResult'})

data = []
for result in results:
  
  titre = result.find('span', attrs={'data-id': 'PageTitle'})
  titre = [str(titre.string).strip() for titre in titre]

  superficie = result.find('div', attrs={'class': 'carac-value'}, string=re.compile('sqft'))
  superficie = [str(superficie.string).strip() for superficie in superficie]

  emplacement = result.find_all('h2', attrs={'class': 'pt-1'})
  emplacement = [str(emplacement.string).strip() for emplacement in emplacement]
 
 
  prix =  result.find_all('span', attrs={'class':'text-nowrap'})
  prix = [(prix.text).strip('\w.') for prix in prix]
  
  
  
  description = result.find_all('div', attrs={'itemprop': 'description'})
  description = [str(description.string).strip() for description in description]
  
  
  lien = result.find_all('a', attrs={'class': 'dropdown-item js-copy-clipboard'})

【Comments】:

Tags: python web web-scraping


【Solution 1】:

To get pagination working, you can simulate the site's Ajax request with the requests module:

import json
import requests
from bs4 import BeautifulSoup


url = "https://www.centris.ca/Property/GetInscriptions"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}
json_data = {"startPosition": 0}

with requests.session() as s:

    # load cookies:
    s.get(
        "https://www.centris.ca/en/commercial-units~for-rent?uc=0",
        headers=headers,
    )
    for page in range(0, 100, 20):  # <-- increase number of pages here
        json_data["startPosition"] = page

        data = s.post(url, headers=headers, json=json_data).json()
        soup = BeautifulSoup(data["d"]["Result"]["html"], "html.parser")

        for a in soup.select(".a-more-detail"):
            print(a.select_one(".category").get_text(strip=True))
            print(a.select_one(".address").get_text(strip=True, separator="\n"))
            print("https://www.centris.ca" + a["href"])
            print("-" * 80)

Prints:

Commercial unit for rent
6560, Avenue de l'Esplanade, suite 105
Montréal (Rosemont/La Petite-Patrie)
Neighbourhood La Petite-Patrie
https://www.centris.ca/en/commercial-units~for-rent~montreal-rosemont-la-petite-patrie/16168393?view=Summary
--------------------------------------------------------------------------------
Commercial unit for rent
75, Rue  Principale
Gatineau (Aylmer)
Neighbourhood Vieux Aylmer, Des Cèdres, Marina
https://www.centris.ca/en/commercial-units~for-rent~gatineau-aylmer/22414903?view=Summary
--------------------------------------------------------------------------------
Commercial building for rent
53, Rue  Saint-Pierre, suite D
Saint-Pie
https://www.centris.ca/en/commercial-buildings~for-rent~saint-pie/15771470?view=Summary
--------------------------------------------------------------------------------

...and so on.
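
For context on why the loop in the question only ever reached one page: `str.format` leaves a string without a `{}` placeholder unchanged, so every iteration built the same URL. A minimal sketch of that behaviour follows; the `example.invalid` template is a made-up URL used purely for illustration, not a real endpoint of the site.

```python
# The listing URL from the question contains no "{}" placeholder,
# so str.format(i) returns the same string on every iteration.
base = "https://www.centris.ca/en/commercial-units~for-rent~montreal-ville-marie/26349148?view=Summary"

urls = [base.format(i) for i in range(5)]
distinct = set(urls)
print(len(distinct))  # 1: every iteration produced the identical URL

# With a placeholder the value does change. The template below is
# hypothetical; on this site the answer pages through results with an
# Ajax POST ("startPosition") rather than through the URL.
template = "https://example.invalid/listings?page={}"
paged = [template.format(i) for i in range(3)]
print(paged)
```

This is why the answer above changes `startPosition` in the POST body instead of editing the page URL.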

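【Comments】:

  • Hi, I tried this, but when I write the results to Excel I get at most 19 rows, even after increasing the page count with `for page in range(128):`
  • @CatsTv Put the results in a list, and after the with statement create a DataFrame from that list. Then save it as an Excel file. The range has to be, for example, range(0, 1000, 20).
  • Thank you very much for your help.
【Solution 2】: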
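
The advice in the comments (step the start offset by 20 and gather every row in one list before exporting) can be sketched roughly as below. Note that `range(0, 100, 20)` yields only the offsets 0, 20, 40, 60 and 80. Here `collect_listings`, `fetch_page` and `fake_fetch` are hypothetical names introduced for illustration, and stopping on the first empty page is an assumption about how the endpoint behaves, not something the thread confirms.

```python
from typing import Callable, Dict, List


def collect_listings(
    fetch_page: Callable[[int], List[Dict[str, str]]],
    page_size: int = 20,
    max_results: int = 1000,
) -> List[Dict[str, str]]:
    """Collect rows from successive pages into one list.

    fetch_page(start) is a hypothetical callable that returns the rows
    found at offset `start` (for example, by posting
    {"startPosition": start} and parsing the returned HTML).
    """
    rows: List[Dict[str, str]] = []
    for start in range(0, max_results, page_size):
        page_rows = fetch_page(start)
        if not page_rows:  # assumption: an empty page means no more results
            break
        rows.extend(page_rows)
    return rows


# Offline stand-in for the real request: 45 fake listings in total.
FAKE = [{"Titre": f"Listing {n}"} for n in range(45)]


def fake_fetch(start: int) -> List[Dict[str, str]]:
    return FAKE[start:start + 20]


all_rows = collect_listings(fake_fetch)
print(len(all_rows))  # 45
```

Because the list is filled inside the loop and only read afterwards, the export step sees the rows from every page rather than just the last one.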

Thank you very much. I came up with this, and it works well:

import json
import requests
from bs4 import BeautifulSoup
import pandas as pd


url = "https://www.centris.ca/Property/GetInscriptions"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}
json_data = {"startPosition": 0}

with requests.session() as s:
    Centris = []
    # load cookies:
    s.get(
        "https://www.centris.ca/en/commercial-units~for-rent?uc=0",
        headers=headers,
    )
    for page in range(0, 100, 20):  # <-- increase number of pages here
        json_data["startPosition"] = page

        data = s.post(url, headers=headers, json=json_data).json()
        soup = BeautifulSoup(data["d"]["Result"]["html"], "html.parser")

        for a in soup.select(".a-more-detail"):
            titre = a.select_one(".category").get_text(strip=True)
            emplacement = a.select_one(".address").get_text(strip=True, separator="\n")
            lien = "https://www.centris.ca" + a["href"]
            prix = a.select_one(".price").get_text(strip=True)
            

            Centris.append((titre, emplacement, lien, prix))


# Column labels must be a list of names, not a dict built from the loop variables
df = pd.DataFrame(Centris, columns=["Titre", "Emplacement", "Lien", "Prix"])

# ExcelWriter.save() was removed in pandas 2.0; to_excel can write to a path directly
df.to_excel("Centris.xlsx")
print("Data saved to Excel")
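
Since `emplacement` is built with `separator="\n"`, each Excel cell ends up holding a multi-line address. If separate columns are preferred, one possible approach is to split that text before building the DataFrame. The sketch below assumes the line order seen in the printed output of Solution 1 (street, city, then an optional neighbourhood line); `split_address` and the `Rue`/`Ville`/`Quartier` keys are hypothetical names, not part of the original code.

```python
from typing import Dict


def split_address(emplacement: str) -> Dict[str, str]:
    """Split a newline-separated address into named parts.

    Assumes the order seen in the printed output: street, city,
    then an optional neighbourhood line.
    """
    parts = [p.strip() for p in emplacement.split("\n") if p.strip()]
    parts += [""] * (3 - len(parts))  # pad missing trailing fields
    return {
        "Rue": parts[0],
        "Ville": parts[1],
        "Quartier": " ".join(parts[2:]),
    }


full = split_address(
    "6560, Avenue de l'Esplanade, suite 105\n"
    "Montréal (Rosemont/La Petite-Patrie)\n"
    "Neighbourhood La Petite-Patrie"
)
short = split_address("53, Rue  Saint-Pierre, suite D\nSaint-Pie")
print(full["Ville"])
print(repr(short["Quartier"]))
```

The returned dict could then be merged into each row before it is appended to the `Centris` list.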

【Comments】:
