Python BeautifulSoup is not returning all HTML tags: answers

【Question title】: Python BeautifulSoup is not returning all html tags
【Posted】: 2020-04-01 11:30:37
【Question】:

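I am using code to extract real-estate data from a website. The code runs fine, but it only extracts data for 30 containers, while more than 3,000 are available. I have since found out that BeautifulSoup is not getting all of the HTML tags.

My code: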

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&Locality=OMR-Road&cityName=Chennai",
                 headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
bs = BeautifulSoup(c,"html5lib")
# print(bs.prettify())
soup = bs.findAll("div", {"class": "flex relative clearfix m-srp-card__container"})
print(len(soup))

【Comments】:

It's hard to say, because I can't access that URL ("Access Denied"). Can you provide another URL?

Tags: python beautifulsoup python-requests

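【Solution 1】:

This happens because the website uses JavaScript to load more items as you scroll down, so the initial HTML response only contains the first batch.

The first approach is to loop through the pages: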

import requests
from bs4 import BeautifulSoup

# URL template; the trailing {} is filled in with the page number on each iteration
url_template = 'https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&Locality=OMR-Road&cityName=Chennai&page={}'
divs = []
try:
    for page_num in range(1, 100):
        print(f'Getting page {page_num}')
        # Format into a separate variable so the template keeps its {} placeholder;
        # overwriting the template would request page 1 on every iteration
        page_url = url_template.format(page_num)
        response = requests.get(page_url, headers={'User-agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(response.content, "html.parser")
        divs.extend(soup.find_all("div", {"class": "flex relative clearfix m-srp-card__container"}))
    print(len(divs))
except Exception as e:
    print(e)
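
The loop above always requests a fixed number of pages. As a minimal sketch, the same idea can stop as soon as a page comes back with no cards. The names `collect_all_pages` and `fetch_cards` are hypothetical helpers introduced here for illustration, and it is an assumption that the site returns a page without cards once the results run out, so the `max_pages` cap is kept as a safety net:

```python
from typing import Callable, List

URL_TEMPLATE = (
    'https://www.magicbricks.com/property-for-sale/residential-real-estate'
    '?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,'
    'Studio-Apartment,Residential-House,Villa,Residential-Plot'
    '&Locality=OMR-Road&cityName=Chennai&page={}'
)


def collect_all_pages(fetch_page: Callable[[int], List], max_pages: int = 100) -> List:
    """Call fetch_page(1), fetch_page(2), ... and stop at the first empty page."""
    results = []
    for page_num in range(1, max_pages + 1):
        items = fetch_page(page_num)
        if not items:
            # Assumption: an empty page means there are no more results
            break
        results.extend(items)
    return results


def fetch_cards(page_num: int) -> List:
    """Hypothetical fetcher: download one page and return its card divs."""
    # Imported here so collect_all_pages can be reused without these packages
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(
        URL_TEMPLATE.format(page_num),
        headers={'User-agent': 'Mozilla/5.0'},
        timeout=30,
    )
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.find_all("div", {"class": "flex relative clearfix m-srp-card__container"})
```

Calling `collect_all_pages(fetch_cards)` would then gather the cards from every page until an empty one is reached or `max_pages` is hit.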

The second approach is to use selenium to simulate a browser scrolling down:

from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By


def ScrollDown(driver, interval=5, looper=5000):
    """Scroll to the bottom repeatedly until the page height stops growing."""
    count = 0

    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while count < looper:
        print('Scrolling down to bottom loop {}/{}'.format(count+1, looper))
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for the page to load
        print('sleeping {} secs'.format(interval))
        sleep(interval)

        # Calculate the new scroll height and compare it with the last one
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break

        last_height = new_height
        count += 1


url = 'https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&Locality=OMR-Road&cityName=Chennai'


driver = webdriver.Chrome()
try:
    driver.get(url)
    ScrollDown(driver)
    # find_elements_by_css_selector was removed in Selenium 4.3;
    # find_elements(By.CSS_SELECTOR, ...) is the current equivalent
    divs = driver.find_elements(By.CSS_SELECTOR, 'div[class="flex relative clearfix m-srp-card__container"]')
    print(len(divs))
except Exception as e:
    print(e)
finally:
    # quit() ends the whole browser session, not just the current window
    driver.quit()

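【Comments】:

【Solution 2】:

The page is loaded dynamically via JavaScript, so I tracked down the XHR request that delivers the rendered data. You can call that endpoint directly. Below is an example that fetches the first 10 pages.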

import requests
from bs4 import BeautifulSoup

params = {
    'propertyType_new': '10002_10003_10021_10022,10001_10017,10000',
    'localityNameSEO': 'Old Mahabalipuram Road',
    'postedSince': '1',
    'localityName': 'OMR Road',
    'city': '5196',
    'searchType': '1',
    'propertyType': '10002,10003,10021,10022,10001,10017,10000',
    'disWeb': 'Y',
    'pType': '10002,10003,10021,10022,10001,10017,10000',
    'category': '5',
    'localityId': '89568',
    'cusImgCount': '0',
    'groupstart': '28',
    'maxOffset': '107',
    'attractiveIds': '',
    'ltrIds': '47881083,47881047',
    'preCompiledProp': '',
    'excludePropIds': '',
    'addpropertyDataSet': ''
}


def main(url):
    with requests.Session() as req:
        for item in range(1, 11):
            params['page'] = item
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.content, 'html.parser')
            # now parse what you want


main("https://www.magicbricks.com/mbsearch/propertySearch.html")
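
To compare what this code sends with the XHR request shown in the browser's network tab, it can help to build the query string by hand. The following is a minimal sketch using only the standard library. The helper `build_page_url` is a hypothetical name, only two of the parameters above are repeated for brevity, and the encoding `requests` produces may differ in small details:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.magicbricks.com/mbsearch/propertySearch.html"

# A shortened copy of the parameters from the answer, for illustration only
base_params = {
    'localityName': 'OMR Road',
    'city': '5196',
}


def build_page_url(base_url, params, page):
    """Return the URL for one page without modifying the shared params dict."""
    page_params = dict(params, page=page)  # copy, then add the page number
    return base_url + "?" + urlencode(page_params)


print(build_page_url(BASE_URL, base_params, 2))
# → https://www.magicbricks.com/mbsearch/propertySearch.html?localityName=OMR+Road&city=5196&page=2
```

Copying the dictionary for each page, rather than assigning `params['page']` on the shared one, keeps the base parameters unchanged between requests.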

【Comments】: