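【Question title】: Python BeautifulSoup is not returning all html tags
【Posted】: 2020-04-01 11:30:37
【Question description】:

I am using code that extracts real-estate data from a website. The code works, but it only extracts data for 30 containers, while more than 3,000 are available. I found that Beautiful Soup is not getting all of the HTML tags.

My code: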

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&Locality=OMR-Road&cityName=Chennai",
                 headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
bs = BeautifulSoup(c,"html5lib")
# print(bs.prettify())
soup = bs.findAll("div", {"class": "flex relative clearfix m-srp-card__container"})
print(len(soup))

【Comments】:

  • It's hard to say, because I can't access that URL ("Access Denied"). Can you provide another URL?

Tags: python beautifulsoup python-requests


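【Solution 1】:

This happens because the website uses JavaScript to load more items as you scroll down. requests only downloads the initial HTML and does not run JavaScript, so only the first batch of cards is present in the response.

The first approach is to loop through the pages: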

import requests
from bs4 import BeautifulSoup

# Keep the URL template separate so it can be formatted again for every page
base_url = 'https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&Locality=OMR-Road&cityName=Chennai&page={}'
try:
    divs = []
    for page_num in range(1, 100):
        print(f'Getting page {page_num}')
        # Format into a new variable; overwriting the template would drop
        # the {} placeholder and request page 1 on every iteration
        url = base_url.format(page_num)
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(response.content, "html.parser")
        divs.extend(soup.find_all("div", {"class": "flex relative clearfix m-srp-card__container"}))
    print(len(divs))
except Exception as e:
    print(e)
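
The loop above always requests 99 pages, even if the listing ends earlier. Below is a minimal sketch of a stop condition, assuming that a page past the end returns either no cards or the same cards as the previous page (this has not been verified against this site). `collect_pages` and `fetch_page` are hypothetical names; `fetch_page` stands in for the `requests.get` plus `BeautifulSoup` step and is replaced here by a small fake so the sketch runs without network access.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], List[str]], max_pages: int = 100) -> List[str]:
    """Collect items page by page until a page is empty or repeats the previous one.

    fetch_page is a hypothetical callable: given a 1-based page number it
    returns the list of items (for example card HTML strings) on that page.
    """
    items: List[str] = []
    previous: List[str] = []
    for page_num in range(1, max_pages + 1):
        page_items = fetch_page(page_num)
        # Assumption: an empty or repeated page means the listing has ended.
        if not page_items or page_items == previous:
            break
        items.extend(page_items)
        previous = page_items
    return items


# Stand-in for the real fetch: three pages of fake cards, then nothing.
fake_site = {1: ["card-a", "card-b"], 2: ["card-c"], 3: ["card-d"]}
cards = collect_pages(lambda n: fake_site.get(n, []))
print(len(cards))
```

In real use, `fetch_page` would wrap the request and the `find_all` call from the loop above and return the matched cards for that page.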

The second approach is to use Selenium to simulate a browser scrolling down:

from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By


def ScrollDown(driver, interval=5, looper=5000):
    """Scroll to the bottom repeatedly until the page height stops growing."""
    count = 0

    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while count < looper:
        print('Scrolling down to bottom loop {}/{}'.format(count + 1, looper))
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for the page to load
        print('sleeping {} secs'.format(interval))
        sleep(interval)

        # Compare the new scroll height with the previous one
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break

        last_height = new_height
        count += 1


url = 'https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&Locality=OMR-Road&cityName=Chennai'


driver = webdriver.Chrome()
try:
    driver.get(url)
    ScrollDown(driver)
    # find_elements_by_css_selector was removed in Selenium 4;
    # find_elements with By.CSS_SELECTOR is the supported form
    divs = driver.find_elements(By.CSS_SELECTOR, 'div[class="flex relative clearfix m-srp-card__container"]')
    print(len(divs))
except Exception as e:
    print(e)
finally:
    # quit() closes every window and ends the driver session
    driver.quit()
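
The scroll loop above stops when the page height stops changing. A variation, sketched here under assumptions, is to stop when the number of matching cards stops growing, which may be more robust if the page height changes for reasons other than new cards. `scroll_until_count_stable` is a hypothetical helper, not part of Selenium; it only relies on the driver's `execute_script(script, *args)` method, and the `sleep` parameter is injectable so the function can be exercised without a browser.

```python
import time


def scroll_until_count_stable(driver, selector, pause=2.0, max_rounds=50, sleep=time.sleep):
    """Scroll to the bottom until the number of elements matching selector stops growing.

    driver only needs an execute_script(script, *args) method, which Selenium
    WebDriver objects provide. Returns the final element count.
    """
    # Extra arguments to execute_script are exposed to the script as arguments[i]
    count_script = "return document.querySelectorAll(arguments[0]).length"
    last_count = driver.execute_script(count_script, selector)
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the page time to append new cards
        new_count = driver.execute_script(count_script, selector)
        if new_count == last_count:
            break  # nothing new appeared after this scroll
        last_count = new_count
    return last_count
```

With a real driver this could be called as `scroll_until_count_stable(driver, "div.m-srp-card__container")` after `driver.get(url)`; the pause length is a guess and would need tuning for the site.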

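【Comments】:

【Solution 2】:

The page is loaded dynamically via JavaScript, so I tracked down the XHR request that renders the data. You can call it directly. Below is an example that fetches the first 10 pages.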

    import requests
    from bs4 import BeautifulSoup
    
    params = {
        'propertyType_new': '10002_10003_10021_10022,10001_10017,10000',
        'localityNameSEO': 'Old Mahabalipuram Road',
        'postedSince': '1',
        'localityName': 'OMR Road',
        'city': '5196',
        'searchType': '1',
        'propertyType': '10002,10003,10021,10022,10001,10017,10000',
        'disWeb': 'Y',
        'pType': '10002,10003,10021,10022,10001,10017,10000',
        'category': '5',
        'localityId': '89568',
        'cusImgCount': '0',
        'groupstart': '28',
        'maxOffset': '107',
        'attractiveIds': '',
        'ltrIds': '47881083,47881047',
        'preCompiledProp': '',
        'excludePropIds': '',
        'addpropertyDataSet': ''
    }
    
    
    def main(url):
        with requests.Session() as req:
            for item in range(1, 11):
                params['page'] = item
                r = req.get(url, params=params)
                soup = BeautifulSoup(r.content, 'html.parser')
                # now parse what you want
    
    
    main("https://www.magicbricks.com/mbsearch/propertySearch.html")
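
To see what each request in the loop above asks for, the query string can be built with the standard library. This is only an illustration of how a `params` dict plus a page number becomes a URL. `build_page_url` is a hypothetical helper, the parameter set is a shortened subset of the one above, and it assumes (as the answer's code does) that the endpoint selects the page through a `page` parameter.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.magicbricks.com/mbsearch/propertySearch.html"


def build_page_url(base_url, params, page):
    """Return base_url with params plus the given page number as a query string."""
    query = dict(params)  # copy so the caller's dict is not modified
    query["page"] = page
    return base_url + "?" + urlencode(query)


# Shortened subset of the parameters used in the answer above.
sample_params = {"city": "5196", "localityName": "OMR Road", "category": "5"}

for page in range(1, 4):
    print(build_page_url(SEARCH_URL, sample_params, page))
```

Printing the URLs this way makes it easy to compare them with the XHR request shown in the browser's network tab before sending anything.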
    

【Comments】:
