【Question title】: How do you get selenium webdriver to return all HTML from website?
【Posted】: 2019-03-23 01:56:18
【Question】:

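I'm trying to scrape real estate listings from https://www.utahrealestate.com/search/map.search/page/1, but I can't get Selenium's webdriver to grab all of the HTML.

As far as I can tell, the site uses JavaScript functions to load the listings onto the map dynamically.

Instead of returning the HTML that contains the data I need under the tags, it returns something like this: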

<div id="results-listings">
<div style="height: 400px;"></div>
</div>
</div>
</div>
<!--right ad zone-->
<div class="advert-160-600 advert-right-zone" data-google-query-id="CKDYtP2Ol-ECFVAMswAd7vcDAg" id="div-gpt-ad-1533933823557-0" style="">
<div id="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Right-Side-160x600_0__container__" style="border: 0pt none; display: inline-block; width: 160px; height: 600px;"><iframe data-google-container-id="1" data-is-safeframe="true" data-load-complete="true" frameborder="0" height="600" id="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Right-Side-160x600_0" marginheight="0" marginwidth="0" name="" sandbox="allow-forms allow-pointer-lock allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-top-navigation-by-user-activation" scrolling="no" src="https://tpc.googlesyndication.com/safeframe/1-0-32/html/container.html" style="border: 0px; vertical-align: bottom;" title="3rd party ad content" width="160"></iframe></div></div>
<div id="map_notification"></div>
<div id="map_markers_container" style="display: none;"></div>
</div>
</div>
<div class="advert-728-90" data-google-query-id="CKHYtP2Ol-ECFVAMswAd7vcDAg" id="div-gpt-ad-1533933779531-0" style="margin-top: 15px">
<div id="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Center-Below-Map-728x90_0__container__" style="border: 0pt none;"><iframe data-google-container-id="2" data-load-complete="true" frameborder="0" height="90" id="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Center-Below-Map-728x90_0" marginheight="0" marginwidth="0" name="google_ads_iframe_/21730996110/UtahRealEstate/ListingResults/Center-Below-Map-728x90_0" scrolling="no" srcdoc="" style="border: 0px; vertical-align: bottom;" title="3rd party ad content" width="728"></iframe></div></div>
<div class="container" style="margin-top: 20px;">
<p style="margin: 20px 0 40px 0;">UtahRealEstate.com is Utah's favorite place to find a home. MLS Listings are provided by the Wasatch Front Regional Multiple Listing Service, Inc., which is powered by Utah's REALTORS®. UtahRealEstate.com offers you the most complete and current property information available. Browse our website to find an accurate list of homes for sale in Utah and homes for sale in Southeastern Idaho.</p>
<h5>Find Utah Homes for Sale by City</h5>
<div class="row">
<div class="col-sm-7 five-three">
<div class="row">
<div class="col-sm-4">
<b><a href="/davis-county-homes">Davis County</a></b>
<ul>
<li><a href="/bountiful-homes">Bountiful</a></li>
<li><a href="/clearfield-homes">Clearfield</a></li>
<li><a href="/clinton-homes">Clinton</a></li>
<li><a href="/layton-homes">Layton</a></li>
<li><a href="/kaysville-homes">Kaysville</a></li>
<li><a href="/north-salt-lake-homes">North Salt Lake</a></li>
<li><a href="/south-weber-homes">South Weber</a></li>
<li><a href="/syracuse-homes">Syracuse</a></li>
<li><a href="/woods-cross-homes">Woods Cross</a></li>

My current code looks like this:

from selenium import webdriver
from bs4 import BeautifulSoup as soup

utahRealEstate = 'https://www.utahrealestate.com/search/map.search/page/1'
browser = webdriver.Chrome()
page = browser.get(utahRealEstate)
innerHTML = browser.execute_script("return document.body.innerHTML")

page_soup = soup(innerHTML)
page_soup

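I'm really after the information contained in the "listings-info-left-col" and "listings-info-right-col" classes.

I'm pretty new to this, so please keep your explanations simple. Thanks for your help!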

【Question comments】:

    Tags: javascript selenium-webdriver web-scraping


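    【Solution 1】:

    The code below computes the pagination details from the page itself (so it keeps working if the result count or page size changes) and loops over every available results page. For each listing it extracts the price, the property address, and the property details into a list of lists, which is then flattened, converted to a DataFrame, and written to CSV. Regular expressions tidy up the extracted text, and an explicit wait condition ensures the listings are present before the page is parsed.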

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import re
    import math
    from bs4 import BeautifulSoup as bs
    import pandas as pd
    
    def getInfo(html): #function to return price and other listing info for the current page. Accepts the page source html as parameter
        soup = bs(html, 'lxml')
        items = soup.select('.inline_info')
        rowsToReturn = []
        for item in items:
            data = item.select('.list-info-content') #list containing address info and property details e.g. baths, beds
            price = item.select_one('h3').text.strip()
            address = re.sub(r'\s\s+', ' ',  data[0].text.strip()) #replace 2+ whitespace characters with a single space (raw string avoids an invalid-escape warning)
            propertyInfo = re.sub(r'\s\s+', ' ',  data[1].text.strip())
            rowToReturn = [price, address, propertyInfo]
            rowsToReturn.append(rowToReturn)
        return rowsToReturn
    
    url = 'https://www.utahrealestate.com/search/map.search/page/1' #landing page
    driver = webdriver.Chrome()
    driver.get(url)
    WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".list-info-content"))) #wait for all listings content
    
    reg = re.compile(r'(\d+)') #regex pattern looking for 1 or more numbers to be applied to class view-results which has the pagination and total results info
    matches = reg.findall(driver.find_element(By.CSS_SELECTOR, '.view-results').text) # e.g. ['1', '50', '500'] from "1 to 50 of 500"; find_element_by_css_selector was removed in Selenium 4
    numResults = int(matches[2])
    resultsPerPage = int(matches[1])
    numPages = math.ceil(numResults/resultsPerPage)
    
    results = []
    results.append(getInfo(driver.page_source)) #add page one results
    
    if numPages > 1: 
        for page in range(2, numPages + 1): #loop calculated number of pages 
            driver.get('https://www.utahrealestate.com/search/map.search/page/{}'.format(page)) #add new page number into url
            WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".list-info-content"))) #wait for all listings content
            results.append(getInfo(driver.page_source)) #add next page results
    
    #flatten list of lists
    finalList = [item for sublist in results for item in sublist]
    
    df = pd.DataFrame(finalList, columns = ['price', 'address', 'property details']) #convert to dataframe and write to csv
    df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )
    driver.quit()
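
The pagination arithmetic in the code above can be tried without a browser. The sketch below is only an illustration: the helper name `count_pages` is hypothetical, and the summary format "1 to 50 of 500" is assumed from the comment in the answer, not verified against the live site. It derives the page size as `last - first + 1`, which on page one equals the second number that the answer uses, and it strips thousands separators in case the total is written like "1,234".

```python
import math
import re


def count_pages(summary_text):
    """Return (results_per_page, total_results, page_count) for text like '1 to 50 of 500'."""
    # Drop thousands separators so '1,234' is read as one number (assumed format).
    numbers = [int(n) for n in re.findall(r"\d+", summary_text.replace(",", ""))]
    if len(numbers) < 3:
        raise ValueError(f"unexpected pagination text: {summary_text!r}")
    first, last, total = numbers[:3]
    per_page = last - first + 1
    # Round up so a partly filled last page is still counted.
    return per_page, total, math.ceil(total / per_page)


print(count_pages("1 to 50 of 500"))  # (50, 500, 10)
print(count_pages("1 to 50 of 512"))  # (50, 512, 11)
```

Rounding up with `math.ceil` is what makes the loop visit the final page when the total is not an exact multiple of the page size.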
    


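    【Comments】:

    • Thanks QHarr. Looking at my code, what was the main thing I was missing to collect all of the information? Was it the WebDriverWait statement?
    • Hi. Yes, that is definitely a key part. Is there anything else above that you would like explained? I tried to comment throughout the code.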
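
To make the role of the wait concrete, here is a rough stdlib-only sketch of the polling idea behind an explicit wait: call a condition repeatedly until it returns something truthy, and give up after a timeout. This is not Selenium's implementation; `wait_until` and `FakePage` are hypothetical names used purely for illustration, with `FakePage` standing in for a page whose listings only appear after JavaScript has run.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose listings show up only after a few polls."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def find_listings(self):
        self.polls += 1
        return ["listing"] if self.polls >= self.ready_after else []


page = FakePage(ready_after=3)
listings = wait_until(page.find_listings, timeout=2.0, poll_interval=0.01)
print(listings, page.polls)  # ['listing'] 3
```

Reading the HTML immediately after `get()` corresponds to accepting the result of the very first poll, which is why the original script saw an empty listings container.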
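    【Solution 2】:

    This code starts at the first page, parses it for the details, then loads the remaining pages one at a time and parses each of them until no pages are left. You can refine it to suit your exact needs.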

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup
    import time
    from selenium.common.exceptions import NoSuchElementException
    
    utahRealEstate = 'https://www.utahrealestate.com/search/map.search/page/1'
    browser = webdriver.Chrome()
    browser.get(utahRealEstate)  # get() returns None, so there is nothing to assign
    
    
    # Parse the page and print the text of every listing block.
    def parse(html):
        soup = BeautifulSoup(html, 'html.parser')
        for i in soup.find_all('div', {'class': 'listings-info'}):
            print(i.get_text())
    
    
    while True:
        try:
            # Give the JavaScript time to render the listings, then parse the current page.
            time.sleep(3)
            parse(browser.page_source)
            # Find the next page button and click it (find_element_by_xpath was removed in Selenium 4).
            browser.find_element(By.XPATH, "//a[text()='Next ']").click()
        except NoSuchElementException:
            # No next page button was found, so this must be the last page.
            break
    
    browser.quit()
    

    Output:

    $615,000
    3217 W 10305 S
    South Jordan, UT 84095
    
    
    5Beds
    5Baths
    4002Sq.Ft.
    #1588082
    
    Domain Real Estate LLC
    ...
    

    【Comments】:

    • Thanks Dan-Dev. The code works great, but I still can't understand why your code works while mine returns nothing when I run browser = webdriver.Chrome() page = browser.get(utahRealEstate) soup = BeautifulSoup(browser.page_source, 'html.parser') soup.find_all('div', {'class': 'listings-info'}). What is the main disconnect?
    • Adding time.sleep(3) before soup = BeautifulSoup(browser.page_source, 'html.parser') will do it. You need to wait for the page to load.
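
A fixed sleep works, but it is either longer than necessary or too short on a slow connection. As an alternative, the sketch below applies the explicit wait from Solution 1 to the short snippet in the comment. It assumes Selenium 4 with a working Chrome driver, and it assumes that the `.listings-info` class used in Solution 2 marks the listing blocks. The helper name `get_rendered_html` and its parameters are hypothetical.

```python
def get_rendered_html(url, css_selector=".listings-info", timeout=10):
    """Open url in Chrome and return the page source once css_selector is present."""
    # Imported inside the function so that defining this helper does not
    # itself require Selenium to be installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # Blocks until at least one matching element exists in the DOM,
        # or raises TimeoutException after `timeout` seconds.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even if the wait times out
```

The returned string can then be passed to BeautifulSoup(html, 'html.parser') exactly as in the snippet above, in place of browser.page_source.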