【Posted】: 2021-01-05 16:18:54
【Question】:
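Perhaps this or a similar question has been answered before, and a pointer to it may be enough to set me in the right direction. The site loads 24 listings at a time and has a "See More Results" button that loads the next 24 while keeping the earlier ones on the page, until you reach 96 listings; after that it only ever keeps 96 in total. Every time I scrape it with Beautiful Soup, though, I get only the first 24. None of my attempts with Selenium have produced any results. I plan to go through the documentation tomorrow to work out why those attempts failed, and I may add more to this question or come up with something myself. My gut says Beautiful Soup is the way to go; the alternative is to suck it up, copy and paste 96 listings at a time, and process them with regex and/or pandas (shrug emoji).
I am trying to scrape MLS listings and have had some luck. The page loads 24 at a time and keeps the earlier listings around for a while. With Beautiful Soup I can extract the sale price from the outer HTML: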
import requests
from bs4 import BeautifulSoup

url = "https://matrix.heartlandmls.com/Matrix/Public/Portal.aspx?k=990316X949Z&p=DE-77667588-490"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# this extracts the sale price of every listing in the response
soup.find_all("span", {"class":"d-fontSize--largest"})
# returns a list of elements like: <span class="d-fontSize--largest">$50,000</span>
Next I extract the address as follows:
soup.find_all("div", {"class":"col-sm-12 d-fontSize--largest d-text d-color--brandDark"})
# this returns:
<div class="col-sm-12 d-fontSize--largest d-text d-color--brandDark">
<span class="formula J_formula"><a href="javascript:__doPostBack('_ctl0$m_DisplayCore','Redisplay|4526,,0')">403 W Main Street</a></span></div>
Next I get the town, state, and ZIP code from the outer HTML below:
soup.find_all("div", {"class":"col-sm-12 d-fontSize--small d-textSoft d-paddingBottom--8"})
# which returns a list of these
<div class="col-sm-12 d-fontSize--small d-textSoft d-paddingBottom--8">
<span class="formula J_formula"> Cleveland, MO 64734</span></div>
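The town, state, and ZIP code arrive as a single string such as ` Cleveland, MO 64734`. A minimal sketch of splitting it with a regular expression follows. The name `parse_location` is hypothetical, and the pattern assumes every entry looks like `City, ST 12345` (optionally with a ZIP+4 suffix), which the two samples in this post match but the site does not guarantee.

```python
import re

# Assumed shape: "City, ST 12345", optional surrounding whitespace,
# optional ZIP+4 suffix. Anything else is reported as None.
LOCATION_RE = re.compile(
    r"^\s*(?P<city>.+?),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def parse_location(text):
    """Split a 'City, ST ZIP' string into a dict, or return None if it does not match."""
    match = LOCATION_RE.match(text)
    return match.groupdict() if match else None


print(parse_location(" Cleveland, MO 64734"))
# {'city': 'Cleveland', 'state': 'MO', 'zip': '64734'}
```

In practice the input would be the text of each matched `div`, for example `div.get_text()`; returning `None` for unexpected shapes makes odd entries easy to spot instead of silently mis-splitting them.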
Next, I could use some help here, but I think I see enough of a pattern to work it out myself:
soup.find_all("span", {"class":"d-textStrong d-paddingRight--5"})
#this returns the following
[<span class="d-textStrong d-paddingRight--5">3</span>,
<span class="d-textStrong d-paddingRight--5">1</span>,
<span class="d-textStrong d-paddingRight--5">1</span>,
<span class="d-textStrong d-paddingRight--5">1,307</span>,
<span class="d-textStrong d-paddingRight--5">Single Family</span>,
<span class="d-textStrong d-paddingRight--5">2</span>,
<span class="d-textStrong d-paddingRight--5">1</span>,
<span class="d-textStrong d-paddingRight--5">0</span>,
<span class="d-textStrong d-paddingRight--5">960</span>,
<span class="d-textStrong d-paddingRight--5">Single Family</span>,
# which looks to have the format
#bedrooms
#bathrooms
#half baths
#square footage
#residence type
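If that five-field pattern holds, the flat list can be cut into one record per listing. This is only a sketch: `FIELDS` and `group_listing_values` are names made up for illustration, and it assumes every listing contributes exactly five spans in that order. A listing with a missing field would shift every record after it, so the function refuses input whose length is not a multiple of five.

```python
FIELDS = ("bedrooms", "full_baths", "half_baths", "sqft", "residence_type")


def group_listing_values(values, fields=FIELDS):
    """Group a flat list of span texts into one dict per listing."""
    size = len(fields)
    if len(values) % size:
        raise ValueError(
            f"expected a multiple of {size} values, got {len(values)}"
        )
    return [
        dict(zip(fields, values[start:start + size]))
        for start in range(0, len(values), size)
    ]


# The ten span texts shown above, as plain strings.
values = ["3", "1", "1", "1,307", "Single Family",
          "2", "1", "0", "960", "Single Family"]
for record in group_listing_values(values):
    print(record)
```

With the real page, `values` would come from something like `[s.get_text(strip=True) for s in soup.find_all("span", {"class": "d-textStrong d-paddingRight--5"})]`.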
Which brings me to my question. At the bottom of the page is a "See More Results" button. Here are its outer HTML and the full HTML:
#outer
<a role="button" class="btn mtx-btn-brandAlt" id="_ctl0_m_DisplayCore_dpy121" href="javascript:PortalResultsJs.getNextDisplaySet();">See More Results</a>
#full
<!--Paging Next link-->
<div id="_ctl0_m_divPagedListingsNext" class="mtx-pageMore j-resultsPageNext hidden-print" style="display: block;"><a role="button" class="btn mtx-btn-brandAlt" id="_ctl0_m_DisplayCore_dpy121" href="javascript:PortalResultsJs.getNextDisplaySet();">See More Results</a></div>
When I click it, it just brings in the next 24 results without taking me to a new page, so I don't know what I would do to scrape it.
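Since `requests` does not run JavaScript, it only ever sees the first batch. One possible direction, sketched here and not tested against this site, is to let a Selenium-driven browser call the same JavaScript the button's `href` calls, wait for the listing count to grow, and only then hand the rendered page to Beautiful Soup. The helper names, the 96-listing target, and the idea of counting `span.d-fontSize--largest` elements as a proxy for listings are all assumptions.

```python
import time

# Taken from the button's href in the question.
NEXT_SET_JS = "PortalResultsJs.getNextDisplaySet();"
# Assumption: one price span per listing, so counting them counts listings.
PRICE_SELECTOR = "span.d-fontSize--largest"
# "css selector" is the string value of selenium's By.CSS_SELECTOR.
BY_CSS = "css selector"


def load_more_results(driver, target=96, timeout=10.0, poll=0.5):
    """Request further batches until `target` listings are present or nothing new loads.

    Returns the number of listings found on the page.
    """
    count = len(driver.find_elements(BY_CSS, PRICE_SELECTOR))
    while count < target:
        driver.execute_script(NEXT_SET_JS)
        deadline = time.monotonic() + timeout
        new_count = count
        # Poll until the new batch shows up or the timeout expires.
        while new_count <= count and time.monotonic() < deadline:
            time.sleep(poll)
            new_count = len(driver.find_elements(BY_CSS, PRICE_SELECTOR))
        if new_count <= count:
            break  # nothing new appeared; assume there are no more results
        count = new_count
    return count


def fetch_rendered_soup(url):
    """Open the page in Chrome, load extra results, and return a BeautifulSoup tree."""
    # Imported here so load_more_results can be exercised without a browser.
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        load_more_results(driver)
        return BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()
```

Because the page keeps at most 96 listings, going past that point would mean parsing the page after each batch and de-duplicating, rather than parsing once at the end.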
There may be a better way to scrape this, so here is the HTML for an entire listing:
<div class=" col-lg-7 col-md-6 col-sm-12">
<div class="row">
<div class="col-xs-9 col-sm-8 col-md-8 col-lg-8">
<span class="d-fontSize--largest">$90,000</span><span class="d-paddingLeft--6 d-paddingBottom--2"></span></div>
<div class="col-xs-3 d-textAlign--right col-sm-4 col-md-3">
<span class="formula J_formula"><div class="dropdown mtx-dropdownModal mtx-bucketSelector j-portalBucketSelector" data-key="51065729" data-currentbucket="0"><a href="#" title="Save as Favorite" onclick="Dpy.changeDropDownPosition( this );" class="mtx-btn-link mtx-icon mtx-icon-bucketNone j-portalBucketSelectorIcon" data-toggle="dropdown" style="display:inline-block;"></a><ul class="dropdown-menu is-bucketNone mtx-bucketSelector-menu j-portalBucketSelector-menu"><li class="mtx-bucket--favoriteRemove"><a href="#" onclick="Dpy.clickPortalBucketResponsive("51065729","6",event);"><span class="mtx-btn-link mtx-icon mtx-icon--small mtx-icon-bucketFavoriteRemove"></span><span class="mtx-textSoft" style="vertical-align:middle;">Remove from Favorites</span></a></li><li class="mtx-bucket--possibilitiesRemove"><a href="#" onclick="Dpy.clickPortalBucketResponsive("51065729","4",event);"><span class="mtx-btn-link mtx-icon mtx-icon--small mtx-icon-bucketPossibilitiesRemove"></span><span class="mtx-textSoft" style="vertical-align:middle;">Remove from Possibilities</span></a></li><li class="mtx-bucket--discardsRemove"><a href="#" onclick="Dpy.clickPortalBucketResponsive("51065729","2",event);"><span class="mtx-btn-link mtx-icon mtx-icon--small mtx-icon-bucketDiscardsRemove"></span><span class="mtx-textSoft" style="vertical-align:middle;">Remove from Discards</span></a></li><li class="mtx-bucket--favorite"><a href="#" onclick="Dpy.clickPortalBucketResponsive("51065729","6",event);"><span class="mtx-btn-link mtx-icon mtx-icon--small mtx-icon-bucketFavorite"></span><span class="mtx-textSoft" style="vertical-align:middle;">Save as Favorite</span></a></li><li class="mtx-bucket--possibilities"><a href="#" onclick="Dpy.clickPortalBucketResponsive("51065729","4",event);"><span class="mtx-btn-link mtx-icon mtx-icon--small mtx-icon-bucketPossibilities"></span><span class="mtx-textSoft" style="vertical-align:middle;">Save as Possibility</span></a></li><li class="mtx-bucket--discards"><a 
href="#" onclick="Dpy.clickPortalBucketResponsive("51065729","2",event);"><span class="mtx-btn-link mtx-icon mtx-icon--small mtx-icon-bucketDiscards"></span><span class="mtx-textSoft" style="vertical-align:middle;">Discard Listing</span></a></li></ul></div></span>
</div>
<div class=" col-xs-9 d-fontSize--small col-sm-8 col-md-8 col-lg-8">
<span class="formula J_formula"><span class="Status_SOLD">Sold</span></span></div>
</div>
<div class="row">
<div class=" col-sm-12 d-fontSize--largest d-text d-color--brandDark">
<span class="formula J_formula"><a href="javascript:__doPostBack('_ctl0$m_DisplayCore','Redisplay|4526,,9')">1324 W Campbell Boulevard</a></span></div>
<div class=" col-sm-12 d-fontSize--small d-textSoft d-paddingBottom--8">
<span class="formula J_formula"> Raymore, MO 64083</span></div>
<div class=" col-sm-12">
</div>
<div class=" col-sm-12">
<div class="row"><div class="col-sm-12"><span class="d-textStrong d-paddingRight--5">2</span><span class="d-text d-fieldsSeparatorComma">Bedrms</span><span class="d-textStrong d-paddingRight--5">2</span><span class="d-text d-fieldsSeparatorComma">Full Bath(s)</span><span class="d-textStrong d-paddingRight--5">0</span><span class="d-text d-fieldsSeparatorComma">Half Bath(s)</span><span class="d-textStrong d-paddingRight--5">1,386</span><span class="d-text d-fieldsSeparatorComma">Sqft</span><span class="d-text d-paddingRight--5">Built in</span><span class="d-textStrong d-fieldsSeparatorComma">1980</span><span class="d-textStrong d-paddingRight--5">Patio/Villa</span></div></div><div class="row"></div><div class="row"></div><div class="row"></div></div>
<div class=" col-sm-12">
</div><div class=" col-sm-12">
</div><div class=" col-sm-12">
</div>
<div class="col-sm-12 hidden-sm d-paddingTop--4 d-paddingBottom--4 hidden-md hidden-xs">
<span class="d-textSoft">Experience maintenance free living in this premier retirement communitie! Enjoy numerous amenities...</span></div><div class=" col-sm-12">
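Going by this single sample, the detail row pairs each value with a label, so the labels can be read from the markup instead of being inferred from position. The sketch below relies on a pattern seen only in this one listing: the span carrying `d-paddingRight--5` comes first in each pair, and it is the value when it also has `d-textStrong` ("2" then "Bedrms") but the label otherwise ("Built in" then "1980"). The function name and the `"Type"` key for the unlabeled trailing span are made up for illustration.

```python
from bs4 import BeautifulSoup

# The detail row from the listing above, split across lines for readability.
ROW_HTML = (
    '<div class="col-sm-12">'
    '<span class="d-textStrong d-paddingRight--5">2</span>'
    '<span class="d-text d-fieldsSeparatorComma">Bedrms</span>'
    '<span class="d-textStrong d-paddingRight--5">2</span>'
    '<span class="d-text d-fieldsSeparatorComma">Full Bath(s)</span>'
    '<span class="d-textStrong d-paddingRight--5">0</span>'
    '<span class="d-text d-fieldsSeparatorComma">Half Bath(s)</span>'
    '<span class="d-textStrong d-paddingRight--5">1,386</span>'
    '<span class="d-text d-fieldsSeparatorComma">Sqft</span>'
    '<span class="d-text d-paddingRight--5">Built in</span>'
    '<span class="d-textStrong d-fieldsSeparatorComma">1980</span>'
    '<span class="d-textStrong d-paddingRight--5">Patio/Villa</span>'
    '</div>'
)


def parse_detail_row(row_html):
    """Return a {label: value} dict for one listing's detail row."""
    soup = BeautifulSoup(row_html, "html.parser")
    details = {}
    for first in soup.select("span.d-paddingRight--5"):
        second = first.find_next_sibling("span")
        first_is_value = "d-textStrong" in first.get("class", [])
        if second is None:
            # Trailing span with no partner; in the sample it is the residence type.
            details["Type"] = first.get_text(strip=True)
        elif first_is_value:
            details[second.get_text(strip=True)] = first.get_text(strip=True)
        else:
            details[first.get_text(strip=True)] = second.get_text(strip=True)
    return details


print(parse_detail_row(ROW_HTML))
```

Keying on labels means a listing that lacks, say, a year built simply produces a dict without that key, instead of shifting the remaining fields.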
【Comments】:
Tags: html selenium web-scraping beautifulsoup