【Posted】: 2021-10-07 17:22:29
【Problem description】:
This is the relevant part of the HTML that is being scraped:
<div class="blockSpoiler-content">
<div class="contentSpoiler">
<div class="link-box" id="62H" style="background-color: rgb(65, 120, 50);">
<div class="status-box"><i class="working" title="Working"></i></div>
<a rel="external" href="https://url1.net.html" target="_blank">Link1</a>
</div>
<div class="link-box" id="IFA" style="background-color: rgb(65, 120, 50);">
<div class="status-box"><i class="working" title="Working"></i></div>
<a rel="external" href="https://url2.net.html" target="_blank">Link2</a>
</div>
<div class="link-box" id="ruG" style="background-color: rgb(65, 120, 50);">
<div class="status-box"><i class="working" title="Working"></i></div>
<a rel="external" href="https://url3.com.html" target="_blank">Link3</a>
</div>
<div class="link-box" id="Bdf" style="background-color: rgb(65, 120, 50);">
<div class="status-box"><i class="working" title="Working"></i></div>
<a rel="external" href="https://url4.com" target="_blank">Link4</a>
</div>
<div class="link-box" id="1Da" style="background-color: rgb(65, 120, 50);">
<div class="status-box"><i class="working" title="Working"></i></div>
<a rel="external" href="https://url5.net.html" target="_blank">Link5</a>
</div>
</div>
</div>
I am trying to get these URLs:
- https://url1.net.html
- https://url2.net.html
- https://url3.com.html
- https://url4.com
- https://url5.net.html
I have tried different things, but only got this far (the local file is only for testing, before scraping the live page):
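from bs4 import BeautifulSoup

# Parse the local test copy of the page
with open("mainLocalFile.html") as fp:
    soup2 = BeautifulSoup(fp, 'html.parser')

# find_all returns the outer container div(s), not the individual links
links = soup2.find_all('div', class_='blockSpoiler-content')
# print(links)
for link in links:
    print(link)
    print(link.a)  # prints only the first tag
    print(link.a['href'])  # prints only the first URL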
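The attempt above stops at the first URL because `link.a` is shorthand for `link.find('a')`, which returns only the first matching descendant. One possible way to reach every link is to call `find_all('a', href=True)` on each container instead. The following is a minimal sketch, not a definitive solution: the `html` string is a trimmed copy of the markup from the question (the `style` attributes and `status-box` divs are dropped), inlined so the example runs without `mainLocalFile.html`, and the names `html`, `soup`, `block`, and `urls` are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumption: inlined here instead of
# reading mainLocalFile.html, so the sketch is self-contained).
html = """
<div class="blockSpoiler-content">
<div class="contentSpoiler">
<div class="link-box" id="62H"><a rel="external" href="https://url1.net.html" target="_blank">Link1</a></div>
<div class="link-box" id="IFA"><a rel="external" href="https://url2.net.html" target="_blank">Link2</a></div>
<div class="link-box" id="ruG"><a rel="external" href="https://url3.com.html" target="_blank">Link3</a></div>
<div class="link-box" id="Bdf"><a rel="external" href="https://url4.com" target="_blank">Link4</a></div>
<div class="link-box" id="1Da"><a rel="external" href="https://url5.net.html" target="_blank">Link5</a></div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
# Outer loop: each container div; inner loop: every <a> inside it that has an href.
for block in soup.find_all("div", class_="blockSpoiler-content"):
    for a in block.find_all("a", href=True):
        urls.append(a["href"])

for url in urls:
    print(url)
```

Passing `href=True` skips any anchor without an `href` attribute, so `a["href"]` should not raise a `KeyError` on such tags.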
【Discussion】:
Tags: python web-scraping beautifulsoup