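[Posted]: 2019-07-10 20:24:19
[Question]:
I'm trying to get the link to every article in this category of the SF Chronicle, but I'm not sure where to start extracting the URLs. Here is what I have so far: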
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.sfchronicle.com/local/'
# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
zone2_container = page_soup.findAll("div",{"class":"zone zone-2"})
zone3_container = page_soup.findAll("div",{"class":"zone zone-3"})
zone4_container = page_soup.findAll("div",{"class":"zone zone-4"})
right_rail_container = page_soup.findAll("div",{"class":"right-rail"})
All the links I want are inside zone2_container through zone4_container and right_rail_container.
[Comments]:
Just select the href attribute from the <a> tags (for example: urls = [i['href'] for i in page_soup.select('div.zone.zone-2 a')])
How can I do the same for div.zone.zone-1?
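The selector suggested in the comment can be applied to several containers at once, including div.zone.zone-1. Below is a minimal sketch, not a definitive answer: the helper name extract_links, the CONTAINER_SELECTORS list, and the sample markup are hypothetical, and whether the live page actually contains a zone-1 div is an assumption. It uses "a[href]" so that anchors without an href are skipped instead of raising KeyError, and urljoin so that relative links become absolute.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# CSS selectors for the containers named in the question. "div.zone.zone-1"
# is listed only because the follow-up comment asks about it; its presence
# on the real page is an assumption.
CONTAINER_SELECTORS = [
    "div.zone.zone-1",
    "div.zone.zone-2",
    "div.zone.zone-3",
    "div.zone.zone-4",
    "div.right-rail",
]


def extract_links(page_html, base_url, selectors=CONTAINER_SELECTORS):
    """Return de-duplicated absolute URLs of <a href> tags inside the containers."""
    page_soup = BeautifulSoup(page_html, "html.parser")
    urls = []
    seen = set()
    for selector in selectors:
        # "a[href]" matches only anchors that carry an href attribute.
        for a in page_soup.select(selector + " a[href]"):
            url = urljoin(base_url, a["href"])  # resolve relative links
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Hypothetical markup standing in for the downloaded page.
sample_html = """
<div class="zone zone-1"><a href="/local/article-one.php">One</a></div>
<div class="zone zone-2"><a href="/local/article-two.php">Two</a><a name="top">no href</a></div>
<div class="right-rail"><a href="/local/article-one.php">One again</a></div>
<div class="footer"><a href="/about">About</a></div>
"""

print(extract_links(sample_html, "https://www.sfchronicle.com/local/"))
```

With the variables from the question, the call would be extract_links(page_html, my_url). Links outside the listed containers (the footer in the sample) are ignored, and a link that appears in more than one container is returned once.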
Tags: python-3.x web-scraping beautifulsoup html-parsing