【Posted】: 2021-03-11 22:53:44
【Question】:
I have the following Python code to scrape the anchor text and the corresponding href value of each link on a page:
from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
url = "https://www.mydomain.co.uk/contact-us"
session = HTMLSession()
r = session.get(url)
b = requests.get(url)
soup = BeautifulSoup(b.text, "lxml")
for link in soup.find_all('a'):
    print(link.text, '-', link.get('href'))
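It works, but it also picks up image links: when a link wraps an image, there is no anchor text, so only "-" and the href are printed. For example: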
Contact Us - /contact-us
About Us - /about
- /locations
I would like it to ignore any image links, so that the output is:
Contact Us - /contact-us
About Us - /about
Is this possible?
Thanks
【Discussion】:
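One possible approach, as a minimal sketch rather than a definitive answer: skip any anchor that contains an img tag or has no visible text. The names SAMPLE_HTML and text_links below are hypothetical, the sample markup is an assumption about what the page looks like, and "html.parser" is used instead of "lxml" only so the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the fetched page (an assumption).
SAMPLE_HTML = """
<a href="/contact-us">Contact Us</a>
<a href="/about">About Us</a>
<a href="/locations"><img src="/img/map.png" alt="Locations"></a>
"""


def text_links(html):
    """Return (anchor text, href) pairs, skipping image or empty anchors."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # href=True keeps only <a> tags that actually carry an href attribute.
    for link in soup.find_all("a", href=True):
        text = link.get_text(strip=True)
        # Skip anchors that wrap an <img> or have no visible text.
        if link.find("img") is not None or not text:
            continue
        pairs.append((text, link["href"]))
    return pairs


for text, href in text_links(SAMPLE_HTML):
    print(text, "-", href)
```

With the sample markup this prints only the Contact Us and About Us lines. If links that contain both an image and text should be kept, the link.find("img") check can be dropped so that only anchors with empty text are skipped.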
Tags: python web-scraping beautifulsoup