I'm trying to find a specific link in a page with a Python script

【Question title】: I'm trying to find a specific link in a page with a python script
【Posted】: 2021-01-25 11:26:43
【Question description】:

I'm trying to figure out how to extract only the links that contain a specific text from a given site.

This is the program I'm using:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("https://www.example.net/")
soup = BeautifulSoup(html_page)
linkContent = "Tartan Flannel Shirt "
for link in soup.findAll('a'):
    print link.get('href')

The HTML link looks like this:

<a class="name-link" href="/shop/all/shirts">Tartan Flannel Shirt </a>

If I run the program above, the output is a list of every link on the site, but I want it to show only the link whose text is Tartan Flannel Shirt.

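【Question comments】:

It's worth noting that you are missing a closing quote, so there is a syntax error in your script.
Using Python 2?
@PatrickArtner Do you suggest using Python 3?
@AleksJ Thanks, I hadn't noticed.
If you are using it, I suggest tagging the question python-2.x. Python 2 is dead and Python 3.9 is around the corner, so you are far behind.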

Tags: python beautifulsoup urllib2

【Solution 1】:

You can pass a lambda function to the text= parameter of .find_all(). For example:

from bs4 import BeautifulSoup


html_doc = '''
    <a href="#1">Something else</a>
    <a href="#2">This link contains Tartan Flannel Shirt</a>
    <a href="#3">Something else</a>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# The lambda can receive None for a tag without a single string child, so guard before `in`.
for link in soup.find_all('a', text=lambda t: t and 'Tartan Flannel Shirt' in t):
    print(link)

Prints:

<a href="#2">This link contains Tartan Flannel Shirt</a>
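Applied to the link from the question, the same idea might look like the sketch below. It is only an illustration: the sample markup and the helper name `hrefs_with_text` are made up for this example, and it uses `string=`, the name newer Beautiful Soup 4 releases accept for the same filter as `text=`. It returns the `href` values rather than the whole tags, since that is what the original script printed.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the link from the question (hypothetical page content).
html_doc = '''
    <a class="name-link" href="/shop/all/jackets">Denim Jacket</a>
    <a class="name-link" href="/shop/all/shirts">Tartan Flannel Shirt </a>
    <a class="name-link" href="/shop/all/hats"><b>Wool</b> Hat</a>
'''


def hrefs_with_text(html, wanted):
    """Return the href of every <a> whose single text string contains `wanted`."""
    soup = BeautifulSoup(html, 'html.parser')
    # `t` may be None for tags without exactly one string child, so guard before `in`.
    links = soup.find_all('a', string=lambda t: t is not None and wanted in t)
    return [link.get('href') for link in links]


print(hrefs_with_text(html_doc, 'Tartan Flannel Shirt'))  # ['/shop/all/shirts']
```

Because the check is a substring test, the trailing space in the link text from the question does not matter.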

Similarly, you can search attributes this way, for example the href= attribute of a link:

from bs4 import BeautifulSoup


html_doc = '''
    <a href="http://link1">Link1</a>
    <a href="http://link2">Link2</a>
    <a href="http://link3">Link3</a>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# The lambda receives None for <a> tags that have no href, so guard before `in`.
for link in soup.find_all('a', href=lambda t: t and 'link2' in t):
    print(link)

Prints:

<a href="http://link2">Link2</a>
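If a lambda feels heavy for a simple substring match on href, a couple of alternatives may be worth a look. The sketch below assumes a reasonably recent Beautiful Soup 4 (one where `.select()` supports attribute selectors such as `[href*=...]`); the sample markup, including the anchor without an href, is invented for illustration.

```python
import re

from bs4 import BeautifulSoup

html_doc = '''
    <a href="http://link1">Link1</a>
    <a href="http://link2">Link2</a>
    <a name="anchor-only">No href here</a>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# href=True keeps only the <a> tags that have an href attribute at all.
with_href = soup.find_all('a', href=True)

# A compiled regular expression is searched against the attribute value.
by_regex = soup.find_all('a', href=re.compile('link2'))

# CSS attribute selector: href contains the substring "link2".
by_css = soup.select('a[href*="link2"]')

print([a['href'] for a in with_href])  # ['http://link1', 'http://link2']
print([a['href'] for a in by_regex])   # ['http://link2']
print([a['href'] for a in by_css])     # ['http://link2']
```

All three skip the anchor that has no href, so none of them needs an explicit None check.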

Link to beautifulsoup API.

【Discussion】: