What Beautiful Soup findAll regex string to use?

【Question title】: What Beautiful Soup findAll regex string to use?
【Posted】: 2017-06-05 02:29:04
【Question】:

I have links in my HTML of the form

<a href="/downloadsServlet?docid=abc" target="_blank">Report 1</a>
<a href="/downloadsServlet?docid=ixyz" target="_blank">Fetch Report 2 </a>

I can get a list of links of the above form using BeautifulSoup.

My code is as follows

import urllib2  # Python 2; in Python 3 use urllib.request instead
from bs4 import BeautifulSoup

html_page = urllib2.urlopen(url)
soup = BeautifulSoup(html_page)
listOfLinks = list(soup.findall('a'))

However, I want to find only the links whose link text contains the word "Fetch".

I tried

soup.findAll('a', re.compile(".*Fetch.*"))

But this does not work. How do I select only the tags that have an href and whose text contains the word "Fetch"?

【Comments】:

Tags: python regex web-scraping beautifulsoup

【Solution 1】:

import re
soup.findAll('a', text = re.compile("Fetch"))

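You can pass a regular expression as a filter; Beautiful Soup applies it to each tag using the pattern's search() method.

text (also accepted as string) is the tag's text value, so text = re.compile("Fetch") finds the tags whose text value contains 'Fetch'.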
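To make the filter concrete, here is a minimal sketch run against the two links from the question. The `html` variable and the `"html.parser"` choice are assumptions for illustration. It uses the `string` keyword, which is the name Beautiful Soup 4.4.0 and later give to the argument this answer calls `text`.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question; the variable name is illustrative.
html = '''
<a href="/downloadsServlet?docid=abc" target="_blank">Report 1</a>
<a href="/downloadsServlet?docid=ixyz" target="_blank">Fetch Report 2 </a>
'''

soup = BeautifulSoup(html, "html.parser")

# The compiled pattern is applied with search(), so "Fetch" may appear
# anywhere in the link text, not only at the start.
fetch_links = soup.find_all("a", string=re.compile("Fetch"))

for link in fetch_links:
    print(link["href"], repr(link.get_text()))
```

Note that this filter is compared against the tag's `.string`, which is `None` when the link text is split across nested tags (for example `<a>Fetch <b>Report</b></a>`), so such links would likely be skipped. A function filter that reads the tag's full text does not have that limitation.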

One more thing: use find_all() or findAll(). findall() is not a method in bs4.
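The difference can be checked directly. The sketch below assumes a one-link sample document and the built-in `"html.parser"`. It shows that bs4 treats an unknown attribute such as `findall` as a tag lookup, which is why the all-lowercase spelling does not behave like a search method.

```python
from bs4 import BeautifulSoup

# Tiny illustrative document; not taken from the question.
soup = BeautifulSoup('<a href="/x">Fetch</a>', "html.parser")

# Unknown attribute names are treated as tag lookups, so this is
# equivalent to soup.find("findall") and yields None.
missing = soup.findall
print(missing)  # None

# Calling None raises TypeError, which is what a call such as
# soup.findall('a') would run into as written.
try:
    soup.findall("a")
except TypeError as exc:
    print("TypeError:", exc)

# The supported spellings return the same results.
print(soup.find_all("a") == soup.findAll("a"))
```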

【Comments】:

This doesn't work for me. It only finds exact matches of "Fetch".

【Solution 2】:

Using a regular expression here may be overkill, but it leaves room for extension:

import re

def criterion(tag):
    # Keep tags that have an href and whose text contains "Fetch".
    return tag.has_attr('href') and re.search('Fetch', tag.text)

soup.findAll(criterion)
# [<a href="/downloadsServlet?docid=ixyz" target="_blank">Fetch Report 2 </a>]
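As one possible extension of this approach, the sketch below restricts matches to `<a>` tags and ignores case. The helper name `make_link_filter` and the sample markup (including the nested `<b>` tag and the anchor without an href) are hypothetical and added only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup: one plain link, one link with nested markup,
# and one anchor that has no href at all.
html = '''
<a href="/downloadsServlet?docid=abc" target="_blank">Report 1</a>
<a href="/downloadsServlet?docid=ixyz" target="_blank">Fetch <b>Report</b> 2</a>
<a name="anchor-only">fetch nothing</a>
'''

def make_link_filter(pattern):
    """Build a filter: <a> tags with an href whose full text matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE)

    def criterion(tag):
        # get_text() joins the text of nested tags, unlike tag.string.
        return (
            tag.name == "a"
            and tag.has_attr("href")
            and regex.search(tag.get_text()) is not None
        )

    return criterion

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(make_link_filter("fetch"))
print([tag["href"] for tag in matches])
```

Returning an explicit boolean rather than the match object keeps the filter's result unambiguous, and building the filter from a pattern makes it easy to reuse for other keywords.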

【Comments】:

Awesome! I turned it into a lambda function. Thanks!
I used soup.findAll(lambda tag: tag.has_attr('href') and re.search('Fetch', tag.text))