HREF values search through the web page using BS4

【Question title】: HREF values search through the web page using BS4
【Posted】: 2013-01-08 15:59:34
【Question】:

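I am developing a third-party application that reads the page source of a web page. From that source we only need to collect the href values that match a pattern like /aems/file/filegetrevision.do?fileEntityId. Is that possible? My attempt gives me all of the href values.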

HTML *(part of the HTML)*

<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
</td>

Code

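for a in soup.find_all('a', {"style": "display:inline; position:relative;"}, href=True):
    href = a['href'].strip()
    # the hrefs already start with "/", so the base URL has no trailing slash
    href = "https://xyz.test.com" + href
    # print inside the loop so every match is shown, not only the last one
    print(href)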

Thanks

【Question comments】:

@CRUSADER Yes, I tried, but it did not work. You can find my attempt above!

Tags: python beautifulsoup

【Solution 1】:

Yes, just use an appropriate filter for the href attribute. For example:

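def is_revision_link(href):  # renamed so it does not shadow the built-in filter()
    # href is None for <a> tags without an href attribute
    return href is not None and '/aems/file/filegetrevision' in href

soup.find_all('a', href=is_revision_link)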
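
As a self-contained illustration of the function filter, here is a minimal sketch run against the HTML fragment from the question. The second and third anchors, and the names HTML, is_revision_link and revision_hrefs, are made up for the example; "html.parser" is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# The first link comes from the question; the other two anchors are
# hypothetical and exist only to show what gets filtered out.
HTML = """
<td width="50%">
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525&cs=9b7sjueBiWLBEMj2ZU4I6fyQoPv-g0NLY9ETqP0gWk4.xyz">
screenshot.doc
</a>
<a href="/aems/other/page.do">other</a>
<a name="top">no href here</a>
</td>
"""


def is_revision_link(href):
    # Beautiful Soup passes the attribute value, which can be None when
    # the tag has no href, so check for None before using `in`.
    return href is not None and '/aems/file/filegetrevision' in href


soup = BeautifulSoup(HTML, "html.parser")
revision_hrefs = [a['href'] for a in soup.find_all('a', href=is_revision_link)]
print(revision_hrefs)
```

Only the first anchor should survive the filter; the other two are skipped because their href is either different or missing.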

Instead of a function, you can also use a RegexObject (a compiled regular expression) as the filter:

import re
# some_regular_expression is a placeholder for your own pattern
href_pattern = re.compile(some_regular_expression)
soup.find_all('a', href=href_pattern)

See the documentation: Kinds of filters
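
For the regular expression variant, one possible pattern for the OP's case is sketched below. The pattern itself, the second link, and the names revision_pattern and matching_hrefs are assumptions for illustration, not something from the question. The pattern is anchored with "^", so it behaves the same whether the library applies it from the start of the string or searches anywhere in it.

```python
import re
from bs4 import BeautifulSoup

# First link follows the question's pattern; the second one is hypothetical.
HTML = """
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525">screenshot.doc</a>
<a href="/aems/file/filelist.do?folderId=7">folder</a>
"""

# "." and "?" are regex metacharacters, so they are escaped to match literally.
# The leading "^" anchors the pattern at the start of the href value.
revision_pattern = re.compile(r'^/aems/file/filegetrevision\.do\?fileEntityId=')

soup = BeautifulSoup(HTML, "html.parser")
matching_hrefs = [a['href'] for a in soup.find_all('a', href=revision_pattern)]
print(matching_hrefs)
```

Forgetting to escape the "?" is an easy mistake here: unescaped, it would make the preceding "o" optional instead of matching the query-string separator.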

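【Comments】:

In this case I would probably use href.startswith('...'). Shouldn't the regex example be re.compile('...').match or partial(re.match, '...')?
@JonClements No need, BS groks RegexObjects just as well as callables. +1 for startswith, though; I'm not sure exactly which filter the OP wants, but it could come in handy.
You're right (for some reason I thought it didn't do that; I may have confused it with another library). The documentation says: "If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its match() method." (It might be worth adding that to your answer for clarity.) +1 anyway.
@Kos I want to extract all the href values from the web page that match the pattern /aems/file/filegetrevision.do?fileEntityId. Judging by your answer, I think you understood it correctly. :)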
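
The startswith idea from the comments could look roughly like this. It is only a sketch: the host is the one from the question's code, the second link is invented, and BASE_URL, PREFIX, has_revision_prefix and absolute_urls are hypothetical names. urljoin from the standard library is used instead of string concatenation so the slash between host and path is handled for you.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://xyz.test.com/"  # host taken from the question's code
PREFIX = "/aems/file/filegetrevision.do?fileEntityId"

# First link follows the question's pattern; the second one is hypothetical.
HTML = """
<a href="/aems/file/filegetrevision.do?fileEntityId=10597525">screenshot.doc</a>
<a href="/help/index.do">help</a>
"""


def has_revision_prefix(href):
    # Plain string prefix check, no regex escaping needed;
    # href is None for anchors that have no href attribute.
    return href is not None and href.startswith(PREFIX)


soup = BeautifulSoup(HTML, "html.parser")
absolute_urls = [
    urljoin(BASE_URL, a['href'].strip())
    for a in soup.find_all('a', href=has_revision_prefix)
]
print(absolute_urls)
```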