Unable to get regex in Python to match pattern: answers

【Question title】: Unable to get regex in python to match pattern
【Posted】: 2021-09-17 22:42:11
【Question】:

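I am trying to extract a number from a copy of an HTML page that I fetched with urllib.request.

I have tried several different regex patterns but get no output, so I am clearly not writing the pattern correctly, and I cannot get it to work.

Below is a small part of the HTML as it appears in my string: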

</ul>\n        \n        <p>* * * * *</p>\n        -->\n        \n        <b>DistroWatch database summary</b><br/>\n        <ul>\n        <li>Number of <a href="search.php?status=All">all distributions</a> in the database: 926<br/>\n        <li>Number of <a href="search.php?status=Active">

I am trying to pull 926 out of the string. My code is below; I do not know what I am doing wrong.

import urllib.request
import re

page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')

#print(page.read())
print(page.read())

pageString = str(page.read())
#print(pageString)
DistroCount = re.search('^all distributions</a> in the database: ....<br/>\n$', pageString)

print(DistroCount)

Any help, pointers, or suggested resources would be much appreciated.

【Comments】:

Try this: all distributions</a> in the database: (\d{3})<br/> together with print(DistroCount.group(1))
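The comment's suggestion can be tried without fetching the page. The sketch below runs it against the excerpt from the question; the \d+ variant and the four-digit input are illustrative assumptions, not part of the original comment. They show that \d{3} matches exactly three digits and would stop matching if the count ever reached 1000.

```python
import re

# Excerpt from the HTML in the question, used as stand-in input.
html = ('<li>Number of <a href="search.php?status=All">all distributions</a>'
        ' in the database: 926<br/>')

# Pattern from the comment: exactly three digits, captured in group 1.
three_pattern = re.compile(r'all distributions</a> in the database: (\d{3})<br/>')
three_digits = three_pattern.search(html)
print(three_digits.group(1))

# Hypothetical variant (an assumption): \d+ accepts one or more digits.
any_pattern = re.compile(r'all distributions</a> in the database: (\d+)<br/>')
any_digits = any_pattern.search(html)
print(any_digits.group(1))

# Hypothetical four-digit count to compare the two patterns.
bigger = html.replace('926', '1000')
three_on_bigger = three_pattern.search(bigger)  # no match: four digits
any_on_bigger = any_pattern.search(bigger)      # matches and captures 1000
print(three_on_bigger, any_on_bigger.group(1))
```

re.search returns None when nothing matches, so check the result before calling group(1) on real input.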

Tags: python regex

【Solution 1】:

You can use BeautifulSoup to convert the HTML to plain text and then apply a simple regex to extract the number that follows a hard-coded string:

import urllib.request, re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
html = page.read()
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
m = re.search(r'all distributions in the database:\s*(\d+)', text)
if m:
    print(m.group(1))

# => 926

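Here,

soup.get_text() converts the HTML to plain text, which is stored in the text variable
The regex all distributions in the database:\s*(\d+) matches the literal text all distributions in the database:, then zero or more whitespace characters, then captures one or more digits into group 1 (via (\d+))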

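【Comments】:

【Solution 2】:

I think the problem is that you read the whole document into a single string but put "^" at the start of the regex and "$" at the end. Without the re.MULTILINE flag, "^" matches only at the start of the string and "$" only at its end, so the regex can match only if that text makes up the entire string.

Either remove ^ and $ (and the \n...), or process the document line by line.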
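Both options can be sketched on a small sample that stands in for the downloaded page. This is a minimal sketch, not a definitive fix: the sample string, the variable names, and the use of \s*(\d+) in place of the question's .... are assumptions made for illustration. For the real page, decoding the bytes (for example page.read().decode('utf-8', errors='replace')) rather than wrapping them in str() would give a string with real newlines. Also note that read() consumes the response, so a second page.read() on the same response returns empty bytes.

```python
import re

# Sample text standing in for the decoded page (an assumption for illustration).
sample = (
    '<b>DistroWatch database summary</b><br/>\n'
    '<ul>\n'
    '<li>Number of <a href="search.php?status=All">all distributions</a>'
    ' in the database: 926<br/>\n'
    '<li>Number of <a href="search.php?status=Active">\n'
)

# Unanchored pattern; the digits are captured in group 1.
pattern = re.compile(r'all distributions</a> in the database:\s*(\d+)')

# Option 1: drop ^ and $ and search the whole document at once.
m = pattern.search(sample)
count_whole = m.group(1) if m else None

# Option 2: process the document line by line and stop at the first hit.
count_by_line = None
for line in sample.splitlines():
    m = pattern.search(line)
    if m:
        count_by_line = m.group(1)
        break

# For comparison: the anchored pattern from the question finds nothing here,
# because the sample does not start with "all distributions".
anchored = re.search(
    '^all distributions</a> in the database: ....<br/>\n$', sample)

print(count_whole, count_by_line, anchored)
```

Searching the whole string is the shorter route; the line-by-line loop is mainly useful when the document is large or is read incrementally.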

【Comments】: