Regular expression to find precise pdf links in a webpage (answers)

【Question title】: Regular expression to find precise pdf links in a webpage
【Posted】: 2017-02-27 05:56:07
【Question description】:

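Given url='http://normanpd.normanok.gov/content/daily-activity', the site publishes three types of summaries: arrests, incidents, and case summaries. I was asked to use regular expressions in Python to discover the URL strings of all the Incident pdf documents.

The PDFs are then to be downloaded to a specified location.

I went through the links and found that the Incident pdf file URLs have the form: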

normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf

I have written this code:

import re
import urllib.request

url="http://normanpd.normanok.gov/content/daily-activity"

response = urllib.request.urlopen(url)

data = response.read()      # a `bytes` object
text = data.decode('utf-8')
urls=re.findall(r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$',text)

But the urls list comes back empty. I am a beginner with Python 3 and regular expressions. Can anyone help me?

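【Question comments】:

Your regex has "Incident" surrounded by whitespace, but the string does not. There is this site to help with Python patterns.
I forgot to add the text string I get back.
If the spaces are escaped as %20, how do you expect to find the string when you are searching for whitespace?
I am not proficient with regular expressions in Python; I wrote the regex after reading a bit on the internet. I thought (%|\w)+ would cover all the %20 sequences in between.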
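
The point raised in these comments can be checked on a single link. The following is a minimal sketch, not the poster's code: the `sample` string is an assumed stand-in for one line of the page's HTML, built around the href quoted in the question, and `fixed` is one possible corrected pattern. It matches the literal `%20` instead of `\s` and uses character classes rather than capture groups, so `re.findall` returns whole matches.

```python
import re

# Assumed stand-in for one anchor tag on the page; only the href comes from the question.
sample = '<a href="/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf">Incident</a>'

# The pattern from the question: \s only matches whitespace, but the URL
# contains the three literal characters "%20", and "$" anchors the match
# to the end of the whole text.
original = r'(\w|/|-/%)+\sIncident\s(%|\w)+\.pdf$'
print(re.findall(original, sample))  # []

# One possible fix: match "%20" literally, drop the "$" anchor, and use
# character classes so findall returns the full matched path.
fixed = r'[\w/%-]+%20Incident%20[\w%]+\.pdf'
print(re.findall(fixed, sample))
# ['/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf']
```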

Tags: regex python-3.x web-scraping

【Solution 1】:

This is not an advisable approach. Instead, use an HTML parsing library such as bs4 (BeautifulSoup) to find the links, and use a regex only to filter the results.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
# Keep only the <a> tags whose href contains "Incident%20Summary.pdf"
links = soup.find_all('a', href=re.compile(r'Incident%20Summary\.pdf'))

# The hrefs are root-relative, so prepend the site root
for el in links:
    print("http://normanpd.normanok.gov" + el['href'])

Output:

http://normanpd.normanok.gov/filebrowser_download/657/2017-02-23%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-22%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-21%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-20%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-19%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-18%20Daily%20Incident%20Summary.pdf
http://normanpd.normanok.gov/filebrowser_download/657/2017-02-17%20Daily%20Incident%20Summary.pdf

But if you were asked to use only regular expressions, try something simpler:

import urllib.request
import re

url = "http://normanpd.normanok.gov/content/daily-activity"
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
text = data.decode('utf-8')
# The lazy .+? keeps each match as short as possible between the fixed parts
urls = re.findall(r'(filebrowser_download.+?Daily%20Incident.+?\.pdf)', text)
print(urls)
# The matches lack a leading slash, so the base URL ends with "/"
for link in urls:
    print("http://normanpd.normanok.gov/" + link)
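
The question also mentions saving the PDFs to a specified location. The sketch below is one way that step could look, under stated assumptions: the helper names `incident_urls`, `local_name`, and `download_all` are hypothetical, and the pattern is a tightened variant of the one above that excludes quotes, whitespace, and angle brackets so a single match cannot span two links on the same line. `urljoin`, `unquote`, and `urlretrieve` are standard-library functions.

```python
import os
import re
import urllib.request
from urllib.parse import unquote, urljoin

BASE = "http://normanpd.normanok.gov/"
# Tightened variant (an assumption, not the answer's pattern): the character
# class stops at quotes, whitespace, and angle brackets.
PATTERN = re.compile(r'filebrowser_download/[^"\s<>]*?Daily%20Incident[^"\s<>]*?\.pdf')


def incident_urls(html):
    """Return an absolute URL for every Incident PDF path found in html."""
    return [urljoin(BASE, path) for path in PATTERN.findall(html)]


def local_name(pdf_url):
    """Take the last path component and decode escapes such as %20."""
    return unquote(pdf_url.rsplit("/", 1)[-1])


def download_all(html, target_dir):
    """Download every Incident PDF referenced in html into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    for pdf_url in incident_urls(html):
        destination = os.path.join(target_dir, local_name(pdf_url))
        urllib.request.urlretrieve(pdf_url, destination)
```

With `text` holding the decoded page as in the snippet above, a call such as `download_all(text, "incident_pdfs")` would write the files into a folder named `incident_pdfs`; that folder name is only an example.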

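【Comments】:

Your code worked for me.. I also wrote a regex
Thanks @ettore-rizza.. I also wrote a regex s=re.findall(r'\/file[\w|\/|\-|%]+Incident[\w |%]*\.pdf',v), although it did not work for me
The main thing is to find a repeating pattern. If all the files start with "filebrowser_download" and end with ".pdf", why rack your brain?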

【Solution 2】:

Here is a simple way to do it with BeautifulSoup:

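from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://normanpd.normanok.gov/content/daily-activity"
open_page = urlopen(url).read()  # assumed: open_page is the raw page HTML
soup = BeautifulSoup(open_page, 'html.parser')
links = []
for link in soup.find_all('a'):
    current = link.get('href')  # None when the <a> tag has no href
    if current and current.endswith('pdf') and "Incident" in current:
        # urljoin resolves the root-relative href against the page URL
        links.append(urljoin(url, current))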

【Comments】: