在大字符串中搜索文件路径。返回文件路径+文件名答案

【问题标题】：Searching Large String for file path. Return filepath + filename在大字符串中搜索文件路径。返回文件路径+文件名
【发布时间】：2015-08-15 17:48:23
【问题描述】：

我有一个小项目，我正在尝试从网页下载一系列壁纸。我是 python 新手。

我正在使用urllib 库，它返回一长串网页数据，其中包括

<a href="http://website.com/wallpaper/filename.jpg">

我知道我需要下载的每个文件名都有

'http://website.com/wallpaper/'

如何在页面源中搜索这部分文本，并返回图像链接的其余部分，以“*.jpg”扩展名结尾？

r'http://website.com/wallpaper/ xxxxxx .jpg'

我在想是否可以格式化一个不计算 xxxx 部分的正则表达式？只需检查路径和 .jpg 扩展名。然后在找到匹配项后返回整个字符串

我在正确的轨道上吗？

【问题讨论】：

您可以使用regex，但不要。也许BeautifulSoup

标签： python regex string beautifulsoup html-parsing

【解决方案1】：

我认为一个非常基本的正则表达式就可以了。
喜欢：

(http:\/\/website\.com\/wallpaper\/[\w\d_-]*?\.jpg)

如果你使用$1this 将返回整个字符串。

如果你使用

(http:\/\/website\.com\/wallpaper\/([\w\d_-]*?)\.jpg)

然后$1 将给出整个字符串，$2 将只给出文件名。

注意：转义 (\/) 取决于语言，因此请使用 python 支持的内容。

【讨论】：

【解决方案2】：

BeautifulSoup 做这类事情很方便。

import re
import urllib3
from bs4 import BeautifulSoup

jpg_regex = re.compile('\.jpg$')
site_regex = re.compile('website\.com\/wallpaper\/')

pool = urllib3.PoolManager()
request = pool.request('GET', 'http://your_website.com/')
soup = BeautifulSoup(request)

jpg_list = list(soup.find_all(name='a', attrs={'href':jpg_regex}))
site_list = list(soup.find_all(name='a', attrs={'href':site_regex}))

result_list = map(lambda a: a.get('href'), jpg_list and site_list)

【讨论】：

感谢您的回复。 Beautifulsoup 两票，我想我需要检查一下。我在我的 google-fu 中遇到过它，但不知道如何使用它

【解决方案3】：

不要对 HTML 使用正则表达式。

改为使用 HTML 解析库。

BeautifulSoup 是一个用于解析 HTML 的库，urllib2 是一个用于获取 URL 的内置模块

import urllib2
from bs4 import BeautifulSoup as bs

content = urllib2.urlopen('http://website.com/wallpaper/index.html').read()
html = bs(content)
links = [] # an empty list

for link in html.find_all('a'):
   href = link.get('href')
   if '/wallpaper/' in href:
      links.append(href)

【讨论】：

感谢您的回复。 Beautifulsoup 两票，我想我需要检查一下。我在我的 google-fu 中遇到过它，但不知道如何使用它

【解决方案4】：

在url中搜索“http://website.com/wallpaper/”子串，然后在url中查找“.jpg”，如下图：

domain = "http://website.com/wallpaper/"
url = str("your URL")
format = ".jpg"
for domain in url and format in url:
    //do something

【讨论】：