Python bs4: Get only the URLs that have a certain string in it

【Question title】: Python bs4: Get only the URLs that have a certain string in it
【Posted】: 2020-12-09 05:21:09
【Question description】:

I am building an image scraper and want to grab some of the photos from this link, then save them in a folder named dribblephotos: https://dribbble.com/search/shots/popular/illustration?q=sneaker%20

Here are the links I retrieved:

https://static.dribbble.com/users/458522/screenshots/6040912/nike_air_huarache_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png
https://static.dribbble.com/users/105681/avatars/mini/avatar-01-01.png?1377980605
https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg
https://static.dribbble.com/users/923409/avatars/mini/bc17b2db165c31804e1cbb1d4159462a.jpg?1596192494
https://static.dribbble.com/users/458522/screenshots/6034458/nike_air_jordan_i_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1237425/screenshots/5071294/customize_air_jordan_web_2x.png
https://static.dribbble.com/users/1237425/avatars/mini/87ae45ac7a07dd69fe59985dc51c7f0f.jpeg?1524130139
https://static.dribbble.com/users/1174720/screenshots/6187664/adidas_2x.png
https://static.dribbble.com/users/1174720/avatars/mini/9de08da40078e869f1a680d2e43cdb73.png?1588733495
https://static.dribbble.com/users/179617/screenshots/4426819/ultraboost_1x.png
https://static.dribbble.com/users/179617/avatars/mini/2d545dc6c0dffc930a2b20ca3be88802.jpg?1596735027
https://static.dribbble.com/users/458522/screenshots/6126041/nike_air_max_270_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/60266/screenshots/6698826/nike_shoe_2x.jpg
https://static.dribbble.com/users/60266/avatars/mini/64826d925db1d4178258d17d8826842b.png?1549028805
https://static.dribbble.com/users/78464/screenshots/4950025/8x600_1x.jpg
https://static.dribbble.com/users/78464/avatars/mini/a9ae6a559ab479d179e8bd22591e4028.jpg?1465908886
https://static.dribbble.com/users/458522/screenshots/6118702/adidas_nmd_r1_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/458522/screenshots/6098953/nike_lebron_10_je_icon_qs_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/879147/screenshots/7152093/img_0966_2x.png
https://static.dribbble.com/users/879147/avatars/mini/e095f3837f221bb2ef652dcc966b99f7.jpg?1568473177
https://static.dribbble.com/users/458522/screenshots/6128979/nerd_x_adidas_pharrell_hu_nmd_trail_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/879147/screenshots/11064235/26fa4a2d-9033-4953-b48f-4c0e8a93fc9d_2x.png
https://static.dribbble.com/users/879147/avatars/mini/e095f3837f221bb2ef652dcc966b99f7.jpg?1568473177
https://static.dribbble.com/users/458522/screenshots/6132938/nike_moon_racer_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1823684/screenshots/5973495/jordannn1_2x.png
https://static.dribbble.com/users/1823684/avatars/mini/f6041c082aec67302d4b78b8d203f02b.png?1509719582
https://static.dribbble.com/users/552027/screenshots/4666241/airmax270_1x.jpg
https://static.dribbble.com/users/552027/avatars/mini/35bb0dcb5a6619f68816290898bff6cc.jpg?1535884243
https://static.dribbble.com/users/458522/screenshots/6044426/adidas_pharrell_hu_nmd_trail_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/220914/screenshots/11295053/woman_shoe_tree_floating2_2x.png
https://static.dribbble.com/users/220914/avatars/mini/d364a9c166edb6d96cc059a836219a7d.jpg?1590773568
https://static.dribbble.com/users/4040486/screenshots/7079508/___2x.png
https://static.dribbble.com/users/4040486/avatars/mini/f31e9b50df877df815177e2015135ff7.png?1582521697
https://static.dribbble.com/users/57602/screenshots/12909636/d2_2x.png
https://static.dribbble.com/users/57602/avatars/mini/b4c27f3be2c61d82fbc821433d058b04.jpg?1575089000
https://static.dribbble.com/users/458522/screenshots/6049522/nike_x_john_elliott_lebron_10_soldier_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1025917/screenshots/9738550/vans-2020-pixelwolfie-dribbble_2x.png
https://static.dribbble.com/users/1025917/avatars/mini/87fdcb145eab0b47eda29fc873f25f8c.png?1594466719
https://static.dribbble.com/assets/icon-backtotop-1b04df73090f6b0f3192a3b71874ca3b3cc19dff16adc6cf365cd0c75897f6c0.png
https://static.dribbble.com/assets/dribbble-ball-icon-e94956d5f010d19607348176b0ae90def55d61871a43cb4bcb6d771d8d235471.svg
https://static.dribbble.com/assets/icon-shot-x-light-40c073cd65443c99d4ac129b69bf578c8cf97d69b78990c00c4f8c5873b0d601.png
https://static.dribbble.com/assets/icon-shot-prev-light-ca583c76838d54eca11832ebbcaba09ba8b2bf347de2335341d244ecb9734593.png
https://static.dribbble.com/assets/icon-shot-next-light-871a18220c4c5a0325d1353f8e4cc204c3b49beacc63500644556faf25ded617.png
https://static.dribbble.com/assets/dribbble-square-c8c7a278e96146ee5a9b60c3fa9eeba58d2e5063793e2fc5d32366e1b34559d3.png
https://static.dribbble.com/assets/dribbble-ball-192-ec064e49e6f63d9a5fa911518781bee0c90688d052a038f8876ef0824f65eaf2.png
https://static.dribbble.com/assets/icon-overlay-x-2x-b7df2526b4c26d4e8410a7c437c433908be0c7c8c3c3402c3e578af5c50cf5a5.png

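However, I only want to get the URLs that contain the string "screenshots". So I tried to write a function that grabs only the images that have "screenshots" in their URL. For example: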

https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg

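At first, to see whether it would work, I wrote a function to print the specific links I wanted. However, it did not work. Here is the code for my function: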

def art_links():
    images = []
    for img in x:
        images.append(img['src'])
    images = soup2.find_all("screenshots")
    print(images)

Here is my full code:

from bs4 import BeautifulSoup
import requests as rq 
import os 

r2 = rq.get("https://dribbble.com/search/shots/popular/illustration?q=sneaker%20")
soup2 = BeautifulSoup(r2.text, "html.parser")

links = []

x = soup2.select('img[src^="https://static.dribbble.com"]')

for img in x: 
    links.append(img['src'])

def art_links():
    images = []
    for img in x:
        images.append(img['src'])
    images = soup2.find_all("screenshots")
    print(images)
    

os.mkdir('dribblephotos') 


for index, img_link in enumerate(links):
    if "screenshots" in images:
    img_data = r.get(img_link).content
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
        
    else:
        break
art_links()

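【Question discussion】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

I noticed a small syntax problem in the final if statement (the code under the if was not indented), so I reformatted it to do what I think you want. What I believe was happening is that the break in the else branch of the final for loop stopped the loop entirely as soon as one entry did not have "screenshots" in its link, instead of moving on to the next entry. The keyword continue would work there, but simply leaving out the else branch is enough. You were also checking for "screenshots" in images, but the link you are trying to check is declared as img_link in the for loop. Try this for your final for loop and see what you get: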

for index, img_link in enumerate(links):
    if "screenshots" in img_link:
        img_data = rq.get(img_link).content
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
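
To make the difference between the two loop shapes concrete, here is a minimal sketch that runs without any network access. The function names `keep_until_break` and `keep_all_matches` are made up for this illustration, and the sample list reuses three URLs from the question.

```python
def keep_until_break(urls, needle="screenshots"):
    """Mimic the loop from the question: stop at the first URL without the needle."""
    kept = []
    for url in urls:
        if needle in url:
            kept.append(url)
        else:
            break  # ends the whole loop, so later matches are never seen
    return kept


def keep_all_matches(urls, needle="screenshots"):
    """Mimic the loop without an else branch: non-matching URLs are skipped."""
    kept = []
    for url in urls:
        if needle in url:
            kept.append(url)
    return kept


sample = [
    "https://static.dribbble.com/users/458522/screenshots/6040912/nike_air_huarache_1x.jpg",
    "https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767",
    "https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png",
]

print(len(keep_until_break(sample)))  # 1, the avatar URL triggers the break
print(len(keep_all_matches(sample)))  # 2, both screenshot URLs are kept
```

Because screenshot and avatar links alternate in the retrieved list, the version with break would stop after the very first image.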

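If you still need the links rather than the downloaded files, you should be able to collect them while iterating over the images in the for loop, storing each one in a new list if it is a screenshot link.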
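
One possible way to collect just the links is sketched below. It assumes the screenshot images always carry "screenshots" as a whole path segment, as they do in the list from the question. The helper names `is_screenshot_url` and `collect_screenshot_links` are hypothetical, and the example.com URL is invented purely to show a near miss.

```python
from urllib.parse import urlsplit


def is_screenshot_url(url):
    """Return True when 'screenshots' is a full segment of the URL path."""
    # Splitting the path avoids matching file names or query strings
    # that merely contain the word.
    return "screenshots" in urlsplit(url).path.split("/")


def collect_screenshot_links(urls):
    """Build a new list holding only the screenshot links, in original order."""
    return [url for url in urls if is_screenshot_url(url)]


retrieved = [
    "https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png",
    "https://static.dribbble.com/users/105681/avatars/mini/avatar-01-01.png?1377980605",
    "https://example.com/assets/screenshots_icon.png",  # hypothetical near miss
]

print(collect_screenshot_links(retrieved))
```

A plain substring test (`"screenshots" in url`) is shorter and would also work for the links shown in the question; the segment check is only a slightly stricter variant.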

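Update: this latest version worked for me. I removed the separate filtering function once the check was inside the loop, since it was unnecessary when the links were already being looped over twice. The first for loop is all you need. There is no reason to iterate twice, so I do the check during the first iteration and only save a link to the links list if it is wanted.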

from bs4 import BeautifulSoup
import requests as rq
import os

r2 = rq.get("https://dribbble.com/search/shots/popular/illustration?q=sneaker%20")
soup2 = BeautifulSoup(r2.text, "html.parser")

links = []

x = soup2.select('img[src^="https://static.dribbble.com"]')

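# Create the output folder; exist_ok avoids an error when the script is run again
os.makedirs('dribblephotos', exist_ok=True)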

# Only one for loop required, shouldn't iterate twice if not required
for index, img in enumerate(x):
    # Store the current url from the image result
    url = img["src"]
    # Check the url for screenshot before putting in the links
    if "screenshot" in url:
        links.append(img['src'])
        # Download the image
        img_data = rq.get(url).content
        # Put the image into the file
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)

print(links)
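
As an alternative sketch, the filtering can also be pushed into the CSS selector itself. The `find_all("screenshots")` call in the question looks for tags named `<screenshots>`, which do not exist, whereas the attribute selector `[src*="..."]` matches on a substring of the attribute value. The HTML below is a hand-written stand-in for the real page, built from URLs in the question; it is an assumption that the live page exposes the images in plain `src` attributes like this.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the real Dribbble page may be structured differently.
html = """
<img src="https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg">
<img src="https://static.dribbble.com/users/923409/avatars/mini/bc17b2db165c31804e1cbb1d4159462a.jpg?1596192494">
<img src="https://static.dribbble.com/assets/dribbble-square-c8c7a278e96146ee5a9b60c3fa9eeba58d2e5063793e2fc5d32366e1b34559d3.png">
"""

soup = BeautifulSoup(html, "html.parser")

# find_all with a plain string matches tag names, so this finds nothing.
no_such_tags = soup.find_all("screenshots")

# The *= operator selects <img> tags whose src contains the given substring.
screenshot_imgs = soup.select('img[src*="/screenshots/"]')
screenshot_urls = [img["src"] for img in screenshot_imgs]

print(no_such_tags)
print(screenshot_urls)
```

With this selector, only the first of the three stand-in images is returned, so no extra if check is needed in the download loop.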

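【Discussion】:

@BrandonLodwick Thanks for your help, I tried the approach you recommended. However, what gets printed is an empty list []. So how do I change my if statement to check the img_link declared in my for loop?
The files did not download either @BrandonLodwick
Strange, it downloads on my end. I removed the function call after the for loop, let me take another look
I removed the function call too, and it did not download any files, it only created the folder @BrandonLodwick
Ah, nice! Thank you for sticking with me and helping me. I hope you have a great rest of your day. @BrandonLiodwick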