【Question title】: Python bs4: Get only the URLs that have a certain string in them
【Posted】: 2020-12-09 05:21:09
【Question description】:

I am building an image scraper and want to grab some of the photos from this link, then save them in a folder named dribblephotos: https://dribbble.com/search/shots/popular/illustration?q=sneaker%20

These are the links I retrieved:

https://static.dribbble.com/users/458522/screenshots/6040912/nike_air_huarache_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png
https://static.dribbble.com/users/105681/avatars/mini/avatar-01-01.png?1377980605
https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg
https://static.dribbble.com/users/923409/avatars/mini/bc17b2db165c31804e1cbb1d4159462a.jpg?1596192494
https://static.dribbble.com/users/458522/screenshots/6034458/nike_air_jordan_i_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1237425/screenshots/5071294/customize_air_jordan_web_2x.png
https://static.dribbble.com/users/1237425/avatars/mini/87ae45ac7a07dd69fe59985dc51c7f0f.jpeg?1524130139
https://static.dribbble.com/users/1174720/screenshots/6187664/adidas_2x.png
https://static.dribbble.com/users/1174720/avatars/mini/9de08da40078e869f1a680d2e43cdb73.png?1588733495
https://static.dribbble.com/users/179617/screenshots/4426819/ultraboost_1x.png
https://static.dribbble.com/users/179617/avatars/mini/2d545dc6c0dffc930a2b20ca3be88802.jpg?1596735027
https://static.dribbble.com/users/458522/screenshots/6126041/nike_air_max_270_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/60266/screenshots/6698826/nike_shoe_2x.jpg
https://static.dribbble.com/users/60266/avatars/mini/64826d925db1d4178258d17d8826842b.png?1549028805
https://static.dribbble.com/users/78464/screenshots/4950025/8x600_1x.jpg
https://static.dribbble.com/users/78464/avatars/mini/a9ae6a559ab479d179e8bd22591e4028.jpg?1465908886
https://static.dribbble.com/users/458522/screenshots/6118702/adidas_nmd_r1_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/458522/screenshots/6098953/nike_lebron_10_je_icon_qs_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/879147/screenshots/7152093/img_0966_2x.png
https://static.dribbble.com/users/879147/avatars/mini/e095f3837f221bb2ef652dcc966b99f7.jpg?1568473177
https://static.dribbble.com/users/458522/screenshots/6128979/nerd_x_adidas_pharrell_hu_nmd_trail_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/879147/screenshots/11064235/26fa4a2d-9033-4953-b48f-4c0e8a93fc9d_2x.png
https://static.dribbble.com/users/879147/avatars/mini/e095f3837f221bb2ef652dcc966b99f7.jpg?1568473177
https://static.dribbble.com/users/458522/screenshots/6132938/nike_moon_racer_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1823684/screenshots/5973495/jordannn1_2x.png
https://static.dribbble.com/users/1823684/avatars/mini/f6041c082aec67302d4b78b8d203f02b.png?1509719582
https://static.dribbble.com/users/552027/screenshots/4666241/airmax270_1x.jpg
https://static.dribbble.com/users/552027/avatars/mini/35bb0dcb5a6619f68816290898bff6cc.jpg?1535884243
https://static.dribbble.com/users/458522/screenshots/6044426/adidas_pharrell_hu_nmd_trail_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/220914/screenshots/11295053/woman_shoe_tree_floating2_2x.png
https://static.dribbble.com/users/220914/avatars/mini/d364a9c166edb6d96cc059a836219a7d.jpg?1590773568
https://static.dribbble.com/users/4040486/screenshots/7079508/___2x.png
https://static.dribbble.com/users/4040486/avatars/mini/f31e9b50df877df815177e2015135ff7.png?1582521697
https://static.dribbble.com/users/57602/screenshots/12909636/d2_2x.png
https://static.dribbble.com/users/57602/avatars/mini/b4c27f3be2c61d82fbc821433d058b04.jpg?1575089000
https://static.dribbble.com/users/458522/screenshots/6049522/nike_x_john_elliott_lebron_10_soldier_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1025917/screenshots/9738550/vans-2020-pixelwolfie-dribbble_2x.png
https://static.dribbble.com/users/1025917/avatars/mini/87fdcb145eab0b47eda29fc873f25f8c.png?1594466719
https://static.dribbble.com/assets/icon-backtotop-1b04df73090f6b0f3192a3b71874ca3b3cc19dff16adc6cf365cd0c75897f6c0.png
https://static.dribbble.com/assets/dribbble-ball-icon-e94956d5f010d19607348176b0ae90def55d61871a43cb4bcb6d771d8d235471.svg
https://static.dribbble.com/assets/icon-shot-x-light-40c073cd65443c99d4ac129b69bf578c8cf97d69b78990c00c4f8c5873b0d601.png
https://static.dribbble.com/assets/icon-shot-prev-light-ca583c76838d54eca11832ebbcaba09ba8b2bf347de2335341d244ecb9734593.png
https://static.dribbble.com/assets/icon-shot-next-light-871a18220c4c5a0325d1353f8e4cc204c3b49beacc63500644556faf25ded617.png
https://static.dribbble.com/assets/dribbble-square-c8c7a278e96146ee5a9b60c3fa9eeba58d2e5063793e2fc5d32366e1b34559d3.png
https://static.dribbble.com/assets/dribbble-ball-192-ec064e49e6f63d9a5fa911518781bee0c90688d052a038f8876ef0824f65eaf2.png
https://static.dribbble.com/assets/icon-overlay-x-2x-b7df2526b4c26d4e8410a7c437c433908be0c7c8c3c3402c3e578af5c50cf5a5.png

However, I only want the URLs that contain the string "screenshots". So I tried to write a function that picks out the images that have "screenshots" in their URL, for example:

https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg

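At first, to see whether it would work, I wrote a function that prints just the links I want. It did not work, though. Here is the code for my function: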

def art_links():
    images = []
    for img in x:
        images.append(img['src'])
    images = soup2.find_all("screenshots")
    print(images)

Here is my full code:

from bs4 import BeautifulSoup
import requests as rq 
import os 

r2 = rq.get("https://dribbble.com/search/shots/popular/illustration?q=sneaker%20")
soup2 = BeautifulSoup(r2.text, "html.parser")

links = []

x = soup2.select('img[src^="https://static.dribbble.com"]')

for img in x: 
    links.append(img['src'])

def art_links():
    images = []
    for img in x:
        images.append(img['src'])
    images = soup2.find_all("screenshots")
    print(images)
    

os.mkdir('dribblephotos') 


for index, img_link in enumerate(links):
    if "screenshots" in images:
    img_data = r.get(img_link).content
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
        
    else:
        break
art_links()

【Comments on the question】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

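    I noticed a small syntax problem in the final if statement (the line under the if was not indented), so I reformatted it a little to try to get it to do what you want. I think what was happening is that the break in the else branch exits the final for loop. As a result, as soon as one entry does not have "screenshots" in its link, the loop stops entirely instead of moving on. The keyword "continue" could be used, but simply leaving out the else branch is enough. You were also checking for "screenshots" in images, while the link you are trying to check is declared as img_link in the for loop. Try this for your final for loop and see what you get: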

    for index, img_link in enumerate(links):
        # Check the URL string itself, not the list of images
        if "screenshots" in img_link:
            img_data = rq.get(img_link).content
            with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
                f.write(img_data)
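
To illustrate why the "else: break" branch in the question stopped the loop early, here is a small self-contained sketch. The three sample URLs are copied from the list in the question; the helper names with_break and without_else are made up for this illustration, and nothing is downloaded.

```python
# Sample links copied from the question: an avatar sits between two screenshots.
links = [
    "https://static.dribbble.com/users/458522/screenshots/6040912/nike_air_huarache_1x.jpg",
    "https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767",
    "https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png",
]


def with_break(urls):
    """Hypothetical helper: mimics the original loop with `else: break`."""
    kept = []
    for url in urls:
        if "screenshots" in url:
            kept.append(url)
        else:
            break  # leaves the loop at the first non-screenshot link
    return kept


def without_else(urls):
    """Hypothetical helper: no else branch, so non-matching links are just skipped."""
    kept = []
    for url in urls:
        if "screenshots" in url:
            kept.append(url)
    return kept


print(len(with_break(links)))    # 1
print(len(without_else(links)))  # 2
```

Because avatar links alternate with screenshot links in the scraped list, the version with break never gets past the first avatar.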
    

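    If you still need the links rather than the downloaded files, you should be able to collect them while iterating over the images in the for loop and store each one in a new list when it is a screenshot link.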
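
A minimal sketch of collecting only the links, assuming the list already holds plain URL strings (the two sample URLs come from the question; the name screenshot_links is made up here). It also shows why testing "screenshots" against the whole list, as the original code did, never matches.

```python
# Two sample links from the question: one screenshot, one avatar.
links = [
    "https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg",
    "https://static.dribbble.com/users/923409/avatars/mini/bc17b2db165c31804e1cbb1d4159462a.jpg?1596192494",
]

# `in` on a list looks for an element equal to "screenshots",
# so this is False even though one URL contains the word.
print("screenshots" in links)  # False

# `in` on a string is a substring test, which is what the filter needs.
screenshot_links = [url for url in links if "screenshots" in url]
print(len(screenshot_links))  # 1
```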

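    Update: this latest version works for me. I removed the function that filtered the links once the check was inside the loop, since it was unnecessary after already looping twice. The first for loop is all you need; there is no reason to iterate twice, so I run the check during that first iteration and append to the links list only the links that are needed.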

    from bs4 import BeautifulSoup
    import requests as rq
    import os
    
    r2 = rq.get("https://dribbble.com/search/shots/popular/illustration?q=sneaker%20")
    soup2 = BeautifulSoup(r2.text, "html.parser")
    
    links = []
    
    x = soup2.select('img[src^="https://static.dribbble.com"]')
    
    # exist_ok=True avoids a FileExistsError when the script is run again
    os.makedirs('dribblephotos', exist_ok=True)
    
    # Only one for loop required, shouldn't iterate twice if not required
    for index, img in enumerate(x):
        # Store the current url from the image result
        url = img["src"]
        # Check the url for screenshot before putting in the links
        if "screenshot" in url:
            links.append(url)
            # Download the image
            img_data = rq.get(url).content
            # Put the image into the file
            with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
                f.write(img_data)
    
    print(links)
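
One detail the code above leaves open: every file is saved with a .jpg suffix, although several of the screenshot URLs in the question end in .png. A possible refinement, sketched with only the standard library; filename_for is a hypothetical helper and not part of the original answer.

```python
import os
from urllib.parse import urlparse


def filename_for(index, url, folder="dribblephotos"):
    """Hypothetical helper: build a numbered file path that keeps the URL's extension."""
    # urlparse separates the query string (e.g. "?1580329767") from the path
    path = urlparse(url).path
    # Fall back to .jpg when the path has no extension at all
    ext = os.path.splitext(path)[1] or ".jpg"
    return folder + "/" + str(index + 1) + ext


print(filename_for(0, "https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png"))
# dribblephotos/1.png
```

Inside the loop, the result could be passed to open() in place of the hard-coded "dribblephotos/" + str(index + 1) + '.jpg' expression.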
    

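    【Comments】:

    • @BrandonLodwick Thanks for your help, I tried what you recommended. However, it prints an empty list []. So how do I change my if statement to check the img_link declared in my for loop?
    • The files did not download either @BrandonLodwick
    • Strange, it downloads on my end. I removed the function call after the for loop; I'll take another look.
    • I also removed the function call and it did not download any files, it only created the folder @BrandonLodwick
    • Ah, nice! Thanks for sticking with me and helping me out. I hope you have a great rest of your day. @BrandonLiodwick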