Beautiful Soup - How can I scrape images that contain a specific src attribute? (Answers)

【Question title】: Beautiful Soup - How can I scrape images that contain a specific src attribute?
【Posted】: 2020-04-20 07:33:05
【Question】:

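I started learning web scraping a few days ago and thought it would be fun to try scraping Mangadex as a mini project. Thanks in advance for any advice!

I am trying to scrape images by extracting the src attribute of img tags, using Beautiful Soup 4 and Python 3.7.

The part of the HTML I am interested in is: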

<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
  <img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>

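Every image I am interested in has a src attribute that starts with "https://s5.mangadex.org/data/", so I thought I could target the images whose src starts with that string.

I tried using select() to find the img elements and then get() to read the src, but had no luck with that particular section of the HTML.

The HTML sections where select() and get() did work are: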

<img class="mx-2" height="38px" src="/images/misc/navbar.svg?3" alt="MangaDex" title="MangaDex">

<img src="/images/misc/miku.jpg" width="100%">

<img class="mx-2" height="38px" src="/images/misc/navbar.svg?3" alt="MangaDex" title="MangaDex">

【Comments】:

Do you have any code?

Tags: python html web-scraping beautifulsoup

【Solution 1】:

Try this:

from bs4 import BeautifulSoup

html = """
      <div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
      <img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
      </div>
      <div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
      <img draggable="false" class="noselect nodrag cursor-pointer" src="https://s4.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
      </div>
       """
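# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')

for n in soup.find_all('img'):
    # Default to '' so an <img> without a src attribute does not raise.
    if n.get('src', '').startswith('https://s5.mangadex.org/data/'):
        print(n.get('src'))

Result: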

https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg
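
Since the question mentions select(), the same prefix filter can also be written as a CSS attribute selector. The following is a minimal sketch, not part of the original answer: the HTML is a shortened sample assembled from the snippets in the question, and the `[src^="..."]` selector matches only elements whose src starts with the given prefix.

```python
from bs4 import BeautifulSoup

# Shortened sample HTML (an assumption for illustration, based on the question).
html = """
<div class="reader-image-wrapper">
  <img src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
<img src="/images/misc/miku.jpg" width="100%">
<img class="no-src-here">
"""

soup = BeautifulSoup(html, "html.parser")

# [src^="..."] keeps only <img> tags whose src attribute starts with the prefix.
prefix = "https://s5.mangadex.org/data/"
matches = soup.select(f'img[src^="{prefix}"]')

# Every match is guaranteed to have a src, so indexing is safe here.
urls = [img["src"] for img in matches]
print(urls)
```

Tags without a src attribute are skipped by the selector itself, so no explicit None check is needed.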

【Discussion】:

【Solution 2】:

attrs lists all the attributes set on a tag. It is a dictionary, so you can look up a specific attribute value as shown below.

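# for getting webpages
import requests
# for beautiful soup
from bs4 import BeautifulSoup

# URL_LINK is a placeholder for the URL of the page to fetch
r = requests.get(URL_LINK)

base_url = 'https://s5.mangadex.org/data/'
bs = BeautifulSoup(r.content, 'html.parser')
imgs = bs.find_all('img')
for img in imgs:
    # attrs is a dict; .get() returns None when the tag has no src
    src = img.attrs.get('src')
    if src is None:
        continue
    if not src.startswith(base_url):
        src = base_url + src
    print(src)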
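
To make the dictionary behaviour of attrs concrete, here is a small sketch (an illustration, not the answerer's code). It also shows one possible way to resolve a relative src such as "/images/misc/navbar.svg?3" against the page it came from, using urljoin from the standard library. The page URL "https://mangadex.org/" is an assumption made for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Two sample tags: one taken from the question, one without a src attribute.
html = '<img class="mx-2" src="/images/misc/navbar.svg?3" alt="MangaDex"><img alt="no src">'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("img")

# attrs is a plain dict of the attributes set on the tag.
print(sorted(first.attrs))          # attribute names on the first tag
print(first.attrs["src"])           # direct lookup; raises KeyError if missing
print(second.get("src"))            # .get() returns None instead of raising

# Assumed page URL (hypothetical); relative paths are resolved against it.
page_url = "https://mangadex.org/"
absolute = urljoin(page_url, first.attrs["src"])
print(absolute)
```

urljoin leaves absolute URLs unchanged, so it can be applied to every src without checking the prefix first.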

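【Discussion】:

【Solution 3】:

You can't scrape Mangadex directly with BeautifulSoup. Mangadex loads its images with JavaScript after the document is ready, so what BeautifulSoup receives is the document before any images have been added. That is why it fails. This site explains how to scrape web pages that rely on JavaScript to serve their content: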

https://towardsdatascience.com/data-science-skills-web-scraping-javascript-using-python-97a29738353f
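
As a rough sketch of that approach (an assumption about one workable setup, not code from the linked article): a browser automation tool such as Selenium can load the page and run its JavaScript, and the rendered HTML can then be passed to BeautifulSoup. The helper names `extract_image_urls` and `render_page` are hypothetical, and `render_page` needs Selenium plus a Chrome installation to run.

```python
from bs4 import BeautifulSoup

PREFIX = "https://s5.mangadex.org/data/"


def extract_image_urls(html, prefix=PREFIX):
    """Return the src of every <img> whose src starts with prefix."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True skips <img> tags that have no src attribute at all.
    return [img["src"] for img in soup.find_all("img", src=True)
            if img["src"].startswith(prefix)]


def render_page(url):
    """Load url in a real browser and return the HTML after scripts have run.

    Hypothetical helper: assumes Selenium and Chrome are installed.
    """
    # Imported here so the parsing helper above works without Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Images added later by scripts may need an explicit wait before this.
        return driver.page_source
    finally:
        driver.quit()


# Static HTML fetched without a browser contains no reader images yet:
print(extract_image_urls('<div id="content"></div>'))

# Intended use (not executed here): extract_image_urls(render_page(chapter_url))
```

Keeping the parsing step separate from the rendering step means the filter can be checked against saved HTML without launching a browser.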

【Discussion】:

FYI, it's scrape (and scraping, scraped, scraper), not scrap.