[Question title]: Extracting the src url from img tag using BeautifulSoup
[Posted]: 2020-12-01 13:20:20
[Question description]:

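I'm trying to get the URL part of an img src. I want to extract the following URL: https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX384_BO1,204,203,200_.jpg

What comes back instead is the following, which I think is an encoded image?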

data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEABYWGBQYFBwaFhwYHBocIiceGBwgLjg0JzAlNiwsIjYsJTAlIzIsMDouNjA+TkBJPjpnUERYLkRHelJ8ZoZaUnYBDhoYGiAiGh4eIiIeICciRTAgHlIyNDgiSRQ4Hic2Jyk4HCcuMhwpPClJFj4eFFQ6RzIjRScgHiM2JxowNFY2Ov/AABEIARwA3AMBIgACEQEDEQH....

It runs to more than 600 lines, so I haven't included all of it.

Here is my code:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

url = "https://www.amazon.co.uk/Django-Professionals-Production-websites-Python/dp/1081582162/ref=sr_1_1?dchild=1&keywords=django+for+professionals&qid=1597167266&sr=8-1"
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.content,features="lxml")
product_title = soup.select("#productTitle")[0].get_text().strip()
author = soup.select(".contributorNameID")[0].get_text().strip()

images = soup.findAll('img')
for image in images:
    print (image['src'])

Edit: the other img src values seem to come back as URLs, just not the one I'm specifically targeting.

[Question comments]:

    Tags: url beautifulsoup extract


    [Solution 1]:

    I believe you can do this:

    import base64
    encoded_image = base64.b64decode(image['src'].split(',', 1)[1])  # drop the "data:image/jpeg;base64," prefix before decoding
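
A minimal sketch of what decoding such a src value involves, assuming it is a base64 data URI of the form shown in the question. The helper name `decode_data_uri` and the sample payload are made up for illustration. Note that this yields the bytes of the inline image, not the https URL the question asks for.

```python
import base64
from typing import Tuple


def decode_data_uri(src: str) -> Tuple[str, bytes]:
    """Split a base64 data URI into (media_type, raw_bytes)."""
    if not src.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, _, payload = src[len("data:"):].partition(",")
    if not header.endswith(";base64"):
        raise ValueError("only base64-encoded data URIs are handled here")
    media_type = header[: -len(";base64")]
    return media_type, base64.b64decode(payload)


# Tiny made-up payload standing in for the real (600+ line) src value.
sample_src = "data:image/jpeg;base64," + base64.b64encode(b"\xff\xd8\xff\xe0fake").decode("ascii")
media_type, raw = decode_data_uri(sample_src)
print(media_type, len(raw))
```

The returned bytes could then be written to a file opened in binary mode if the inline image itself is what you need.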
    

    [Comments]:

      [Solution 2]:

      To extract the https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX384_BO1,204,203,200_.jpg image, you can parse the data-a-dynamic-image attribute:

      import json
      import requests
      from bs4 import BeautifulSoup

      headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

      url = "https://www.amazon.co.uk/Django-Professionals-Production-websites-Python/dp/1081582162/ref=sr_1_1?dchild=1&keywords=django+for+professionals&qid=1597167266&sr=8-1"
      resp = requests.get(url, headers=headers)
      soup = BeautifulSoup(resp.content, features="lxml")
      product_title = soup.select("#productTitle")[0].get_text().strip()
      author = soup.select(".contributorNameID")[0].get_text().strip()

      # Select <img> tags whose src is an inline data: URI; s is None when src is missing.
      images = soup.find_all('img', src=lambda s: s and 'data:' in s)
      for image in images:
          # data-a-dynamic-image holds a JSON object keyed by the real image URLs;
          # inline images without the attribute are skipped.
          for img in json.loads(image.get('data-a-dynamic-image', '{}')):
              print(img)
      

      Prints:

      https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX384_BO1,204,203,200_.jpg
      https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX258_BO1,204,203,200_.jpg
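
Since the attribute yields more than one URL, here is a rough sketch of picking a single one. It assumes the attribute value is a JSON object mapping each URL to a [width, height] pair; that shape, the helper name `largest_image_url`, and the dimensions in the sample are assumptions for illustration, not something confirmed above.

```python
import json


def largest_image_url(dynamic_image_json: str) -> str:
    """Return the URL whose [width, height] pair covers the largest area."""
    sizes = json.loads(dynamic_image_json)  # assumed shape: {url: [width, height], ...}
    if not sizes:
        raise ValueError("no image URLs found")
    return max(sizes, key=lambda u: sizes[u][0] * sizes[u][1])


# Hypothetical attribute value; the dimensions are invented for this example.
sample_attr = json.dumps({
    "https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX384_BO1,204,203,200_.jpg": [386, 499],
    "https://images-na.ssl-images-amazon.com/images/I/41YEd80s6SL._SX258_BO1,204,203,200_.jpg": [260, 336],
})
print(largest_image_url(sample_attr))
```

In the loop above, this would be called as largest_image_url(image['data-a-dynamic-image']) instead of printing every key.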
      

      [Comments]:
