【Question title】: Get image width from HTML code
【Posted】: 2017-01-25 04:04:06
【Question】:

I can get the width attribute of an image with BeautifulSoup like this:

img = soup.find("img")
width = img["width"]

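The problem is that width may be set in a CSS file, or not set at all.

Without downloading the image from img["src"], I would like to extract the value if it is set somewhere (HTML or CSS), or otherwise get the default value the browser would render. Is that possible?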

【Comments】:

Tags: python selenium web-scraping beautifulsoup phantomjs


【Solution 1】:

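The image can be downloaded partially, just enough to read the width and height, by setting Range in the request headers and using some variant of getimageinfo.py.

Example usage: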

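import requests
import getimageinfo  # the getimageinfo.py variant listed below

def check_is_small_pic(url, pic_size):
    # A Range value must use the "bytes=first-last" form; a bare number is not valid.
    # The first 5000 bytes are usually enough to reach the size fields.
    r_check = requests.get(url, headers={"Range": "bytes=0-4999"})
    image_info = getimageinfo.getImageInfo(r_check.content)
    # image_info is (content_type, width, height); -1 means the size was not found
    return image_info[1] < pic_size or image_info[2] < pic_size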
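
A partial download only saves bandwidth if the Range header is well formed and the server honours it. The sketch below uses only the standard library instead of requests, and shows one way to build the header and to cap the read when the server ignores it. The helper names `range_header` and `fetch_head` are hypothetical and not part of getimageinfo.py.

```python
import urllib.request


def range_header(max_bytes):
    """Build a Range header value asking for the first max_bytes bytes."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    # Byte ranges are inclusive, so the first N bytes are 0..N-1.
    return "bytes=0-{}".format(max_bytes - 1)


def fetch_head(url, max_bytes=5000, timeout=10):
    """Return at most max_bytes bytes from the start of the resource at url."""
    req = urllib.request.Request(url, headers={"Range": range_header(max_bytes)})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Status 206 means the range was honoured. Status 200 means the server
        # ignored the header and is sending the whole file, so the read is
        # capped here either way.
        return resp.read(max_bytes)
```

The bytes returned by `fetch_head(url)` can then be passed to a header parser such as getImageInfo.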

A getimageinfo.py variant, quickly adapted for Python 3.5:

import io
import struct
# import urllib.request as urllib2

def getImageInfo(data):
    size = len(data)
    height = -1
    width = -1
    content_type = ''

    # handle GIFs
    if (size >= 10) and data[:6] in (b'GIF87a', b'GIF89a'):
        # Check to see if content_type is correct
        content_type = 'image/gif'
        w, h = struct.unpack(b"<HH", data[6:10])
        width = int(w)
        height = int(h)

    # See PNG 2. Edition spec (http://www.w3.org/TR/PNG/)
    # Bytes 0-7 are below, 4-byte chunk length, then 'IHDR'
    # and finally the 4-byte width, height
    elif ((size >= 24) and data.startswith(b'\211PNG\r\n\032\n')
          and (data[12:16] == b'IHDR')):
        content_type = 'image/png'
        w, h = struct.unpack(b">LL", data[16:24])
        width = int(w)
        height = int(h)

    # Maybe this is for an older PNG version.
    elif (size >= 16) and data.startswith(b'\211PNG\r\n\032\n'):
        # Check to see if we have the right content type
        content_type = 'image/png'
        w, h = struct.unpack(b">LL", data[8:16])
        width = int(w)
        height = int(h)

    # handle JPEGs
    elif (size >= 2) and data.startswith(b'\377\330'):
        content_type = 'image/jpeg'
        jpeg = io.BytesIO(data)
        jpeg.read(2)
        b = jpeg.read(1)
        try:
            while (b and ord(b) != 0xDA):
                while (ord(b) != 0xFF): b = jpeg.read(1)
                while (ord(b) == 0xFF): b = jpeg.read(1)
                if (ord(b) >= 0xC0 and ord(b) <= 0xC3):
                    jpeg.read(3)
                    h, w = struct.unpack(b">HH", jpeg.read(4))
                    # Assign only once a start-of-frame header has been parsed,
                    # so w and h are never read while unbound
                    width = int(w)
                    height = int(h)
                    break
                else:
                    jpeg.read(int(struct.unpack(b">H", jpeg.read(2))[0])-2)
                b = jpeg.read(1)
        except (struct.error, ValueError, TypeError):
            # Truncated data: ord(b'') raises TypeError, a short read raises struct.error
            pass

    return content_type, width, height



# from PIL import Image
# import requests
# hrefs = ['http://farm4.staticflickr.com/3894/15008518202_b016d7d289_m.jpg','https://farm4.staticflickr.com/3920/15008465772_383e697089_m.jpg','https://farm4.staticflickr.com/3902/14985871946_86abb8c56f_m.jpg']
# RANGE = 5000
# for href in hrefs:
#     req  = requests.get(href,headers={'User-Agent':'Mozilla5.0(Google spider)','Range':'bytes=0-{}'.format(RANGE)})
#     im = getImageInfo(req.content)
# 
#     print(im)
# req = urllib2.Request("http://vn-sharing.net/forum/images/smilies/onion/ngai.gif", headers={"Range": "5000"})
# r = urllib2.urlopen(req)
# 
# f = open("D:\\Pictures\\1.jpg", "rb")
# print(getImageInfo(r.read()))
# Output: >> ('image/gif', 50, 50)
# print(getImageInfo(f.read()))
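
The reason a few bytes are enough for GIF and PNG is that both formats store the dimensions at fixed offsets near the start of the file, whereas JPEG needs a scan through its segments. As a small self-contained sketch (the helpers `png_size` and `gif_size` are illustrative and separate from getImageInfo), hand-built headers can be parsed like this:

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # same bytes as b'\211PNG\r\n\032\n'


def png_size(data):
    """Return (width, height) from the first 24 bytes of a PNG, else None."""
    if len(data) >= 24 and data[:8] == PNG_SIGNATURE and data[12:16] == b"IHDR":
        # Big-endian 32-bit width and height follow the 'IHDR' chunk type.
        return struct.unpack(">LL", data[16:24])
    return None


def gif_size(data):
    """Return (width, height) from the first 10 bytes of a GIF, else None."""
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        # Little-endian 16-bit width and height follow the 6-byte signature.
        return struct.unpack("<HH", data[6:10])
    return None


# Minimal headers built by hand: signature, chunk length, chunk type, size fields.
png_header = PNG_SIGNATURE + struct.pack(">L", 13) + b"IHDR" + struct.pack(">LL", 640, 480)
gif_header = b"GIF89a" + struct.pack("<HH", 50, 50)

print(png_size(png_header))  # (640, 480)
print(gif_size(gif_header))  # (50, 50)
```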

【Comments】:

    【Solution 2】:

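    The short answer is: you can't. The final size of the image depends on evaluating the CSS and, in practice, the JS as well. You would have to do all of that work to find the answer.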
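
To illustrate where static parsing stops: a width given as an HTML attribute or in an inline style can be read without a browser, but a width coming from an external stylesheet or from a script cannot. The following is a minimal standard-library sketch under that assumption. The names `ImgWidthParser` and `static_widths` are hypothetical, and only plain pixel values are handled.

```python
import re
from html.parser import HTMLParser

# Matches a pixel width such as "width: 120px" in an inline style.
# The lookbehind skips properties like "max-width" or "border-width".
_WIDTH_RE = re.compile(r"(?<![\w-])width\s*:\s*(\d+(?:\.\d+)?)px", re.IGNORECASE)


class ImgWidthParser(HTMLParser):
    """Collects (src, width) for every <img>; width is None if not found statically."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        width = None
        # An inline style takes precedence over the width attribute.
        match = _WIDTH_RE.search(attr.get("style") or "")
        if match:
            width = float(match.group(1))
        else:
            raw = (attr.get("width") or "").strip()
            if raw.isdigit():
                width = float(raw)
        self.images.append((attr.get("src"), width))


def static_widths(html):
    """Return a list of (src, width) pairs for the <img> tags in html."""
    parser = ImgWidthParser()
    parser.feed(html)
    return parser.images
```

An image styled only through a class, for example `<img src="c.png" class="hero">`, comes back with a width of None here, which is exactly the case that needs CSS and JS evaluation.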

    An alternative might be to let a real browser do the work for you and then ask it what the width is. See PhantomJS and Selenium.
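
A rough sketch of the browser approach, assuming Selenium with headless Chrome (an assumption made here, since the answer only names PhantomJS and Selenium). The helper `rendered_width`, the `demo` function, the selector and the URL are illustrative placeholders. The helper relies only on the driver's `execute_script` method and asks the page for the element's rendered width in CSS pixels.

```python
def rendered_width(driver, css_selector):
    """Return the rendered width of the first matching element, or None."""
    script = (
        "const el = document.querySelector(arguments[0]);"
        "return el ? el.getBoundingClientRect().width : null;"
    )
    return driver.execute_script(script, css_selector)


def demo():
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/")  # placeholder URL
        print(rendered_width(driver, "img"))
    finally:
        driver.quit()
```

Calling `demo()` needs a local Chrome installation. Note that the browser still fetches the image while loading the page, so this answers the "what would the browser render" part of the question rather than avoiding the download.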

    【Comments】:

    • It would be a good idea to add an example of how to use what you recommend.