【Question title】:Python requests scrape image returns src in format "data:image/"
【Posted】:2021-01-20 17:52:51
【Question】:

I am trying to scrape the first image from Google image search results, because I don't want to do this manually for 100 keywords.

Using this code:

from bs4 import BeautifulSoup
import requests
import json


query="koko"
url = "https://www.google.com/search?q=" + str(query) + "&source=lnms&tbm=isch"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}


html = requests.get(url, headers=headers).text

soup = BeautifulSoup(html, 'html.parser')
images = soup.findAll("img")

images[0]
# <img alt="Koko, the gorilla who knew sign language, dies at 46 - Chicago Tribune" class="rg_i Q4LuWd" data-deferred="1" data-iid="0" height="157" jsname="Q4LuWd" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="200"/>

The returned src is in this format, which I believe is base64 and not what I want; I want a normal image link.
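
To see what that src actually holds, the data URI can be decoded locally. The following is a minimal sketch using only the standard library (the variable names are illustrative). It suggests the value is a 1x1 GIF placeholder rather than an encoded copy of the picture, presumably swapped for the real thumbnail by JavaScript after the page loads.

```python
import base64

# The src value copied from the <img> tag shown above.
src = "data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="

# A data URI has the form "data:<media type>;base64,<payload>".
header, payload = src.split(",", 1)
raw = base64.b64decode(payload)

# A GIF file starts with a 6-byte signature, followed by the width and
# height stored as 16-bit little-endian integers.
signature = raw[:6]
width = int.from_bytes(raw[6:8], "little")
height = int.from_bytes(raw[8:10], "little")

print(header)
print(signature, width, height)
```

Decoding the placeholder therefore cannot yield the search result image; the real URL has to come from somewhere else in the response.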

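If I disable JavaScript in my Chrome browser, navigate to https://www.google.com/search?q=koko&source=lnms&tbm=isch and view the source, the src of the returned img is in the normal format I need.

I cannot get requests to return the same HTML as Chrome with JavaScript disabled.

I tried changing my User-Agent and matching it to the one Chrome sends, but that does not change the result.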

【Comments】:

    Tags: python beautifulsoup python-requests


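    【Solution 1】:

    To get all the images, set the content-type header. (Most likely this works because the request no longer sends a browser User-Agent, so Google serves its basic, no-JavaScript page; a Content-Type header on a GET request has no body to describe.)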

    from bs4 import BeautifulSoup
    import requests
    
    
    query = "koko"
    url = "https://www.google.com/search?q=" + str(query) + "&source=lnms&tbm=isch"
    
    HEADERS = {"content-type": "image/png"}
    
    html = requests.get(url, headers=HEADERS).text
    
    soup = BeautifulSoup(html, "html.parser")
    
    for img in soup.find_all("img"):
        print(img["src"])
    

    Output:

    /images/branding/searchlogo/1x/googlelogo_desk_heirloom_color_150x55dp.gif
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSYjINUgXtYyrUB4fKyaVxXCAkSyc_Q5b0QaeohUxmjdiIQwS_9CPXgWCXrUGQ&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR1UnMwOo_8tpFkm04yby_I0HdMbfh6-GnhVWnKhOF1qnSP4ogODEn3AAo7V0M&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSHwKA_l2i20z0yeGMr_imQcB-tffAfL0xcQAKmbFn1-NtVrHn8AtTv9aql2Q&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR6UNEOYT2BwMVrjXo8WW6CS0rUHC0QLIqA-GdO1CLGk7mxw8lhWgMyI-uW4A&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT0dQsIKidzCcvdpvL0FDIfZ4Q3WL8GUKCCbwnK4V7FJ6nCGDVNbFmhnD7eOJ8&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSeeYoW5maZW69VamkrN_vzjQoxIQl-RFrcZK58rCry1ZDpyIT6FVaG1IFsKw&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS3wy4vKh6ey8SAZHRxe-sKa1LEiBBdk6cbjELSGkoQn1YINb_YZSRanpOzR38&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSnw0tBokCloEzt0QDpnTVvJYJr1ZDngx7Znz6nLCbjZbq2Vn3g57iEUKordQ&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTZsq7Dy3-bT8miOPD_GE8_1X3isDl67A1ucNauliVlV4dIWgqleLY1OFyLjw&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQsjLjEmJ2kFrdoiU1O0CE_d2bazVxl4IPaHJy2Ea_PhI-B0_4jXcDcuLo2PQ&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRlvVOp05edZGkjz6q3QN8vqPsC-h-lIRlFyU16wYefNRG3zVlFQ2XeJRH3mMU&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT1QKRBEW1WOZs-bS15vTjzYutHLYNIis6Ji60bcJ_mXvA1tYjYYrD-Nk9cWMc&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQiqRS7ry4rNx8VNA4F6TUmm_ZaTtcp4iXokZF_WT-M7zEkF9YG7PpWKpPhSg&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSNtY7c7Qg-w9wXmKfhSHrop5b4tb2wCQoK5pLj_RA1eCPXAn4TNNtEVA8RG_U&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTH8zHuDssfuFW0PUpqNnQoG0yTkebQ194uy7auEzzodGuSAYqsF8flYTW3VAE&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDowATaKwsMkiN1aQj9e6J2VfMUm6742KW3ifxqddk4UHWSX-WOWDeTDSi_w&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTaCZKWiYg2tEUNerLa1zcmUD25-ZVC0RCDY1E1iby3PnHIJOY7cFhTZd8Em8M&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTfg8euHcq0wcUrtIHxleulXlTzbuehiZBb1DgJTEs3GdiG5l5bTdRt0Ug-Qg&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRPbAOCCA3diC-W5CtqbmpegeWPw-ReQPxBDaHN2YPH6OIqWC16dj5uNbhXhw&s
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSnrICqNqL_KG42rZ2_B7nKdZr-INrqsdZqfzeAbFrJYsBez0GDvKtIrwJjP5U&s
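
Since the first entry in that output is the Google logo, and other responses may contain data: placeholders, some filtering is needed before picking "the first image". Here is a minimal sketch of one way to do it. It uses the standard library's html.parser instead of BeautifulSoup only to stay dependency-free; `ImgSrcCollector`, `usable_image_urls` and the sample HTML are hypothetical names and data, not part of the answer above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def usable_image_urls(html, base_url="https://www.google.com"):
    """Return absolute image URLs, skipping inline data: placeholders."""
    collector = ImgSrcCollector()
    collector.feed(html)
    urls = []
    for src in collector.sources:
        if src.startswith("data:"):
            continue  # inline placeholder, not a fetchable link
        urls.append(urljoin(base_url, src))  # resolves relative paths
    return urls


# Hypothetical HTML shaped like the results above: logo, placeholder, thumbnail.
sample_html = """
<img src="/images/branding/searchlogo/1x/googlelogo_desk_heirloom_color_150x55dp.gif">
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:EXAMPLE">
"""

urls = usable_image_urls(sample_html)
# Keep only thumbnail hosts so the logo is not mistaken for a result.
thumbnails = [u for u in urls if "gstatic.com" in u]
print(thumbnails[0])
```

For the 100-keyword case in the question, the same function could be applied to each response and the first thumbnail kept.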
    

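    【Comments】:

    • Nice use of headers!
    【Solution 2】:

    A complementary answer to MendelG's. While MendelG showed a great use of headers, that approach won't work if you want to extract 100 images. Try adding the ijn=0 (batch of 100 images) URL parameter to the request, and you'll see that only this piece of HTML is returned: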

    <div id="gsamd" style="display:none">[]</div><div data-piv="1" id="is_gsa_flags" style="display:none"></div>
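
To make the ijn parameter concrete, here is a small sketch of how the request URL could be assembled. The parameter names (q, tbm, hl, ijn) are the ones used in this answer; `build_image_search_url` is a hypothetical helper, and treating ijn as a zero-based batch index is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/search"


def build_image_search_url(query, batch=0, language="en"):
    """Build a Google Images search URL for one batch of results."""
    params = {
        "q": query,           # search query
        "tbm": "isch",        # image search
        "hl": language,       # interface language
        "ijn": str(batch),    # batch number (assumed to start at 0)
    }
    # urlencode takes care of spaces and special characters in the query.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_image_search_url("pexels cat")
print(url)
```

Passing the same dictionary as `params=` to `requests.get` produces an equivalent URL, which is what the full code in this answer does.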
    

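    Besides, the headers approach only returns thumbnail URLs and image metadata (title, link). The titles would be the same, while the URLs would be different.

    To extract more images, you need to find all <script> tags, then use regex to match, extract and decode all the URLs.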
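
Before the full code, the match-and-decode step can be sketched in isolation. The fragment below is hypothetical data shaped like what the regexes in this answer target (real Google markup differs in detail and changes over time), and `decode_escapes` is an illustrative helper name.

```python
import re

# Hypothetical script content: the URL contains literal \u003d and \u0026
# sequences instead of "=" and "&".
script_text = (
    'AF_initDataCallback({data:[["https://encrypted-tbn0.gstatic.com/images?'
    'q\\u003dtbn:EXAMPLE\\u0026usqp\\u003dCAU",183,275]]});'
)

THUMBNAIL_RE = re.compile(
    r'\["(https://encrypted-tbn0\.gstatic\.com/images\?.*?)",\d+,\d+\]'
)


def decode_escapes(text, passes=1):
    """Turn literal \\uXXXX sequences into characters (ASCII input only)."""
    for _ in range(passes):
        text = bytes(text, "ascii").decode("unicode-escape")
    return text


escaped = THUMBNAIL_RE.findall(script_text)
thumbnails = [decode_escapes(url) for url in escaped]
print(thumbnails[0])

# str() on a list shows the repr of each item, which doubles every backslash,
# so text recovered from that representation needs a second decoding pass.
doubled = str(escaped)[2:-2]
print(decode_escapes(doubled, passes=2) == thumbnails[0])
```

This may explain why the full code decodes every URL twice: it runs its regexes on `str()` of a list of matches. Note that `bytes(text, "ascii")` raises an error if the text contains non-ASCII characters.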

    Code to scrape thumbnails and original-size images (a full example is available in the online IDE; try reading it slowly):

    import requests, lxml, re, json
    from bs4 import BeautifulSoup
    
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    
    params = {
        "q": "pexels cat",
        "tbm": "isch", 
        "hl": "en",
        "ijn": "0",
    }
    
    html = requests.get("https://www.google.com/search", params=params, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    
    
    def get_images_data():
    
        # these steps could be refactored into something more compact
        all_script_tags = soup.select('script')
    
        # https://regex101.com/r/48UZhY/4
        matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
        
        # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
        # calling json.loads() directly on the matched text raises:
        # "Expecting property name enclosed in double quotes".
        # Note: dumps() followed by loads() returns the same string unchanged.
        matched_images_data_fix = json.dumps(matched_images_data)
        matched_images_data_json = json.loads(matched_images_data_fix)
    
        # https://regex101.com/r/pdZOnW/3
        matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)
    
        # https://regex101.com/r/NnRg27/1
        matched_google_images_thumbnails = ', '.join(
            re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                       str(matched_google_image_data))).split(', ')
    
        print('Google Image Thumbnails:')  # in order
        for fixed_google_image_thumbnail in matched_google_images_thumbnails:
            # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
            google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
    
            # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
            google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
            print(google_image_thumbnail)
    
        # removing previously matched thumbnails for easier full resolution image matches.
        removed_matched_google_images_thumbnails = re.sub(
            r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))
    
        # https://regex101.com/r/fXjfb1/4
        # https://stackoverflow.com/a/19821774/15164646
        matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                           removed_matched_google_images_thumbnails)
    
        print('\nGoogle Full Resolution Images:')  # in order
        for fixed_full_res_image in matched_google_full_resolution_images:
            # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
            original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
            original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
            print(original_size_img)
    
    get_images_data()
    
    --------------
    '''
    Google Image Thumbnails:
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSb48h3zks_bf6y7HnZGyGPn3s2TAHKKm_7kzxufi5nzbouJcQderHqoEoOZ4SpOuPDjfw&usqp=CAU
    ...
    
    Google Full Resolution Images:
    https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
    ...
    '''
    

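    Alternatively, you can use the Google Images API from SerpApi to achieve this. It's a paid API with a free plan. Check out the playground to see what the output looks like.

    The main difference is that you don't have to use regex or anything else to extract data from the HTML, or to get around blocks from Google (if they appear). All you need to do is iterate over the structured JSON and pick the data you want.

    Example code: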

    import os, json # json for pretty output
    from serpapi import GoogleSearch
    
    
    def get_google_images():
        params = {
          "api_key": os.getenv("API_KEY"), # environment variable
          "engine": "google",
          "q": "pexels cat",
          "tbm": "isch"
        }
    
        search = GoogleSearch(params)
        results = search.get_dict()
    
        print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
    
    get_google_images()
    
    ----------
    '''
    ...
      {
        "position": 60, # img number
        "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRt-tXSZMBNLLX8MhavbBNkKmjJ7wNXxtdr5Q&usqp=CAU",
        "source": "pexels.com",
        "title": "1,000+ Best Cats Videos · 100% Free Download · Pexels Stock Videos",
        "link": "https://www.pexels.com/search/videos/cats/",
        "original": "https://images.pexels.com/videos/855282/free-video-855282.jpg?auto=compress&cs=tinysrgb&dpr=1&w=500",
        "is_product": false
      }
    ...
    '''
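
To tie this back to the original goal (one image per keyword), here is a short sketch of picking a single URL out of such a response. The sample is a trimmed copy of the entry shown above; `first_original_url` is a hypothetical helper, and falling back to the thumbnail when no "original" key is present is an assumption about what a caller would want.

```python
import json

# Trimmed sample shaped like results["images_results"] in the output above.
sample_response = json.loads("""
{
  "images_results": [
    {
      "position": 60,
      "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRt-tXSZMBNLLX8MhavbBNkKmjJ7wNXxtdr5Q&usqp=CAU",
      "original": "https://images.pexels.com/videos/855282/free-video-855282.jpg?auto=compress&cs=tinysrgb&dpr=1&w=500"
    }
  ]
}
""")


def first_original_url(results):
    """Return the full-size URL of the lowest-position result, or None."""
    images = results.get("images_results", [])
    if not images:
        return None
    # Entries carry a "position" field; take the smallest one.
    best = min(images, key=lambda item: item.get("position", float("inf")))
    return best.get("original") or best.get("thumbnail")


print(first_original_url(sample_response))
```

Calling this once per keyword on the dictionary returned by `search.get_dict()` would give one URL per query.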
    

    See what the Content-Type (media type) header does, and read more about base64 and data URLs.

    P.S. I wrote a more in-depth blog post with a visual walkthrough of how to scrape Google Images.

    Disclaimer: I work for SerpApi.
