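【Question title】: Issue requesting when scraping images from google using 'src' tag, how to scrape images from google using beautiful soup?
【Posted】: 2021-10-16 16:09:03
【Question description】:

I'm a Python novice at best. I've been trying to write a function that downloads a given number of images from a Google Images search into a specific folder in Google Drive. I've hit a problem I can't solve; could someone point out where I'm going wrong, or point me in the right direction? I believe the problem is im = requests.get(link) (line 36). So far I have the following: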

# mount the drive
from google.colab import drive
drive.mount('/content/gdrive')


#module import
import requests
from bs4 import BeautifulSoup


#define parameters of search
query = input("Images of:") 
print("Number of images:")
NumberOfImages = int(input())
FolderLocation = input("Input Folder Location:")
image_type="ActiOn"
query= query.split()
query='+'.join(query)
url="https://www.google.co.in/search?q="+query+"&source=lnms&tbm=isch"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# soup
request = requests.get(url,headers=headers)
soup = BeautifulSoup(request.text,'html.parser')
images = soup.find_all('img')

# loop to save
tik = 0
for image in images:
  if tik <= NumberOfImages:
    link = image['src']
    name = query+"_"+str(tik)
    print(link, name)
    with open(FolderLocation+"/"+name+".jpg",'wb') as f:
      im = requests.get(link)
      f.write(im.content)
      print("Writing "+name+ " to file")
    tik +=1
  else:
    break

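Is this a problem with requesting the 'src' links from Google, or am I missing something else?

Any help would be greatly appreciated. Thanks.

【Question discussion】:

    Tags: beautifulsoup python-requests google-colaboratory google-image-search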


    【Solution 1】:

    To scrape full-resolution image URLs with requests and beautifulsoup, you need to extract the data from the page source (CTRL+U) with regex.
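As a side note on the question's code: one plausible reason requests.get(link) fails is that not every 'src' value on the results page is an http(s) URL. Some may be data: URIs or relative paths, which requests cannot fetch. The sketch below is only an illustration of how such values could be filtered out; is_http_url and the sample values are hypothetical and not taken from a real Google response.

```python
from urllib.parse import urlparse


def is_http_url(src):
    """Return True only for absolute http(s) URLs (hypothetical helper)."""
    if not src:
        return False
    parsed = urlparse(src)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# Made-up examples of the kinds of 'src' values an <img> tag might carry.
sample_src_values = [
    "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
    "/images/logo.png",
    "https://encrypted-tbn0.gstatic.com/images?q=tbn:EXAMPLE",
]

downloadable = [src for src in sample_src_values if is_http_url(src)]
print(downloadable)
```

In the question's loop, reading the attribute with image.get('src') instead of image['src'] would also avoid a KeyError on <img> tags that have no 'src' at all. Even then, these values are thumbnails at best, which is why the steps below read the page source instead.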

    Find all <script> tags:

    all_script_tags = soup.select('script')
    

    Match the image data inside the <script> tags with regex:

    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
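To see what this pattern captures, here is a small self-contained sketch that runs it against a made-up page source. The page_source string is an assumption standing in for str(all_script_tags); real Google markup is much larger and may be structured differently.

```python
import re

# Made-up stand-in for str(all_script_tags); not real Google output.
page_source = (
    "<script>var unrelated = 1;</script>"
    "<script>AF_initDataCallback({key: 'ds:1', data: [\"a\", 1]});</script>"
)

# Same pattern as in the answer: capture everything between
# "AF_initDataCallback(" and the closing ");" without crossing a "<".
AF_INIT_PATTERN = re.compile(r"AF_initDataCallback\(([^<]+)\);")

payloads = AF_INIT_PATTERN.findall(page_source)
matched_images_data = "".join(payloads)
print(matched_images_data)
```

Because the character class excludes "<", the capture cannot run past the end of its own <script> tag, so script tags without an AF_initDataCallback call are simply skipped.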
    

    Match the desired (full-resolution) images with regex:

    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)
    
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                        matched_images_data_json)
    

    Extract and decode them with bytes() and decode(), and use list slicing [:20] to choose how many elements to extract (here, the first 20 images):

    for fixed_full_res_image in matched_google_full_resolution_images[:20]:
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
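The double decode can be checked in isolation. The sketch below assumes the URL arrives with doubled backslashes (for example \\u003d for "="), which matches the behaviour described in this answer; decode_escaped_url and the sample URL are hypothetical.

```python
def decode_escaped_url(raw, passes=2):
    """Apply 'unicode-escape' decoding repeatedly (hypothetical helper).

    Note: bytes(..., 'ascii') raises UnicodeEncodeError if the text
    contains non-ASCII characters.
    """
    decoded = raw
    for _ in range(passes):
        decoded = bytes(decoded, "ascii").decode("unicode-escape")
    return decoded


# Made-up URL with doubly escaped "=" (u003d) and "&" (u0026).
raw_url = r"https://example.com/images?q\\u003dtbn:abc\\u0026usqp\\u003dCAU"

once = decode_escaped_url(raw_url, passes=1)   # escapes still present
twice = decode_escaped_url(raw_url, passes=2)  # fully decoded
print(once)
print(twice)
```

After one pass the string still contains \u003d and \u0026 as literal text; the second pass turns them into "=" and "&".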
    

    The code that also downloads the images, and the full example in the online IDE:

    import requests, lxml, re, json, urllib.request
    from bs4 import BeautifulSoup
    
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    
    params = {
        "q": "pexels cat",
        "tbm": "isch", 
        "hl": "en",
        "ijn": "0",
    }
    
    html = requests.get("https://www.google.com/search", params=params, headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    
    
    def get_images_data():
    
        print('\nGoogle Images Metadata:')
        for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
            title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
            source = google_image.select_one('.fxgdke').text
            link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
            print(f'{title}\n{source}\n{link}\n')
    
        # these steps could be refactored into something more compact
        all_script_tags = soup.select('script')
    
        # https://regex101.com/r/48UZhY/4
        matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
        
        # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
        # if you try to json.loads() without json.dumps it will throw an error:
        # "Expecting property name enclosed in double quotes"
        matched_images_data_fix = json.dumps(matched_images_data)
        matched_images_data_json = json.loads(matched_images_data_fix)
    
        # https://regex101.com/r/pdZOnW/3
        matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)
    
        # https://regex101.com/r/NnRg27/1
        matched_google_images_thumbnails = ', '.join(
            re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                       str(matched_google_image_data))).split(', ')
    
        print('Google Image Thumbnails:')  # in order
        for fixed_google_image_thumbnail in matched_google_images_thumbnails:
            # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
            google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
    
            # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
            google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
            print(google_image_thumbnail)
    
        # removing previously matched thumbnails for easier full resolution image matches.
        removed_matched_google_images_thumbnails = re.sub(
            r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))
    
        # https://regex101.com/r/fXjfb1/4
        # https://stackoverflow.com/a/19821774/15164646
        matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                           removed_matched_google_images_thumbnails)
    
        print('\nFull Resolution Images:')  # in order
        for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
            # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
            original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
            original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
            print(original_size_img)
    
            # ------------------------------------------------
            # Download original images
    
            # print(f'Downloading {index} image...')
            
            # opener=urllib.request.build_opener()
            # opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
            # urllib.request.install_opener(opener)
    
            # urllib.request.urlretrieve(original_size_img, f'Bs4_Images/original_size_img_{index}.jpg')
    
    
    get_images_data()
    
    
    -------------
    '''
    Google Images Metadata:
    9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
    pexels.com
    https://www.pexels.com/search/cat/
    other results ...
    
    Google Image Thumbnails:
    https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
    other results ...
    
    Full Resolution Images:
    https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
    other results ...
    '''
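The download step is commented out in the example above. As a rough sketch of how it could be done with requests instead of urllib, the code below saves a limited number of URLs to a folder and skips any that fail. download_images and fetch_with_requests are hypothetical names, and the file naming simply mirrors the commented-out urlretrieve call.

```python
import os


def fetch_with_requests(url):
    """Fetch raw bytes for one URL (hypothetical helper, needs requests)."""
    import requests  # imported here so the rest of the sketch runs without it

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # treat HTTP errors as failures
    return response.content


def download_images(urls, folder, fetch, limit=20):
    """Save up to `limit` images into `folder`; return the paths written."""
    os.makedirs(folder, exist_ok=True)
    saved_paths = []
    for index, url in enumerate(list(urls)[:limit]):
        try:
            data = fetch(url)
        except Exception as error:  # skip this image, keep going
            print(f"Skipping {url}: {error}")
            continue
        path = os.path.join(folder, f"original_size_img_{index}.jpg")
        with open(path, "wb") as image_file:
            image_file.write(data)
        saved_paths.append(path)
    return saved_paths
```

It could be called as download_images(decoded_urls, "Bs4_Images", fetch_with_requests, limit=20), where decoded_urls is the list of decoded full-resolution URLs. Slicing with [:limit] also avoids the off-by-one in the question's loop, where tik <= NumberOfImages saves one image more than requested.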
    

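    Alternatively, you can use the Google Images API from SerpApi to achieve the same thing. It's a paid API with a free plan.

    The difference is that you don't have to deal with regex, bypass blocks from Google, or maintain the code over time when something breaks. Instead, you only need to iterate over structured JSON and get the data you want.

    Example code to integrate: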

    import os, json # json for pretty output
    from serpapi import GoogleSearch
    
    def get_google_images():
        params = {
          "api_key": os.getenv("API_KEY"),
          "engine": "google",
          "q": "pexels cat",
          "tbm": "isch"
        }
    
        search = GoogleSearch(params)
        results = search.get_dict()
    
        print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
    
    
    get_google_images()
    
    ---------------
    '''
    [
    ... # other images 
      {
        "position": 100, # img number
        "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
        "source": "pexels.com",
        "title": "Close-up of Cat · Free Stock Photo",
        "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
        "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
        "is_product": false
      }
    ]
    '''
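Iterating over that structured JSON can be as small as the sketch below. It assumes the response has the shape shown in the output above (an 'images_results' list whose items carry an 'original' key); extract_original_urls is a hypothetical helper and the sample dictionary is trimmed from that output.

```python
def extract_original_urls(results, limit=None):
    """Collect the 'original' image URLs from a results dict (hypothetical helper)."""
    urls = []
    for item in results.get("images_results", []):
        original = item.get("original")
        if original:  # skip entries without a full-size URL
            urls.append(original)
    return urls if limit is None else urls[:limit]


# Trimmed sample shaped like the output shown above.
sample_results = {
    "images_results": [
        {
            "position": 100,
            "source": "pexels.com",
            "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
        },
        {"position": 101, "source": "example.com"},  # no 'original' key
    ]
}

original_urls = extract_original_urls(sample_results, limit=20)
print(original_urls)
```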
    

    P.S. I wrote more in-depth blog posts on how to scrape Google Images and how to reduce the chance of being blocked while web scraping search engines.

    Disclaimer: I work for SerpApi.

    【Discussion】:
