【Question title】: Not quite understanding how to perform a request to Google's servers with Python requests
【Posted】: 2018-02-15 12:45:49
【Question description】:

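My problem is that I cannot form a correct request to Google's servers. I have tried sending all the request headers that my browser (Chrome) uses, but that does not seem to work. The end goal is to specify a search term, a resolution, and the jpg file type in the request, and to download the images to a folder. Any advice is welcome, and thanks in advance.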

Here is my code so far:

def funRequestsDownload(searchTerm):
    print("Getting image for track ", searchTerm)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36', 'content-length': bytes(searchTerm, 'utf-8')}
    queryStringParameters = {'hl': "en", "tbm": "isch", "source": "hp", "biw":1109, "bih": 475, "q": "SEARCH TERMS", "oq":"meme", "gs_l":"img.3..35i39k1j0l9.21651.21983.0.22205.10.10.0.0.0.0.131.269.2j1.3.0....0...1.1.64.img..7.3.267.0.4mTf5BYtfj8"}
    payload = {'value': searchTerm}
    url = 'http://www.google.co.uk'
    dataDump = requests.get(url, data=payload, headers=headers, "Query String Parameters"=queryStringParameters)
    temp = dataDump.content
    with open('C:/Users/Jordan/Desktop/Music Program/temp.html', 'w') as file:
        file.write(str(temp))
        file.close
    return(temp)
    print("Downloaded image for track ", searchTerm)

Side note: I know that all I am saving is the page's HTML. That is because the request returns a bad request page, and I want to see that error.

【Question comments】:

    Tags: python html python-requests google-image-search


    【Solution 1】:

    Google doesn't like people using scraping to access search results. They would prefer that you use their API.

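    The API they provide is called Google Custom Search, and it supports image search. To use it you need an API key and a search engine ID (the key and cx values in the URL below); the key authenticates each API call.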

    The URL you want to request is

    searchUrl = "https://www.googleapis.com/customsearch/v1?q=" + \
                 searchTerm + "&start=" + startIndex + "&key=" + key + "&cx=" + cx + \
                 "&searchType=image"
    

    Pass it to requests to get a JSON document with the results.
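A minimal sketch of that call, under a few assumptions: the function names are hypothetical, the standard library is used so the example has no extra dependencies, and the response is assumed to list results under `items`, each with a `link` field. Building the query with `urlencode` also takes care of spaces and special characters in the search term, which plain string concatenation does not.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/customsearch/v1"


def build_search_url(search_term, key, cx, start_index=1):
    """Return the Custom Search request URL with every value URL-encoded."""
    params = {
        "q": search_term,
        "start": start_index,
        "key": key,
        "cx": cx,
        "searchType": "image",
    }
    return API_URL + "?" + urlencode(params)


def extract_image_links(payload):
    """Collect the "link" field of each result in a decoded JSON response."""
    # Assumption: results sit under "items"; a response without it yields [].
    return [item["link"] for item in payload.get("items", [])]


def search_images(search_term, key, cx, start_index=1):
    """Call the API (needs network access and valid credentials)."""
    url = build_search_url(search_term, key, cx, start_index)
    with urllib.request.urlopen(url) as response:
        return extract_image_links(json.load(response))
```

With the requests library the same parameters could be passed as `requests.get(API_URL, params=params).json()` instead of building the URL by hand.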

    Here's some further reading.

    【Comments】:

      【Solution 2】:

      First, change http://www.google.co.uk to http://www.google.co.uk/search; requesting the bare domain is probably the cause of the bad response.
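As a rough illustration of the corrected request: in requests, query string values go in the `params` argument, and a GET request needs neither a body nor a hand-written `content-length` header. The sketch below only prepares the request so the final URL can be inspected without contacting Google. The function name is hypothetical, and using https instead of http is an assumption.

```python
import requests

SEARCH_URL = "https://www.google.co.uk/search"  # assumption: https variant


def prepare_image_search(search_term):
    """Build, but do not send, a GET request for an image search."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/60.0.3112.113 Safari/537.36"
    }
    # requests URL-encodes these values and appends them to the URL.
    params = {"q": search_term, "tbm": "isch", "hl": "en"}
    request = requests.Request("GET", SEARCH_URL, params=params, headers=headers)
    return request.prepare()


prepared = prepare_image_search("cat meme")
print(prepared.url)
```

To actually send it, `requests.get(SEARCH_URL, params=params, headers=headers)` does the preparing and sending in one call.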

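      To scrape images from Google Images, you need to parse the data located in the <script> tags of the page source (ctrl+u). Here are the steps you need to take (simplified, but very close to the actual code below):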

      1. Find all <script> tags:
      all_script_tags = soup.select('script')
      
      2. Match the image data from the <script> tags via regex:
      matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
      
      3. Match the desired (full-resolution) images via regex:
      matched_images_data_fix = json.dumps(matched_images_data)
      matched_images_data_json = json.loads(matched_images_data_fix)
      
      matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                          matched_images_data_json)
      
      4. Extract and decode them using bytes() and decode():
      for fixed_full_res_image in matched_google_full_resolution_images:
          original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
          original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
      
      5. To save the images, you can use urllib.request.urlretrieve, which is probably one of the simplest solutions.
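The regex matching and decoding steps above can be tried offline on a small string. This is only a sketch: `SAMPLE_HTML` is a made-up fragment, not real Google markup, and the helper names are hypothetical. The two regular expressions are the ones from the steps above. The `json.dumps`/`json.loads` round trip is left out because it returns the same string.

```python
import re

# Made-up page fragment for illustration; real Google markup is far larger
# and changes over time.
SAMPLE_HTML = (
    r'<script>AF_initDataCallback({data:[0,,'
    r'["https://example.com/cat.jpg?w\u003d500",800,600]]});</script>'
)


def decode_twice(text):
    """Apply the unicode-escape decoding two times, as in the steps above."""
    # Note: bytes(..., "ascii") raises if the text holds non-ASCII characters.
    once = bytes(text, "ascii").decode("unicode-escape")
    return bytes(once, "ascii").decode("unicode-escape")


def extract_full_resolution_urls(html):
    """Pull the callback payload, match image entries, decode each URL."""
    callback_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", html))
    matches = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", callback_data)
    return [decode_twice(match) for match in matches]


print(extract_full_resolution_urls(SAMPLE_HTML))
# → ['https://example.com/cat.jpg?w=500']
```

The second decoding pass matters when the escapes are doubled (for example `\\u003d`); on text that is already decoded and pure ASCII it leaves the string unchanged.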

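      Sometimes nothing gets downloaded because the request was sent by a script (a bot). If you want to fetch images from Google Images or other search engines, you need to pass a user-agent first and then download the image; otherwise the request may be blocked and an error is raised.

      Pass the user-agent to urllib.request and download the image: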

      opener=urllib.request.build_opener()
      opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
      urllib.request.install_opener(opener)
      
      urllib.request.urlretrieve(URL, 'your_folder/image_name.jpg')
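An alternative sketch, in case installing a global opener is not desirable: the User-Agent can be attached to a single `urllib.request.Request` instead. The helper names and the choice to create the target folder automatically are assumptions for illustration, not part of the answer above.

```python
import urllib.request
from pathlib import Path

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)


def build_image_request(url):
    """Attach the User-Agent to this request only, leaving global state alone."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def download_image(url, destination):
    """Save the response body to destination, creating its folder if needed."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(build_image_request(url)) as response:
        target.write_bytes(response.read())
    return target
```

A call such as `download_image(original_size_img, "your_folder/image_name.jpg")` would then replace the `urlretrieve` line.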
      

      Code to scrape and download images locally, with an example in the online IDE:

      import requests, lxml, re, json, urllib.request
      from bs4 import BeautifulSoup
      
      headers = {
          "User-Agent":
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
      }
      
      params = {
          "q": "cat",      # query
          "tbm": "isch",   # image results
          "hl": "en",      # language
          "ijn": "0",      # batch of 100 images. "1" is another 100 images and so on.
      }
      
      html = requests.get("https://www.google.com/search", params=params, headers=headers)
      soup = BeautifulSoup(html.text, 'lxml')
      
      
      def get_images_data():
      
          print('\nGoogle Images Metadata:')
          for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
              title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
              source = google_image.select_one('.fxgdke').text
              link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
              print(f'{title}\n{source}\n{link}\n')
      
          # these steps could be refactored into something more compact
          all_script_tags = soup.select('script')
      
          # https://regex101.com/r/48UZhY/4
          matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
          
          # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
          # if you try to json.loads() without json.dumps() it will throw an error:
          # "Expecting property name enclosed in double quotes"
          matched_images_data_fix = json.dumps(matched_images_data)
          matched_images_data_json = json.loads(matched_images_data_fix)
      
          # https://regex101.com/r/pdZOnW/3
          matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)
      
          # https://regex101.com/r/NnRg27/1
          matched_google_images_thumbnails = ', '.join(
              re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                         str(matched_google_image_data))).split(', ')
      
          print('Google Image Thumbnails:')  # in order
          for fixed_google_image_thumbnail in matched_google_images_thumbnails:
              # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
              google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
      
              # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
              google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
              print(google_image_thumbnail)
      
          # removing previously matched thumbnails for easier full resolution image matches.
          removed_matched_google_images_thumbnails = re.sub(
              r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))
      
          # https://regex101.com/r/fXjfb1/4
          # https://stackoverflow.com/a/19821774/15164646
          matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                             removed_matched_google_images_thumbnails)
      
      
          print('\nFull Resolution Images:')  # in order
          for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
              # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
              original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
              original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
              print(original_size_img)
      
              # ------------------------------------------------
              # Download original images
      
              # Uncomment the lines below to download each image.
              # print(f'Downloading {index} image...')

              # opener=urllib.request.build_opener()
              # opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
              # urllib.request.install_opener(opener)

              # urllib.request.urlretrieve(original_size_img, f'Bs4_Images/original_size_img_{index}.jpg')


      get_images_data()
      

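      Alternatively, you can use the Google Images API from SerpApi to achieve the same thing. It is a paid API with a free plan.

      The difference is that you don't have to deal with regular expressions, bypass blocks from Google, or maintain the code over time when something breaks (for example when the HTML changes). Instead, you only need to iterate over structured JSON and get the data you want.

      Example code to integrate: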

      import os, json # json for pretty output
      from serpapi import GoogleSearch
      
      def get_google_images():
          params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google",
            "q": "pexels cat",
            "tbm": "isch"
          }
      
          search = GoogleSearch(params)
          results = search.get_dict()
      
          print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
      
      
      get_google_images()
      
      ---------------
      '''
      [
      ... # other images 
        {
          "position": 100, # img number
          "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
          "source": "pexels.com",
          "title": "Close-up of Cat · Free Stock Photo",
          "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
          "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
          "is_product": false
        }
      ]
      '''
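Iterating over that structured JSON could look roughly like this. The helper name is hypothetical, and `sample_results` simply reuses the entry printed above in place of a live API response.

```python
def collect_original_links(results):
    """Return the "original" URL of every image result that has one."""
    return [
        image["original"]
        for image in results.get("images_results", [])
        if "original" in image
    ]


# Stand-in for search.get_dict(), copied from the output shown above.
sample_results = {
    "images_results": [
        {
            "position": 100,
            "source": "pexels.com",
            "title": "Close-up of Cat · Free Stock Photo",
            "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
            "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
            "is_product": False,
        }
    ]
}

for url in collect_original_links(sample_results):
    print(url)
```

Each returned URL can then be handed to whatever download routine is in use.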
      

      P.S. I wrote more in-depth blog posts about how to scrape Google Images and how to reduce the chance of being blocked while web scraping search engines.

      Disclaimer: I work for SerpApi.

      【Comments】:
