Is it possible to scrape Google for PDF files? Answers

【Question title】: Is it possible to scrape Google for PDF files?
【Posted】: 2020-09-30 16:18:55
【Question description】:

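Is it possible to scrape Google for PDF files? For example, downloading all the ".pdf" files from a given number of search results for a given term. Web scraping is fairly new to me, although I have been using beautifulsoup4, if that is an option.

Thanks in advance.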

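【Question comments】:

You should probably consider using Scrapy to complement BeautifulSoup. If by scraping Google you mean sending queries to Google and scraping the returned results, that is not easy, because it violates Google's user agreement. After a number of queries, Google will detect the unusual activity and start rerouting your requests to a separate page that requires manual user interaction (i.e. those CAPTCHA things), which makes scraping nearly impossible. However, if you are willing to pay for a Google App Engine account, you can do this legitimately. Search for "Google App Engine web scraping".

Tags: python web-scraping beautifulsoup search-engine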

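【Solution 1】:

This is what I would do.

1. Google lets you search by file type by adding filetype:[your file type extension (pdf)] to the query.
2. You can bypass the Google search page by using the direct URL and changing the query: https://www.google.com/search?q=these+are+keywords+filetype%3Apdf
3. You can use BeautifulSoup to find the URL of each search result (relevant question's answer). The most important part is that every search result has the class "g", so you can take the URL from each element with that class.
4. From there, you can use BeautifulSoup to find the direct URL of the PDF. The URL is in an "a" tag, in its href attribute (relevant question's answer).

I am not an expert, but maybe this is enough to get you started. Someone else may come up with a better approach.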
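The query-building step above can be sketched with the standard library alone. This is a minimal illustration, not part of the original answer: the function name build_pdf_search_url and the GOOGLE_SEARCH constant are hypothetical names chosen here.

```python
from urllib.parse import urlencode

# Base search endpoint, taken from the direct URL shown in the answer.
GOOGLE_SEARCH = "https://www.google.com/search"


def build_pdf_search_url(keywords: str) -> str:
    """Return a Google search URL whose query is restricted to PDF results."""
    # Append the filetype: operator, then let urlencode escape the query:
    # spaces become "+" and ":" becomes "%3A".
    query = f"{keywords} filetype:pdf"
    return f"{GOOGLE_SEARCH}?{urlencode({'q': query})}"


print(build_pdf_search_url("these are keywords"))
# https://www.google.com/search?q=these+are+keywords+filetype%3Apdf
```

Letting urlencode do the escaping avoids hand-writing %3A and keeps the URL valid when the keywords contain characters such as "&" or "#".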

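【Comments】:

【Solution 2】:

Make sure you send a user-agent header, because otherwise Google may eventually block the request and you will receive completely different HTML. You can check what your user-agent is.

Pass the user-agent: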

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get(URL, headers=headers)

First, iterate over all organic results:

for index, result in enumerate(soup.select('.tF2Cxc')):
  # code

# enumerate() was used to provide index values after each iteration 
# that will be handy at the saving stage to use them via f-string e.g: file_0,1,2,3..

Check whether a PDF is present with an if statement:

if result.select_one('.ZGwO7'):
  pdf_file = result.select_one('.yuRUbf a')['href']
  # other code
else: pass
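The check above relies on a CSS class in Google's generated markup, which may change over time. As a hedged fallback (an assumption added here, not part of the original answer), the link itself can be inspected. The helper name is_pdf_url is hypothetical, and the check will miss PDFs served from URLs that do not end in .pdf.

```python
from urllib.parse import urlparse


def is_pdf_url(url: str) -> bool:
    """Return True when the URL path ends in .pdf.

    Only the path is examined, so a query string or fragment after the
    file name does not affect the result. The comparison ignores case.
    """
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


print(is_pdf_url("https://example.com/docs/recipe.PDF?download=1"))  # True
print(is_pdf_url("https://example.com/pdf-guide.html"))              # False
```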

To save the .pdf file locally, you can use urllib.request.urlretrieve:

urllib.request.urlretrieve(pdf_file, "YOUR_FOLDER(s)/YOUR_PDF_FILE_NAME.pdf")
# if saving in the same folder, remove the "YOUR_FOLDER(s)/" part
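Because a blocked or redirected request can return an HTML page instead of the document, it may be worth confirming that what was saved is really a PDF. The sketch below is an addition, not part of the original answer. It relies on the fact that PDF files normally begin with the marker %PDF-; the names looks_like_pdf and file_looks_like_pdf are hypothetical.

```python
from pathlib import Path

# PDF files normally begin with this marker (for example "%PDF-1.7").
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True when the given bytes start with the PDF marker."""
    return data.startswith(PDF_MAGIC)


def file_looks_like_pdf(path: str) -> bool:
    """Read only the first few bytes of a saved file and check the marker."""
    with Path(path).open("rb") as fh:
        return looks_like_pdf(fh.read(len(PDF_MAGIC)))


print(looks_like_pdf(b"%PDF-1.7\n..."))          # True
print(looks_like_pdf(b"<!doctype html><html>"))  # False
```

After urlretrieve returns, file_looks_like_pdf can be called on the saved path, and files that fail the check can be deleted or logged.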

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml, urllib.request

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "best lasagna recipe filetype:pdf"
}

def get_pdfs():
    html = requests.get('https://www.google.com/search', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')

    for index, result in enumerate(soup.select('.tF2Cxc')):

      # check if PDF is present via according CSS class
      if result.select_one('.ZGwO7'):
        pdf_file = result.select_one('.yuRUbf a')['href']
        
        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        # save PDF (the bs4_pdfs folder must already exist)
        urllib.request.urlretrieve(pdf_file, f"bs4_pdfs/pdf_file_{index}.pdf")

        print(f'Saving PDF №{index}..')
      else: pass

get_pdfs()

-------
'''
Saving PDF №0..
Saving PDF №1..
Saving PDF №2..
...

8 pdf's saved to the desired folder
'''

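Alternatively, you can achieve this by using the Google Organic Results API from SerpApi. It is a paid API with a free plan.

The difference in your case is that you do not need to figure out how to extract certain parts or elements, since that is already done for the end user.

Code to integrate: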

from serpapi import GoogleSearch
import os, urllib.request

def get_pdfs():
    params = {
      "api_key": os.getenv("API_KEY"),
      "engine": "google",
      "q": "best lasagna recipe filetype:pdf",
      "hl": "en"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for index, result in enumerate(results['organic_results']):
      if '.pdf' in result['link']:
        pdf_file = result['link']

        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)

        # save PDF (the serpapi_pdfs folder must already exist)
        urllib.request.urlretrieve(pdf_file, f"serpapi_pdfs/pdf_file_{index}.pdf")

        print(f'Saving PDF №{index}..')
      else: pass

get_pdfs()

-------
'''
Saving PDF №0..
Saving PDF №1..
Saving PDF №2..
...

8 pdf's saved to the desired folder
'''
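Both scripts name the saved files by index (pdf_file_0.pdf, pdf_file_1.pdf, and so on). If more descriptive names are preferred, one option is to derive the name from the result link and fall back to the index-based name when the link does not contain a usable one. This is a sketch added for illustration. The function filename_from_link and the sample links are hypothetical and not part of either API.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_link(link: str, index: int) -> str:
    """Derive a local file name from a result link.

    Falls back to the index-based name used in the scripts above when
    the last path segment is not a usable .pdf name.
    """
    # Take the last path segment and decode percent-escapes such as %20.
    name = posixpath.basename(unquote(urlparse(link).path))
    # Keep letters, digits, dot, underscore and hyphen; replace the rest.
    safe = "".join(ch if ch.isalnum() or ch in "._-" else "_" for ch in name)
    if safe.lower() == ".pdf" or not safe.lower().endswith(".pdf"):
        return f"pdf_file_{index}.pdf"
    return safe


print(filename_from_link("https://example.com/docs/Best%20Lasagna.pdf?x=1", 0))
# Best_Lasagna.pdf
print(filename_from_link("https://example.com/download?id=5", 3))
# pdf_file_3.pdf
```

Two results can still share the same file name, so a real script may want to prefix the index or check whether the target file already exists.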

Also, you can use the camelot library to get data from .pdf files.

Disclaimer: I work for SerpApi.

【Comments】: