Make sure you send a user-agent header; otherwise Google may eventually block the request and you will receive completely different HTML. Check what your user-agent is.
Pass the user-agent:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get(URL, headers=headers)
First, iterate over all organic results:
for index, result in enumerate(soup.select('.tF2Cxc')):
    # code
    # enumerate() provides an index value for each iteration,
    # which is handy at the saving stage for f-string file names, e.g. file_0, file_1, file_2...
Check whether a PDF is present with an if statement:
if result.select_one('.ZGwO7'):
    pdf_file = result.select_one('.yuRUbf a')['href']
    # other code
else:
    pass
To save the .pdf file locally, you can use urllib.request.urlretrieve:
urllib.request.urlretrieve(pdf_file, "YOUR_FOLDER(s)/YOUR_PDF_FILE_NAME.pdf")
# if saving in the same folder, remove the "YOUR_FOLDER(s)/" part
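Note that urlretrieve does not create missing folders, so the target folder has to exist before saving. A minimal sketch of preparing the destination path first; the helper name `build_pdf_path` and the `pdf_file_{index}.pdf` naming are assumptions for illustration, not part of any library:

```python
from pathlib import Path


def build_pdf_path(folder, index):
    """Create `folder` if needed and return the path for the index-th PDF.

    Hypothetical helper, for illustration only.
    """
    target_dir = Path(folder)
    # parents=True creates nested folders; exist_ok=True makes repeat calls safe
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / f"pdf_file_{index}.pdf"
```

The result can then be passed as the second argument, e.g. `urllib.request.urlretrieve(pdf_file, str(build_pdf_path("bs4_pdfs", index)))`.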
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml, urllib.request
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "best lasagna recipe:pdf"
}
def get_pdfs():
    html = requests.get('https://www.google.com/search', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')
    for index, result in enumerate(soup.select('.tF2Cxc')):
        # check whether a PDF is present via the corresponding CSS class
        if result.select_one('.ZGwO7'):
            pdf_file = result.select_one('.yuRUbf a')['href']
            # send a user-agent with the download request as well
            opener = urllib.request.build_opener()
            opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
            urllib.request.install_opener(opener)
            # save PDF (the bs4_pdfs folder must already exist)
            urllib.request.urlretrieve(pdf_file, f"bs4_pdfs/pdf_file_{index}.pdf")
            print(f'Saving PDF №{index}..')
        else:
            pass
get_pdfs()
-------
'''
Saving PDF №0..
Saving PDF №1..
Saving PDF №2..
...
8 pdf's saved to the desired folder
'''
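The user-agent string is needed in two places: the headers passed to requests.get and the opener used by urlretrieve. A minimal sketch of defining it once and installing the opener a single time before the loop, using only the standard library; the names `USER_AGENT` and `install_user_agent_opener` are assumptions for illustration:

```python
import urllib.request

# Same user-agent string as in the example, defined once
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)

# Headers to pass to requests.get(..., headers=headers)
headers = {"User-agent": USER_AGENT}


def install_user_agent_opener():
    """Install a global urllib opener that sends USER_AGENT (hypothetical helper)."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", USER_AGENT)]
    # After this call, urlopen() and urlretrieve() go through this opener
    urllib.request.install_opener(opener)
    return opener
```

Calling `install_user_agent_opener()` once before the loop should be enough, since the installed opener stays in effect for later urlretrieve calls.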
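Alternatively, you can achieve this by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out how to extract certain parts or elements, since that is already done for the end user.
Code to integrate: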
from serpapi import GoogleSearch
import os, urllib.request
def get_pdfs():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "best lasagna recipe:pdf",
        "hl": "en"
    }
    search = GoogleSearch(params)
    results = search.get_dict()
    for index, result in enumerate(results['organic_results']):
        if '.pdf' in result['link']:
            pdf_file = result['link']
            opener = urllib.request.build_opener()
            opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
            urllib.request.install_opener(opener)
            # save PDF
            urllib.request.urlretrieve(pdf_file, f"serpapi_pdfs/pdf_file_{index}.pdf")
            print(f'Saving PDF №{index}..')
        else:
            pass
get_pdfs()
-------
'''
Saving PDF №0..
Saving PDF №1..
Saving PDF №2..
...
8 pdf's saved to the desired folder
'''
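The `'.pdf' in result['link']` check matches the substring anywhere in the URL. A slightly stricter sketch, using only the standard library, checks that the URL path ends with `.pdf`. The helper name `looks_like_pdf` is an assumption for illustration, and this remains a heuristic, since a server can deliver a PDF from a URL without that suffix:

```python
from urllib.parse import urlparse


def looks_like_pdf(url):
    """Return True if the URL path ends with .pdf (hypothetical helper)."""
    # urlparse separates the path from the query string and fragment
    path = urlparse(url).path
    return path.lower().endswith(".pdf")
```

In the loop it could replace the substring test, e.g. `if looks_like_pdf(result['link']):`.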
Also, you can use the camelot library to extract data from .pdf files.
Disclaimer: I work for SerpApi.