How to use the Python pdf2image module (thus poppler) on Google Cloud Function?

【Question title】: How to use the Python pdf2image module (thus poppler) on Google Cloud Function?
【Posted】: 2021-03-21 11:46:42
【Question】:

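I am trying to convert a PDF to JPEG on Google Cloud Functions. I used the Python module pdf2image, but I cannot work out how to fix the errors "No such file or directory: 'pdfinfo'" and "Unable to get page count. Is poppler installed and in PATH?" on Cloud Functions.

The error is very similar to this question. pdf2image is a wrapper around poppler's "pdftoppm" and "pdftocairo". But how do I install the poppler package on Google Cloud Functions and add it to PATH? I could not find any relevant reference. Is it even possible? If not, what can be done?

There is also this question, but it did not help.

The code is shown below. The entry point is process_image.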

import requests
from pdf2image import convert_from_path

def process_image(event, context):
    # Download sample pdf file
    url = 'https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf'
    r = requests.get(url, allow_redirects=True)
    open('/tmp/sample.pdf', 'wb').write(r.content)

    # The error occurs on this line
    pages = convert_from_path('/tmp/sample.pdf')

    # Save pages to /tmp
    for idx, page in enumerate(pages):
        output_file_path = f"/tmp/{str(idx)}.jpg"
        page.save(output_file_path, 'JPEG')
        # To be saved to cloud storage

requirements.txt:

requests==2.25.1
pdf2image==1.14.0

This is the error I get:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 441, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 1706, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/functions_framework/__init__.py", line 149, in view_func
    function(data, context)
  File "/workspace/main.py", line 11, in process_image
    pages = convert_from_path('/tmp/sample.pdf')
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 97, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 467, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Thanks in advance for your help.

【Comments】:

Tags: python image pdf google-cloud-functions poppler

【Solution 1】:

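This error occurs because the poppler package is not available in Cloud Functions: installing it requires writing files to the system, and in a serverless product such as Cloud Functions the file system is read-only apart from /tmp.

You may want to try the approach described in the other thread, Cloud Functions for Firebase - Converting PDF to image, or consider using GCP Compute Engine, where you have access to the whole system.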

【Comments】:

【Solution 2】:

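Cloud Functions does not support installing custom system-level packages (even though it supports third-party libraries for the relevant programming language through package managers such as npm and pip). As shown at https://cloud.google.com/functions/docs/reference/system-packages, there is no "poppler" package.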
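
To confirm from inside the function which command-line tools the runtime actually provides, a quick check along these lines can help. This is a minimal sketch: find_tools is a hypothetical helper name, and the default list of tool names is only an illustration.

```python
import shutil


def find_tools(names=("pdfinfo", "pdftoppm", "gs")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Log the result once at start-up to see what the runtime provides.
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found'}")
```

If pdfinfo maps to None, pdf2image will raise the PDFInfoNotInstalledError shown in the question.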

However, you can still use the other preinstalled packages. ghostscript can be used to convert a PDF to images.

First, save the PDF file inside the Cloud Function (for example, from Cloud Storage). You only have disk write access to /tmp (https://cloud.google.com/functions/docs/concepts/exec#file_system).
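
Because /tmp is the only writable location, and files left there count against the function's memory, it can be convenient to work in a private subdirectory and remove it when done. The following is a minimal sketch of that idea; scratch_dir and the "pdf-" prefix are hypothetical names chosen for illustration.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_dir(base="/tmp"):
    """Create a private working directory under `base` and delete it afterwards."""
    path = tempfile.mkdtemp(prefix="pdf-", dir=base)
    try:
        yield path
    finally:
        # Remove the directory and everything written into it.
        shutil.rmtree(path, ignore_errors=True)
```

Inside a `with scratch_dir() as workdir:` block, the input PDF and the generated images can be placed under workdir, for example os.path.join(workdir, "input.pdf").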

An example terminal command to convert a PDF to JPEG is shown below:

gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=jpeg -dJPEGQ=100 -r300 -sOutputFile=output/file/path input/file/path
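
The flags in this command can also be assembled as an argument list, which avoids shell quoting problems when paths contain spaces. This is a sketch under assumptions: build_gs_command and its parameter names are hypothetical, and the output pattern relies on Ghostscript's printf-style page numbering (for example %03d) so that each page of a multi-page PDF gets its own file.

```python
def build_gs_command(input_pdf, output_pattern, dpi=300, jpeg_quality=100):
    """Return the Ghostscript argument list for a PDF-to-JPEG conversion."""
    return [
        "gs",
        "-dSAFER",        # restrict Ghostscript's access to the file system
        "-dNOPAUSE",      # do not prompt between pages
        "-dBATCH",        # exit when the input has been processed
        "-sDEVICE=jpeg",  # write JPEG output
        f"-dJPEGQ={jpeg_quality}",
        f"-r{dpi}",       # rendering resolution in dots per inch
        f"-sOutputFile={output_pattern}",
        input_pdf,
    ]
```

For example, build_gs_command("/tmp/in.pdf", "/tmp/page-%03d.jpg") produces a list that can be passed directly to subprocess.run.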

Sample code that uses the command from Python:

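import logging
import subprocess

from google.cloud import storage

# This fragment runs inside the function body. bucket_name, file_name,
# input_file_path and output_file_path are assumed to be defined already,
# with both paths located under /tmp.

# Download the file from Google Cloud Storage
gcs = storage.Client()  # the project is inferred from the runtime environment
bucket = gcs.bucket(bucket_name)
blob = bucket.blob(file_name)
blob.download_to_filename(input_file_path)

# Run Ghostscript; an argument list avoids quoting problems with the paths
cmd = [
    'gs', '-dSAFER', '-dNOPAUSE', '-dBATCH', '-sDEVICE=jpeg', '-dJPEGQ=100', '-r300',
    f'-sOutputFile={output_file_path}', input_file_path,
]
p = subprocess.run(cmd, capture_output=True)
if p.returncode != 0:
    # Ghostscript may print warnings to stderr, so rely on the exit code
    logging.error(p.stderr.decode('utf8'))
    return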
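
Once Ghostscript has written the JPEG files to /tmp, they still need to be copied back to Cloud Storage, a step the question leaves as a to-do. The sketch below assumes that bucket is a google-cloud-storage Bucket object like the one created above; upload_pages and the prefix argument are hypothetical names.

```python
import glob
import os


def upload_pages(bucket, directory, prefix):
    """Upload every .jpg in `directory` to `bucket` under `prefix`; return the blob names."""
    uploaded = []
    # Sorting keeps pages in order when the file names are zero-padded.
    for path in sorted(glob.glob(os.path.join(directory, "*.jpg"))):
        blob_name = f"{prefix}/{os.path.basename(path)}"
        bucket.blob(blob_name).upload_from_filename(path, content_type="image/jpeg")
        uploaded.append(blob_name)
    return uploaded
```

The function only calls bucket.blob(...) and upload_from_filename(...), so it can be exercised locally with a small stand-in object before deploying.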

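Note: you may want to use the imagemagick package instead, which itself uses ghostscript. However, as described in Can't load PDF with Wand/ImageMagick in Google Cloud Function, reading PDFs with ImageMagick has been disabled because of a security vulnerability in Ghostscript at the time of writing (2021-07-12). The solution provided there is essentially another way of running ghostscript.

Reference: https://www.the-swamp.info/blog/google-cloud-functions-system-packages/

【Comments】: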