Translate a PDF file while keeping the formatting (Python): answers

【Question title】: Translate a PDF file while keeping the formatting (Python)
【Posted】: 2018-07-12 10:54:17
【Question】:

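I am trying to translate a PDF file with a translation API and output the result as a PDF with the same formatting. My approach is to convert the PDF to a Word document, translate that file, and then convert it back to PDF. The problem is that there is no reliable way to convert a PDF to Word. I started writing my own converter, but PDFs contain many kinds of formatting, so handling all of them would take considerable effort. So my question is: is there an effective way to translate a PDF without losing its formatting, or an effective way to convert it to docx? I am using Python.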

【Comments】:

Try referring to this answer: stackoverflow.com/questions/26358281/…
@DanielIsaac Thanks for the reply, but I tried that solution and the current LibreOffice does not support this feature.

Tags: python pdf docx

【Answer 1】:

Probably not.

PDFs are not really meant to be machine-readable or editable; they describe formatted, laid-out, printable pages.
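To make that point concrete, here is a minimal sketch of what page content looks like inside a PDF. The `CONTENT_STREAM` fragment is hand-written for illustration (it is not taken from a real file), and `positioned_text` is a hypothetical helper that only understands the `Td` (move) and `Tj` (show text) operators. Real content streams are usually compressed and use font-specific encodings, so treat this as an illustration rather than a parser.

```python
import re

# A hand-written fragment of a PDF page content stream (illustrative only).
CONTENT_STREAM = """
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
0 -14 Td
(world) Tj
ET
"""


def positioned_text(stream):
    """Return (x, y, text) for each Tj operator, tracking Td line moves."""
    x = y = 0.0
    runs = []
    # Either "<dx> <dy> Td" (move to the next line start) or "(<text>) Tj".
    token = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td|\((.*?)\)\s*Tj")
    for match in token.finditer(stream):
        dx, dy, text = match.groups()
        if text is None:
            # Td offsets are relative to the start of the current line.
            x += float(dx)
            y += float(dy)
        else:
            runs.append((x, y, text))
    return runs


print(positioned_text(CONTENT_STREAM))
```

The page stores strings placed at coordinates, with no notion of paragraphs, headings, or tables, which is why a round trip through an editable format tends to lose structure.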

【Comments】:

【Answer 2】:

You can use pdfminer to extract the text here instead of an API:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Extract the text of every page of the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    # LAParams turns on layout analysis, which groups characters into lines and text boxes.
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # The with-block closes the file even if a page fails to parse.
    with open(path, 'rb') as fp:
        # An empty page-number set and maxpages=0 both mean "process all pages".
        pages = PDFPage.get_pages(fp, set(), maxpages=0, password="",
                                  caching=True, check_extractable=True)
        for page in pages:
            interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text
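Once the text is extracted, it still has to go through a translator. The sketch below shows one way to do that block by block. It assumes the extracted text separates blocks with blank lines, and `translate` is a hypothetical placeholder for whatever translation API you call; neither the helper name nor that callable comes from the answer above. Note that this produces plain translated text only, so fonts, positions, and images from the original PDF are not preserved.

```python
from typing import Callable


def translate_paragraphs(text: str, translate: Callable[[str], str]) -> str:
    """Translate `text` block by block, keeping the blank lines between blocks."""
    translated = []
    for paragraph in text.split("\n\n"):
        # Extracted lines are wrapped where the page wrapped them; rejoin them
        # so the translator (a hypothetical callable) sees whole sentences.
        joined = " ".join(line.strip() for line in paragraph.splitlines() if line.strip())
        translated.append(translate(joined) if joined else "")
    return "\n\n".join(translated)


# Demo with a stand-in "translator" that just upper-cases its input.
print(translate_paragraphs("Hello\nworld\n\nSecond block\n", str.upper))
```

In practice you would pass the result of `convert_pdf_to_txt(path)` as `text` and your own API wrapper as `translate`.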

【Comments】: