【Question title】: How to extract text from an online PDF using pdfminer in Python
【Posted】: 2020-02-10 21:47:45
【Question description】:

I want to extract text from an online PDF with pdfminer using the code below. It raises no error, but the output is empty.

from pdfminer.pdfpage import PDFPage
from urllib import request
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    PDFPage.get_pages(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = request.urlopen("https://www.jstage.jst.go.jp/article/cancer/9/0/9_KJ00003588219/_pdf/-char/en")
outputString = readPDF(pdfFile)
print(outputString)

【Question comments】:

    Tags: python web-scraping pdfminer


    【Solution 1】:

    I suggest using the pdftotext library to extract the text.

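    import pdftotext

    document_name = "document.pdf"  # placeholder: path to a local PDF file
    # open in binary mode and close the file automatically when done
    with open(document_name, 'rb') as fh:
        pdf = pdftotext.PDF(fh)
        # each page is a str; join them into one string
        text = "".join(pdf)
    print(text)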
    

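    【Comments】:

    • pdftotext does not install on Windows; I tried it.
    • Please follow this link, it may help you: stackoverflow.com/questions/52336495/…
    • path = 'localpath\\pdftotext.exe' import subprocess subprocess.call([path]) fh = open("jstage.jst.go.jp/article/cancer/9/0/9_KJ00003588219/_pdf/-char/…", 'rb') pdf = subprocess.PDF(fh) text = "" for page in pdf: text += page print(text)
    • Please check the code above; when executed it only prints the options. Can you help me?
    • Hello, sorry for the late reply. Please check whether you are actually getting the PDF file in the fh variable.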
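
The check suggested in the last comment can be sketched with the standard library alone. Note that the built-in `open` only reads local paths, so a URL has to be downloaded first. The helper names `looks_like_pdf` and `fetch_pdf` and the 1024-byte search window are assumptions made for illustration; they do not come from this thread or from any PDF library.

```python
import io
import urllib.request

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload carries the PDF header near its start."""
    # The header normally sits at offset 0. The 1024-byte window is a
    # tolerance chosen for this sketch, not a rule taken from the thread.
    return PDF_MAGIC in data[:1024]


def fetch_pdf(url: str, timeout: float = 30.0) -> io.BytesIO:
    """Download url and return its body as a seekable in-memory file."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    if not looks_like_pdf(data):
        # An HTML error page or a login redirect ends up here.
        raise ValueError(f"{url!r} did not return a PDF (got {data[:20]!r})")
    return io.BytesIO(data)
```

If `fetch_pdf` raises, the server sent something other than a PDF, and no extraction library will produce text from it.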
    【Solution 2】:

    The following code works on Python 3.7.4

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.layout import LAParams
    from pdfminer.converter import TextConverter
    from pdfminer.pdfpage import PDFPage
    import io
    import urllib.request
    import requests
    
    
    def pdf_to_text(pdf_file):
        text_memory_file = io.StringIO()
    
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, text_memory_file, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # get first 3 pages of the pdf file
        for page in PDFPage.get_pages(pdf_file, pagenos=(0, 1, 2)):
            interpreter.process_page(page)
        text = text_memory_file.getvalue()
        text_memory_file.close()
        return text
    
    # # online pdf to text by urllib
    # online_pdf_file=urllib.request.urlopen('http://www.dabeaz.com/python/UnderstandingGIL.pdf')
    # pdf_memory_file=io.BytesIO()
    # pdf_memory_file.write(online_pdf_file.read())
    # print(pdf_to_text(pdf_memory_file))
    
    
    # online pdf to text by requests
    response = requests.get('http://www.dabeaz.com/python/UnderstandingGIL.pdf')
    pdf_memory_file = io.BytesIO()
    pdf_memory_file.write(response.content)
    print(pdf_to_text(pdf_memory_file))
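
A likely reason the code in the question prints nothing and raises no error: `PDFPage.get_pages` is a generator function, so `PDFPage.get_pages(rsrcmgr, device, pdfFile)` only creates a generator that is never iterated, and no `PDFPageInterpreter` ever processes a page. The code above differs on both points. The toy below illustrates the silent behaviour; its `get_pages` is a hypothetical stand-in written for this sketch, not pdfminer code.

```python
import io


def get_pages(fp):
    """Toy stand-in for a generator-based page iterator (hypothetical)."""
    # Nothing in this body runs until the caller starts iterating.
    if not hasattr(fp, "read"):
        raise TypeError("expected a binary file-like object")
    yield from fp.read().split(b"\f")


# Wrong argument, result discarded: no error is raised and no work is done.
unused = get_pages("not a file")

# Iterating is what executes the body, so the error only surfaces here.
try:
    list(get_pages("not a file"))
    raised = False
except TypeError:
    raised = True

# With a file-like object and an actual loop, pages are produced.
pages = list(get_pages(io.BytesIO(b"page one\fpage two")))
```

The same reasoning explains why the question's version needs a `for page in PDFPage.get_pages(...)` loop that calls `interpreter.process_page(page)` before `retstr.getvalue()` can return any text.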
    

    【Comments】:
