【发布时间】:2020-02-10 21:47:45
【问题描述】:
我想使用 pdfminer 使用下面的代码从在线 PDF 中提取文本,它没有显示错误但输出什么都没有
from pdfminer.pdfpage import PDFPage
from urllib import request
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open
def readPDF(pdfFile):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, laparams=laparams)
PDFPage.get_pages(rsrcmgr, device, pdfFile)
device.close()
content = retstr.getvalue()
retstr.close()
return content
pdfFile = request.urlopen("https://www.jstage.jst.go.jp/article/cancer/9/0/9_KJ00003588219/_pdf/-char/en")
outputString = readPDF(pdfFile)
print(outputString)
【问题讨论】:
标签: python web-scraping pdfminer