[Posted]: 2022-10-05 23:20:53
[Problem description]:
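I am trying to use the PyPDF2 module to read text from PDF documents. When I call extractText() on a page object, some PDFs raise the error below. Here is the traceback from the call. I don't understand why this happens for some PDFs and not for others. Every PDF I am reading has selectable, searchable text. Unfortunately, I cannot share any of the PDF files as examples.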
  File "C:\Python39\lib\site-packages\PyPDF2\pdf.py", line 2595, in extractText
    content = ContentStream(content, self.pdf)
  File "C:\Python39\lib\site-packages\PyPDF2\pdf.py", line 2674, in __init__
    self.__parseContentStream(stream)
  File "C:\Python39\lib\site-packages\PyPDF2\pdf.py", line 2706, in __parseContentStream
    operands.append(readObject(stream, None))
  File "C:\Python39\lib\site-packages\PyPDF2\generic.py", line 66, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "C:\Python39\lib\site-packages\PyPDF2\generic.py", line 582, in readFromStream
    elif pdf.strict:
AttributeError: 'NoneType' object has no attribute 'strict'
When I print the page object on which I call extractText(), I get the following output:
{'/Tabs': '/S', '/Group': {'/S': '/Transparency', '/Type': '/Group', '/CS': '/DeviceRGB'}, '/Contents': [IndirectObject(1, 0), IndirectObject(9, 0), IndirectObject(10, 0), IndirectObject(11, 0), IndirectObject(2, 0)], '/Type': '/Page', '/Resources': {'/ExtGState': {'/GS7': IndirectObject(12, 0), '/GS8': IndirectObject(13, 0)}, '/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI'], '/XObject': {'/Xi6': IndirectObject(3, 0), '/Xi4': IndirectObject(14, 0), '/Xi5': IndirectObject(15, 0), '/Xi2': IndirectObject(16, 0), '/Xi3': IndirectObject(17, 0), '/Image22': IndirectObject(18, 0), '/Image11': IndirectObject(19, 0)}, '/Font': {'/F7': IndirectObject(20, 0), '/Xi1': IndirectObject(21, 0), '/F1': IndirectObject(22, 0), '/F2': IndirectObject(23, 0), '/F3': IndirectObject(24, 0), '/F4': IndirectObject(25, 0), '/F5': IndirectObject(26, 0), '/F6': IndirectObject(27, 0)}, '/Properties': {'/Xi0': IndirectObject(28, 0)}}, '/StructParents': 0, '/Parent': IndirectObject(29, 0), '/MediaBox': [0, 0, 612, 792]}
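
To narrow down which pages trigger the error, the extractText() call described in the question can be wrapped per page. This is only a minimal sketch. It assumes the PyPDF2 1.x API (PdfFileReader, getNumPages, getPage, extractText) implied by the PyPDF2\pdf.py path in the traceback, and the helper names extract_text_per_page and extract_from_file are hypothetical. It isolates the failing pages; it does not fix the underlying parsing problem.

```python
def extract_text_per_page(reader):
    """Call extractText() on every page and record which pages fail.

    `reader` is assumed to behave like a PyPDF2 1.x PdfFileReader.
    Returns (texts, failures): two dicts keyed by zero-based page index.
    """
    texts = {}
    failures = {}
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        try:
            texts[index] = page.extractText()
        except AttributeError as exc:
            # The traceback in the question ends in an AttributeError,
            # so only that exception type is caught here.
            failures[index] = str(exc)
    return texts, failures


def extract_from_file(path):
    """Hypothetical convenience wrapper; `path` is supplied by the caller."""
    import PyPDF2  # assumption: a PyPDF2 1.x release is installed

    with open(path, "rb") as handle:
        reader = PyPDF2.PdfFileReader(handle)
        # Extraction must happen while the file is still open.
        return extract_text_per_page(reader)
```

Printing the keys of the failures dict shows which page indices raise the AttributeError, which may help when comparing a failing PDF against one that works.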