【Question title】: Text Scraping a PDF with Python (pdfquery)
【Posted】: 2018-10-06 20:36:19
【Question】:

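I need to scrape some PDF files to extract the following text information:

I tried to do this with pdfquery by working from an example I found on Reddit (see the first post): https://www.reddit.com/r/Python/comments/4bnjha/scraping_pdf_files_with_python/

I wanted to test it by extracting the license numbers. I went into the generated "xmltree" file, found the first license number, and took the x0, y0, x1, y1 coordinates from its LTTextLineHorizontal element.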

import pdfquery
from lxml import etree


PDF_FILE = 'C:\\TEMP\\ad-4070-20-september-2018.pdf'

pdf = pdfquery.PDFQuery(PDF_FILE)
pdf.load(4,5)

with open('xmltree.xml','wb') as f:
    f.write(etree.tostring(pdf.tree, pretty_print=True))

product_info = []
page_count = len(pdf._pages)
for pg in range(page_count):
    data = pdf.extract([
        ('with_parent', 'LTPage[pageid="{}"]'.format(pg+1)),
        ('with_formatter', None),
        ('product_name', 'LTTextLineHorizontal:in_bbox("89.904, 757.502, 265.7, 770.83")'),
        ('product_details', 'LTTextLineHorizontal:in_bbox("223, 100, 737, 1114")'),
    ])
    for ix, pn in enumerate(sorted([d for d in data['product_name'] if d.text.strip()], key=lambda x: x.get('y0'), reverse=True)):
        product_info.append({'Manufacturer': pn.text.strip(), 'page': pg, 'y_start': float(pn.get('y1')), 'y_end': float(pn.get('y1'))-150})
        # if this is not the first product on the page, update the previous product's y_end with a
        # value slightly greater than this product's y coordinate start
        if ix > 0:
            product_info[-2]['y_end'] = float(pn.get('y0'))
    # for every product found on this page, find the detail information that falls between the
    # y coordinates belonging to the product
    for product in [p for p in product_info if p['page'] == pg]:
        details = []
        for d in sorted([d for d in data['product_details'] if d.text.strip()], key=lambda x: x.get('y0'), reverse=True):
            if  product['y_start'] > float(d.get('y0')) > product['y_end']:
                details.append(d.text.strip())
        product['Details'] = ' '.join(details)
pdf.file.close()

for p in product_info:
    print('Manufacturer: {}\r\nDetail Info:{}...\r\n\r\n'.format(p['Manufacturer'], p['Details'][0:100]))

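However, when I run it, nothing is printed. There are no errors, the XML file is generated fine, and I took the coordinates directly from the XML file, so they should be correct. What am I doing wrong?

【Comments】:

    Tags: python pdf pdfminer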


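    【Solution 1】:

    For extracting text from PDF files, my favorite tool is pdftotext.

    With the -layout option you essentially get plain text, which is relatively easy to manipulate with Python.

    An example is below: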

    """Extract text from PDF files.
    
    Requires pdftotext from the poppler utilities.
    On unix/linux install them using your favorite package manager.
    
    Binaries for ms-windows can be found at:
    1) http://blog.alivate.com.au/poppler-windows/
    2) https://sourceforge.net/projects/poppler-win32/
    """
    
    import subprocess
    
    
    def pdftotext(pdf, page=None):
        """Retrieve all text from a PDF file.
    
        Arguments:
            pdf: Path of the file to read.
            page: Number of the page to read. If None, read all the pages.
    
        Returns:
            A list of lines of text.
        """
        if page is None:
            args = ['pdftotext', '-layout', '-q', pdf, '-']
        else:
            args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
                    '-q', pdf, '-']
        try:
            txt = subprocess.check_output(args, universal_newlines=True)
            lines = txt.splitlines()
        except subprocess.CalledProcessError:
            lines = []
        return lines
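
As a rough sketch of the "easy to manipulate with Python" step, the lines returned by the function above could be split into columns on runs of two or more spaces, which is how -layout output typically separates them. The helper names (`split_columns`, `find_rows`), the sample lines, and the `PL 12345/0001` licence-number format are all hypothetical, since the actual PDF layout is not shown in the question.

```python
import re


def split_columns(line):
    """Split one line of layout-preserved text on runs of 2+ spaces."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def find_rows(lines, pattern):
    """Return the split columns of every line that matches the regex."""
    regex = re.compile(pattern)
    return [split_columns(line) for line in lines if regex.search(line)]


# Hypothetical sample of what pdftotext -layout might produce.
sample = [
    "Manufacturer        Licence number     Product",
    "Acme Pharma Ltd     PL 12345/0001      Examplecillin 250 mg",
    "",
    "Other Labs          PL 54321/0042      Samplezole 10 mg",
]

# Assumed licence-number pattern; adjust it to the real document.
rows = find_rows(sample, r"PL \d{5}/\d{4}")
for row in rows:
    print(row)
```

With a real file, `sample` would be replaced by the result of `pdftotext(path)`, and the pattern would need adapting to the licence format actually used in the document.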
    

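    【Comments】:

      【Solution 2】:

      I just ran the code from your Reddit link and it works fine. I don't have your exact PDF document, but I think your bbox arguments are too tight. Specifically, you use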

      ('product_name', 'LTTextLineHorizontal:in_bbox("89.904, 757.502, 265.7, 770.83")'),
      

      but you should use either

      ('product_name', 'LTTextLineHorizontal:in_bbox("88, 756, 267, 772")'),

      or

      ('product_name', 'LTTextLineHorizontal:overlaps_bbox("89.904, 757.502, 265.7, 770.83")'),
      

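      because "in_bbox" requires the text to fit entirely inside the box, whereas "overlaps_bbox" only requires the text to overlap the box. The same applies to 'product_details'. Note that the author of the script in your Reddit link used the first option.

      【Comments】: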
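
The difference between the two selectors can be illustrated with plain rectangle geometry. This is only a sketch of the idea, not pdfquery's own implementation; the function names are made up, and the "actual" text box below is a hypothetical example of coordinates that differ slightly from the rounded values shown in the XML.

```python
def fits_in(inner, outer):
    """True if rectangle `inner` lies entirely inside rectangle `outer`."""
    ix0, iy0, ix1, iy1 = inner
    ox0, oy0, ox1, oy1 = outer
    return ix0 >= ox0 and iy0 >= oy0 and ix1 <= ox1 and iy1 <= oy1


def overlaps(a, b):
    """True if rectangles `a` and `b` share at least one point."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and ax1 >= bx0 and ay0 <= by1 and ay1 >= by0


# Hypothetical unrounded box of the text line (an assumption for illustration).
text_box = (89.9036, 757.5018, 265.7004, 770.8304)

tight = (89.904, 757.502, 265.7, 770.83)   # coordinates copied from the XML
padded = (88, 756, 267, 772)               # same box with a small margin

print(fits_in(text_box, tight))    # containment test against the tight box
print(fits_in(text_box, padded))   # containment test against the padded box
print(overlaps(text_box, tight))   # overlap test against the tight box
```

Under these assumed numbers the containment test fails for the tight box but passes for the padded one, while the overlap test passes even with the tight box, which matches the two fixes suggested above.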
