【Question title】: How to display a PDF file's contents as well as its full name in the browser using a CGI Python script?
【Posted】: 2014-09-18 02:33:08
【Question】:

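I want to display both the full path of a PDF file and its contents in the browser. My script is fed by an HTML input form in which the user enters a file name and submits it. The script searches for the file and, if it is found in a subdirectory, outputs the file contents to the browser and displays its name. I can display the contents, but not the full name at the same time; if I display the file name, the contents show up as garbage characters. Please advise.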

Script a.py:

import os
import cgi
import cgitb 
cgitb.enable()
import sys
import webbrowser

def check_file_extension(display_file):
    input_file = display_file
    nm,file_extension = os.path.splitext(display_file)
    return file_extension

form = cgi.FieldStorage()

type_of_file =''
file_nm = ''
nm =''
not_found = 3

if form.has_key("file1"):
    file_nm = form["file1"].value

type_of_file = check_file_extension(file_nm)

pdf_paths = [ '/home/nancy/Documents/',]

# Change the path while executing on the server , else it will throw error 500
image_paths = [ '/home/nancy/Documents/']


if type_of_file == '.pdf':
    search_paths = pdf_paths
else:
    # .jpg
    search_paths = image_paths
for path in search_paths:
    for root, dirnames, filenames in os.walk(path):
        for f in filenames:
            if f == str(file_nm).strip():
                absolute_path_of_file = os.path.join(root,f)
                # print 'Content-type: text/html\n\n'
                # print '<html><head></head><body>'
                # print absolute_path_of_file
                # print '</body></html>'
#                 print """Content-type: text/html\n\n
# <html><head>absolute_path_of_file</head><body>
# <img src=file_display.py />
# </body></html>"""
                not_found = 2
                if  search_paths == pdf_paths:
                    print 'Content-type: application/pdf\n'
                else:
                    print 'Content-type: image/jpg\n'
                file_read = file(absolute_path_of_file,'rb').read()
                print file_read
                print 'Content-type: text/html\n\n'
                print absolute_path_of_file
                break
        break
    break

if not_found == 3:
    print  'Content-type: text/html\n'
    # absolute_path_of_file is never assigned when nothing matched,
    # so report the requested name instead.
    print '%s not found' % file_nm

The HTML is an ordinary HTML page with just one input field for the file name.

【Question comments】:

Tags: python html cgi content-type

【Solution 1】:

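This is not possible, or at least not that simply. A single HTTP response carries exactly one Content-Type, so the script cannot send the PDF bytes and an HTML page with the file name in the same response. Some web browsers do not display PDFs at all and ask the user to download the file, some display them natively, some embed an external PDF viewer component, and some launch an external PDF viewer. There is no standard, cross-browser way to embed a PDF in HTML, which is what you would need in order to show arbitrary text together with the PDF content.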

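A fallback solution that works in every browser is to render the PDF pages as images on the server and serve those to the client. This puts some load on the server (CPU, memory/disk for caching, bandwidth).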
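
The caching mentioned above could look roughly like this in Python 3. It is a sketch under assumptions, not part of the original answer: the cache directory, the `cached_page` helper, and the `render` callback (any function that turns a PDF path and a zero-based page index into JPEG bytes) are all hypothetical names. The cache key includes the file's modification time, so a changed PDF is rendered again.

```python
import hashlib
import os
import tempfile

# Hypothetical cache location; any directory writable by the web server works.
CACHE_DIR = os.path.join(tempfile.gettempdir(), 'pdf_page_cache')


def cached_page(pdf_path, page_index, render):
    """Return the image bytes for one page, rendering only on a cache miss.

    render is a caller-supplied function (pdf_path, page_index) -> bytes.
    """
    mtime = os.path.getmtime(pdf_path)
    key = '{0}|{1}|{2}'.format(os.path.abspath(pdf_path), mtime, page_index)
    name = hashlib.sha256(key.encode('utf-8')).hexdigest() + '.jpg'
    cache_file = os.path.join(CACHE_DIR, name)

    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as handle:
            return handle.read()

    image_data = render(pdf_path, page_index)
    os.makedirs(CACHE_DIR, exist_ok=True)
    # Write to a temporary name first so a concurrent request never reads a
    # half-written file, then move it into place.
    temp_file = '{0}.{1}.tmp'.format(cache_file, os.getpid())
    with open(temp_file, 'wb') as handle:
        handle.write(image_data)
    os.replace(temp_file, cache_file)
    return image_data
```

Old cache entries are never deleted in this sketch; a real deployment would need some cleanup policy.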

Some modern browsers with HTML5 support can render PDFs on a canvas element with Mozilla's pdf.js.

For the others, you can try <embed>/<object> to use Adobe's plugin, as described on Adobe's The PDF Developer Junkie Blog.
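
A minimal sketch of the <embed>/<object> idea, written for Python 3 (the scripts in this thread are Python 2): the CGI script returns an HTML page that shows the full path as text and points an <object> element at a second URL that serves the raw PDF. The function name `build_viewer_page` and the `pdf_url` parameter are hypothetical, and whether the PDF actually appears inline still depends on the browser.

```python
import html


def build_viewer_page(full_path, pdf_url):
    """Return a CGI response (headers plus HTML) that names and embeds a PDF.

    full_path is only displayed as text; pdf_url must point at something
    that answers with Content-Type: application/pdf.
    """
    safe_path = html.escape(full_path)
    safe_url = html.escape(pdf_url, quote=True)
    return (
        'Content-Type: text/html; charset=utf-8\n\n'
        '<html><head><title>{path}</title></head><body>'
        '<h1>{path}</h1>'
        '<object data="{url}" type="application/pdf" width="100%" height="600">'
        '<p>This browser cannot show the PDF inline. '
        '<a href="{url}">Download it</a> instead.</p>'
        '</object>'
        '</body></html>'
    ).format(path=safe_path, url=safe_url)
```

The page and the PDF travel in two separate responses, each with its own Content-Type, which is what lets the name and the contents appear together.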

Rendering pages on the server

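Rendering and serving PDF pages as images requires some software on the server to query the number of pages and to extract and render a given page as an image.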

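The page count can be determined with the pdfinfo program from Xpdf or from the libpoppler command-line utilities. Pages can be converted from the PDF file to JPG images with convert from the ImageMagick tools. A very simple CGI program that uses these programs: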

#!/usr/bin/env python
import cgi
import cgitb; cgitb.enable()
import os
from itertools import imap
from subprocess import check_output

PDFINFO = '/usr/bin/pdfinfo'
CONVERT = '/usr/bin/convert'
DOC_ROOT = '/home/bj/Documents'

BASE_TEMPLATE = (
    'Content-type: text/html\n\n'
    '<html><head><title>{title}</title></head><body>{body}</body></html>'
)
PDF_PAGE_TEMPLATE = (
    '<h1>{filename}</h1>'
    '<p>{prev_link} {page}/{page_count} {next_link}</p>'
    '<p><img src="{image_url}" style="border: solid thin gray;"></p>'
)

SCRIPT_NAME = os.environ['SCRIPT_NAME']


def create_page_url(filename, page_number, type_):
    return '{0}?file={1}&page={2}&type={3}'.format(
        cgi.escape(SCRIPT_NAME, True),
        cgi.escape(filename, True),
        page_number,
        type_
    )


def create_page_link(text, filename, page_number):
    text = cgi.escape(text)
    if page_number is None:
        return '<span style="color: gray;">{0}</span>'.format(text)
    else:
        return '<a href="{0}">{1}</a>'.format(
            create_page_url(filename, page_number, 'html'), text
        )


def get_page_count(filename):

    def parse_line(line):
        key, _, value = line.partition(':')
        return key, value.strip()

    info = dict(
        imap(parse_line, check_output([PDFINFO, filename]).splitlines())
    )
    return int(info['Pages'])


def get_page(filename, page_index):
    return check_output(
        [
            CONVERT,
            '-density', '96',
            '{0}[{1}]'.format(filename, page_index),
            'jpg:-'
        ]
    )


def send_error(message):
    print BASE_TEMPLATE.format(
        title='Error', body='<h1>Error</h1>{0}'.format(message)
    )


def send_page_html(_pdf_path, filename, page_number, page_count):
    body = PDF_PAGE_TEMPLATE.format(
        filename=cgi.escape(filename),
        page=page_number,
        page_count=page_count,
        image_url=create_page_url(filename, page_number, 'jpg'),
        prev_link=create_page_link(
            '<<', filename, page_number - 1 if page_number > 1 else None
        ),
        next_link=create_page_link(
            '>>',
            filename,
            page_number + 1 if page_number < page_count else None
        )
    )
    print BASE_TEMPLATE.format(title='PDF', body=body)


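def send_page_image(pdf_path, _filename, page_number, _page_count):
    import sys  # local import keeps this function self-contained

    image_data = get_page(pdf_path, page_number - 1)
    # The registered MIME type for JPEG is image/jpeg.
    print 'Content-type: image/jpeg'
    print 'Content-Length:', len(image_data)
    print
    # Write the raw bytes; a print statement would append a newline that
    # is not counted in Content-Length.
    sys.stdout.write(image_data)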


TYPE2SEND_FUNCTION = {
    'html': send_page_html,
    'jpg': send_page_image,
}


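def main():
    form = cgi.FieldStorage()
    # Default to an empty name so a missing parameter reports "not found".
    filename = form.getfirst('file', '')
    try:
        page_number = int(form.getfirst('page', 1))
    except ValueError:
        page_number = 1
    type_ = form.getfirst('type', 'html')
    # Unknown types fall back to the HTML view instead of raising KeyError.
    send_function = TYPE2SEND_FUNCTION.get(type_, send_page_html)

    pdf_path = os.path.abspath(os.path.join(DOC_ROOT, filename))
    # Compare against DOC_ROOT plus a separator so that a sibling directory
    # whose name merely starts with DOC_ROOT does not pass the check.
    if os.path.isfile(pdf_path) and pdf_path.startswith(DOC_ROOT + os.sep):
        page_count = get_page_count(pdf_path)
        page_number = min(max(1, page_number), page_count)
        send_function(pdf_path, filename, page_number, page_count)
    else:
        send_error(
            '<p>PDF file <em>{0!r}</em> not found.</p>'.format(
                cgi.escape(filename)
            )
        )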


main()

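libpoppler has Python bindings, so the call to the external pdfinfo program can easily be replaced with that module. It can also be used to extract more information about a page, such as the links on a PDF page, in order to create HTML image maps for them. With the libcairo Python bindings installed, it is even possible to render pages without an external process.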
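
For readers on Python 3, where itertools.imap no longer exists and check_output returns bytes, the page-count lookup from the script above can be sketched as follows. The helper name `parse_page_count` is an illustrative assumption; the only thing relied upon is the 'Pages' key that the script above also reads from the pdfinfo output.

```python
from subprocess import check_output

PDFINFO = '/usr/bin/pdfinfo'  # same location the script above assumes


def parse_page_count(pdfinfo_output):
    """Extract the page count from the 'key: value' lines pdfinfo prints."""
    info = {}
    for line in pdfinfo_output.splitlines():
        key, _, value = line.partition(':')
        info[key.strip()] = value.strip()
    return int(info['Pages'])


def get_page_count(filename):
    """Run pdfinfo on filename and return its number of pages."""
    output = check_output([PDFINFO, filename])
    return parse_page_count(output.decode('utf-8', 'replace'))
```

Keeping the parsing separate from the subprocess call makes it possible to check the parsing logic without pdfinfo installed.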

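【Comments】:

Could you suggest an alternative solution that would achieve this?
@user956424 I have added some solutions to the answer.
stackoverflow.com/users/3815611/blackjack I want to implement this with CGI and Python alone, no JS.
@user956424 I added a CGI that renders the pages on the server. Why no JavaScript? Why CGI (and not WSGI)? Where do these requirements come from?
Given the other constraints, IMHO this is not possible. You have to "allow" JavaScript and/or third-party browser plugins, and you have to accept that not every browser will display it the way you want. With the server-side rendering solution you can also pre-render and/or cache the page images on the server to reduce CPU load.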