How to get the header and footer from a PDF file using Apache Tika in Java

【Question title】: How to get Header and Footer from PDF file using apache tika in java
【Posted】: 2014-02-20 00:48:04
【Question description】:

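I am using Apache Tika to extract content from PDF files. The extracted content (text) also includes the headers and footers. My requirement is to get the text without the headers and footers. Below is my sample code for extracting the content. Sample code: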

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class test {

    public static void main(String[] args) throws Exception {

        File file1 = new File("C:/Sample.pdf");
        Metadata metadata = new Metadata();
        // Allow up to 10 * 1024 * 1024 characters of extracted text
        BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
        AutoDetectParser parser = new AutoDetectParser();

        // try-with-resources closes the input stream even if parsing fails
        try (InputStream input = new FileInputStream(file1)) {
            parser.parse(input, handler, metadata);
        }

        // The extracted text still contains each page's header and footer
        String content = handler.toString();
        File file2 = new File("C:/AUG7th", file1.getName() + ".txt");
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file2))) {
            bw.write(content);
        }
    }
}

Please suggest how I can do this. Thanks.

【Question comments】:

Tags: java pdfbox apache-tika

【Solution 1】:

I have not found a way to parse a PDF's header or footer with Tika. You need another API to do this, such as PDFTextStream.

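Edit: OK. Tika will (try to) extract the raw text and metadata from the PDF.
You need to parse and analyze the raw text in order to remove the headers and footers. I suggest using PDFTextStream instead of Tika, because it simplifies implementing an algorithm for this purpose. When you parse a PDF with PDFTextStream, you can extract TextUnits, which are not simple characters but also "carry" other information. You can also select text regions and, in addition, choose to preserve the visual layout of each page.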
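
The "parse and analyze the raw text" step can be approached with a simple heuristic: headers and footers tend to repeat at the top and bottom of most pages. The following is a minimal sketch of that idea in plain Java; it is not part of Tika or PDFTextStream. It assumes the text has already been split into one string per page. The class name `HeaderFooterStripper`, its methods, and the `minShare` threshold are hypothetical names chosen for illustration. Digits are masked before comparison so that changing page numbers still match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Heuristic sketch (hypothetical helper): drop edge lines that repeat across most pages. */
final class HeaderFooterStripper {

    /** Mask digit runs so that "Page 3" and "Page 12" compare as equal. */
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    /** Index of the first (fromTop) or last non-blank line, or -1 if the page is blank. */
    static int edgeIndex(List<String> lines, boolean fromTop) {
        for (int k = 0; k < lines.size(); k++) {
            int i = fromTop ? k : lines.size() - 1 - k;
            if (!lines.get(i).trim().isEmpty()) {
                return i;
            }
        }
        return -1;
    }

    /** Remove each page's edge line when its masked form occurs on at least minShare of the pages. */
    static void stripEdge(List<List<String>> pages, boolean fromTop, double minShare) {
        if (pages.size() < 2) {
            return; // repetition cannot be judged from a single page
        }
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> lines : pages) {
            int i = edgeIndex(lines, fromTop);
            if (i >= 0) {
                counts.merge(normalize(lines.get(i)), 1, Integer::sum);
            }
        }
        for (List<String> lines : pages) {
            int i = edgeIndex(lines, fromTop);
            if (i >= 0 && counts.get(normalize(lines.get(i))) >= minShare * pages.size()) {
                lines.remove(i);
            }
        }
    }

    /** Returns the page texts with one repeated top line and one repeated bottom line removed. */
    static List<String> strip(List<String> pageTexts, double minShare) {
        List<List<String>> pages = new ArrayList<>();
        for (String text : pageTexts) {
            pages.add(new ArrayList<>(Arrays.asList(text.split("\\R"))));
        }
        stripEdge(pages, true, minShare);
        stripEdge(pages, false, minShare);
        List<String> result = new ArrayList<>();
        for (List<String> lines : pages) {
            result.add(String.join("\n", lines).trim());
        }
        return result;
    }
}
```

Given three pages that each start with the same journal title and end with a page number, `HeaderFooterStripper.strip(pages, 0.6)` would return only the body lines. The heuristic looks at a single line per edge, so multi-line headers would need repeated passes, and a body line that happens to recur at the top of most pages would be removed as well.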

@Gagravarr XHTML output for a PDF:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="dcterms:modified" content="2012-11-21T16:08:42Z"/>
<meta name="meta:creation-date" content="2010-06-22T07:00:09Z"/>
<meta name="meta:save-date" content="2012-11-21T16:08:42Z"/>
<meta name="Content-Length" content="702419"/>
<meta name="Last-Modified" content="2012-11-21T16:08:42Z"/>
<meta name="dcterms:created" content="2010-06-22T07:00:09Z"/>
<meta name="date" content="2012-11-21T16:08:42Z"/>
<meta name="modified" content="2012-11-21T16:08:42Z"/>
<meta name="xmpTPg:NPages" content="20"/>
<meta name="Creation-Date" content="2010-06-22T07:00:09Z"/>
<meta name="created" content="Tue Jun 22 09:00:09 CEST 2010"/>
<meta name="producer" content="Atypon Systems, Inc."/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="xmp:CreatorTool" content="PDFplus"/>
<meta name="resourceName" content="Lessons from a High-Impact Observatory The Hubble Space Telescope.pdf"/>
<meta name="Last-Save-Date" content="2012-11-21T16:08:42Z"/>
<meta name="dc:title" content="Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008"/>
<title>Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008</title>
</head>
<body><div class="page"><p/>
<p>Lessons from a High-Impact Observatory: The Hubble Space Telescope’s Science Productivity
between 1998 and 2008
Author(s): Dániel Apai, Jill Lagerstrom, Iain Neill Reid, Karen L. Levay, Elizabeth Fraser,
Antonella Nota, and Edwin Henneken
Reviewed work(s):
Source: Publications of the Astronomical Society of the Pacific, Vol. 122, No. 893 (July 2010),
pp. 808-826
Published by: The University of Chicago Press on behalf of the Astronomical Society of the Pacific
Stable URL: http://www.jstor.org/stable/10.1086/654851 .
Accessed: 21/11/2012 11:08
</p>
<p>Your use of the JSTOR archive indicates your acceptance of the Terms &amp; Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp
</p>
<p> .
</p>
<p>JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.
</p>................</body>

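In the head, Tika gives us the metadata it found; in the body, it gives us the text split into paragraphs (which also looks a bit clumsy), and it can also give us annotation links. So I don't think it is very helpful.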

【Comments】:

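Doesn't Tika mark up the header and footer as distinct regions in the HTML? If so, can't you have your ContentHandler exclude those bits?
I think you are talking about parsing HTML pages, whereas the question is about parsing PDFs. (Not sure I understood correctly.)
Tika will convert your PDF to XHTML. I was wondering whether you could process the XHTML output from Tika to exclude the headers and footers, which IIRC are marked as such in the HTML.
Yes, Tika can return your PDF as XHTML, but have you seen the output? I don't think it helps.
@Gagravarr See the output example above.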
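
The XHTML output shown in the answer does not label headers or footers, but it does wrap each page in `<div class="page">`. One way to follow up on the idea of post-processing that XHTML is to collect the text of each page separately, so that the first and last lines of every page can be inspected. The sketch below is only an illustration under stated assumptions: `PageTextHandler` and `fromXhtml` are hypothetical names, and the code relies solely on the JDK's SAX classes. Since the class extends `DefaultHandler`, an instance could presumably be passed to `parser.parse(input, handler, metadata)` in place of the `BodyContentHandler` from the question, though that has not been verified against any particular Tika version.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects the text of each <div class="page"> into its own string. */
final class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a page div
    private int divDepth;          // nesting depth of <div> elements within the current page

    /** Use the local name when namespaces are processed, otherwise the qualified name. */
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(nameOf(localName, qName))) {
            return;
        }
        if (current == null && "page".equals(atts.getValue("class"))) {
            current = new StringBuilder(); // a new page begins
            divDepth = 1;
        } else if (current != null) {
            divDepth++; // nested div inside a page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        String name = nameOf(localName, qName);
        if ("p".equals(name)) {
            current.append('\n'); // keep paragraph boundaries as line breaks
        }
        if ("div".equals(name) && --divDepth == 0) {
            pages.add(current.toString().trim()); // the page div has closed
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    List<String> getPages() {
        return pages;
    }

    /** Convenience for trying the handler on an XHTML string without Tika. */
    static List<String> fromXhtml(String xhtml) {
        try {
            PageTextHandler handler = new PageTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getPages();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

Once the text is available page by page, lines that recur at the top or bottom of most pages can be treated as header or footer candidates and dropped. Whether this is reliable depends on how cleanly the PDF's text is extracted, which is the concern raised in the comments.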