How to detect whether a font used in a PDF is bold/italic/plain

【Question title】: How a font is detected to be bold/italic/plain that is used in PDF
【Posted】: 2013-05-17 21:08:42
【Question】:

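When extracting content from a PDF with the MuPDF library, I only get the font name, not its font face.

Should I guess (for example, look for "Bold" in the font name, although that is not the right way), or is there another way to detect whether a specific font is bold/italic/plain?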

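【Comments】:

How are you extracting the information?
Using the MuPDF open source library.
Fonts carry a number of flags in addition to their name, which may or may not tell you more about the font's properties. None of these are very reliable.
@OnceUponATimeInTheWest I think you are right. But is there a Java font parser I could use?
In the end I solved it by loading the font through its FontDescriptor and then looking up its properties... Thanks, everyone.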

Tags: pdf mupdf

【Answer 1】:

I use iTextSharp to extract the font family, font color, and so on.

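using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// path, text_input_File, input_pdf and clear() are members of the enclosing class (not shown).
public void Extract_inputpdf() {
  text_input_File = string.Empty;
  StringBuilder sb_inputpdf = new StringBuilder();
  PdfReader reader_inputPdf = new PdfReader(path); // open the PDF

  // iTextSharp page numbers start at 1, not 0
  for (int i = 1; i <= reader_inputPdf.NumberOfPages; i++) {
    TextWithFont_inputPdf inputpdf = new TextWithFont_inputPdf();
    text_input_File = PdfTextExtractor.GetTextFromPage(reader_inputPdf, i, inputpdf);
    sb_inputpdf.Append(text_input_File);
  }
  input_pdf = sb_inputpdf.ToString();
  reader_inputPdf.Close();
  clear();
}

public class TextWithFont_inputPdf : ITextExtractionStrategy {
  // The original snippet used "result" without declaring it; a StringBuilder is assumed here.
  private readonly StringBuilder result = new StringBuilder();

  public void RenderText(TextRenderInfo renderInfo) {
    // PostScript name of the font used for this chunk of text
    string curFont = renderInfo.GetFont().PostscriptFontName;
    // Split the PostScript name here if you want the family and style parts separately.
    result.Append(renderInfo.GetText());
  }

  // Remaining interface members; nothing to do for font detection.
  public void BeginTextBlock() { }
  public void EndTextBlock() { }
  public void RenderImage(ImageRenderInfo renderInfo) { }

  public string GetResultantText() {
    return result.ToString();
  }
}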

【Comments】:

【Answer 2】:

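The PDF specification contains entries that allow a font's style to be specified. Unfortunately, in the real world you will often find that these are absent.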
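As a rough illustration of those entries: a font descriptor can carry /Flags (bit 7 is Italic, bit 19 is ForceBold), /FontWeight (400 is normal, 700 is bold) and /ItalicAngle. The sketch below only interprets values that some PDF library has already read for you; the class and method names are hypothetical, and treating ForceBold as "bold" is an assumption, since that flag is really a rendering hint.

```java
// Sketch only: interprets style-related FontDescriptor values.
// Names are hypothetical; reading the values from the PDF is left to your library.
final class FontDescriptorStyle {
    static final int FLAG_ITALIC = 1 << 6;      // bit 7 (1-based) of /Flags: Italic
    static final int FLAG_FORCE_BOLD = 1 << 18; // bit 19 (1-based) of /Flags: ForceBold

    /** Italic if the Italic flag is set or /ItalicAngle is non-zero. */
    static boolean looksItalic(int flags, double italicAngle) {
        return (flags & FLAG_ITALIC) != 0 || italicAngle != 0.0;
    }

    /**
     * Bold if /FontWeight is at least 700, or (as a weaker, assumed hint) ForceBold is set.
     * Pass null for fontWeight when the entry is absent.
     */
    static boolean looksBold(int flags, Integer fontWeight) {
        if (fontWeight != null && fontWeight >= 700) {
            return true;
        }
        return (flags & FLAG_FORCE_BOLD) != 0;
    }
}
```

Because these entries are frequently missing or wrong, a result of "not bold, not italic" from this check is weak evidence at best.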

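If the font is referenced rather than embedded, you are generally stuck with the font's PostScript name. It takes some heuristics, but the name normally provides sufficient clues about the style. It sounds like this is pretty much where you are.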
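A minimal sketch of such a name heuristic, assuming names like "Helvetica-BoldOblique" or "ABCDEF+TimesNewRomanPS-BoldItalicMT". The class and method names are hypothetical, and the keyword list is deliberately short: it will also match "SemiBold" and will miss abbreviated names such as "-BdIt".

```java
import java.util.Locale;

// Sketch only: guesses style from a PostScript font name.
final class FontNameStyle {
    /** Removes a subset tag such as "ABCDEF+" (six uppercase letters and a plus sign). */
    static String stripSubsetTag(String psName) {
        return psName.replaceFirst("^[A-Z]{6}\\+", "");
    }

    /** True if the name contains "bold" in any letter case. */
    static boolean nameSuggestsBold(String psName) {
        String n = stripSubsetTag(psName).toLowerCase(Locale.ROOT);
        return n.contains("bold");
    }

    /** True if the name contains "italic" or "oblique" in any letter case. */
    static boolean nameSuggestsItalic(String psName) {
        String n = stripSubsetTag(psName).toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }
}
```

Stripping the subset tag first avoids accidental matches inside the six random letters.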

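If the font is embedded, you can parse it and try to find style information in the embedded font program. If it is subsetted, this information could in theory have been removed, but in general I don't think it will be. However, parsing TrueType/OpenType fonts is tedious, and you may not feel it is worth the effort.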
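For a sense of how much parsing is involved, here is a hedged sketch that reads the macStyle field of the 'head' table from an already-decoded TrueType/OpenType font program (bit 0 is bold, bit 1 is italic). It assumes the bytes are a single sfnt font, not a Type 1 or bare CFF program and not a font collection; the class name is hypothetical. The OS/2 table's fsSelection field holds similar bits but is more likely to be missing from a subset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch only: reads head.macStyle from raw sfnt (TrueType/OpenType) bytes.
final class SfntStyle {
    static final int MAC_STYLE_BOLD = 1;        // bit 0
    static final int MAC_STYLE_ITALIC = 1 << 1; // bit 1

    /** Returns head.macStyle, or -1 if no usable 'head' table is found. */
    static int readMacStyle(byte[] font) {
        if (font.length < 12) {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(font); // big-endian by default, as sfnt requires
        int numTables = buf.getShort(4) & 0xFFFF;
        for (int i = 0; i < numTables; i++) {
            int rec = 12 + 16 * i; // each table record: tag, checksum, offset, length
            if (rec + 16 > font.length) {
                return -1;
            }
            String tag = new String(font, rec, 4, StandardCharsets.US_ASCII);
            if (tag.equals("head")) {
                long offset = buf.getInt(rec + 8) & 0xFFFFFFFFL;
                if (offset + 46 > font.length) {
                    return -1;
                }
                return buf.getShort((int) offset + 44) & 0xFFFF; // macStyle is at byte 44 of 'head'
            }
        }
        return -1;
    }

    static boolean isBold(int macStyle) {
        return macStyle >= 0 && (macStyle & MAC_STYLE_BOLD) != 0;
    }

    static boolean isItalic(int macStyle) {
        return macStyle >= 0 && (macStyle & MAC_STYLE_ITALIC) != 0;
    }
}
```

A return value of -1 means the style could not be determined this way, in which case falling back to the font name is the remaining option.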

I work on the ABCpdf .NET software component, so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

【Comments】: