iTextSharp can't extract PDF Unicode content in C#: answers

【Question title】: iTextSharp can't extract PDF Unicode content in C#
【Posted】: 2016-02-16 15:15:24
【Question】:

I am trying to read the content of a PDF file with iTextSharp, as shown here:

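using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class Program
{
    static void Main(string[] args)
    {
        StringBuilder text = new StringBuilder();
        using (PdfReader reader = new PdfReader(@"D:\a.pdf"))
        {
            // iTextSharp page numbers start at 1, not 0.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
            }
        }
        // Write the extracted text of all pages to a single file.
        System.IO.File.WriteAllText(@"c:/a.txt", text.ToString());
        Console.ReadLine();
    }
}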

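My PDF content is written in Persian, and running the code above produces this result:

But this is not the correct result. Is there an option I should set in iTextSharp?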

【Comments】:

Since you haven't shown the PDF you are extracting from, it is hard to say anything.

Tags: c# pdf unicode itextsharp persian

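【Solution 1】:

It's hard to say without the original file, but if your characters or words come out in the wrong positions, you should try using LocationTextExtractionStrategy like this: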

text.Append(PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy()));
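
For context, here is a minimal sketch of how that call might fit into the question's loop. It assumes iTextSharp 5.x (the `iTextSharp.text.pdf` and `iTextSharp.text.pdf.parser` namespaces). The class name `PdfTextDump`, the helper `ExtractText`, and the output path are hypothetical names chosen for illustration. The sketch creates a new strategy for each page, since the strategy object collects the text chunks it is given, and writes the result explicitly as UTF-8. Whether this fixes the Persian output depends on the fonts and encoding inside the PDF, which the question does not show.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class PdfTextDump
{
    // Hypothetical helper: extracts the text of every page with a
    // location-based strategy, which orders chunks by their position on the page.
    public static string ExtractText(string pdfPath)
    {
        var text = new StringBuilder();
        using (var reader = new PdfReader(pdfPath))
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // New strategy per page so chunks from earlier pages are not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        return text.ToString();
    }

    static void Main()
    {
        // Input path taken from the question; output path is an assumption.
        string content = ExtractText(@"D:\a.pdf");
        File.WriteAllText(@"D:\a.txt", content, Encoding.UTF8);
    }
}
```

Opening the output in an editor that is set to UTF-8 helps separate a display problem from an extraction problem: if the characters are Persian but in the wrong order, the issue is text ordering rather than encoding.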
