tesseract not picking up characters on right side of page

【Question title】: tesseract not picking up characters on right side of page
【Posted】: 2020-06-05 20:43:03
【Question description】:

When looping through the pages of a PDF, tesseract recognizes the characters on one page correctly, for example:

Table 1 Summary Data                    3
Table 2 Unique  Data                    5

But on another page

Table 3  Reservoir Data                 8
Table 4  Surface Data                   9

it drops the last number, so the output looks like

Table 3  Reservoir Data                
Table 4  Surface Data

The digits 8 and 9 are not recognized. I checked the images created by pdf2image with

pages = convert_from_path(pdf_path, 500)

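and the rightmost text is present in the page image.

However, the data frame (df) in the code below contains none of the rightmost data for the page in question, nor any characters that look like a recognition attempt. The PDF pages and images are of the same quality, and the rightmost characters sit at the same horizontal position on both pages.

Here is the code I am using: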

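    import pandas as pd
    import pytesseract
    from pdf2image import convert_from_path
    from pytesseract import Output

    # pdfs is assumed to be a list of PDF file paths defined elsewhere
    custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
    for pdf_path in pdfs:
        # Render every page of the PDF at 500 dpi
        pages = convert_from_path(pdf_path, 500)

        for pageNum, imgBlob in enumerate(pages):
            # Only the problem page (index 6) is inspected
            if pageNum == 6:
                d = pytesseract.image_to_data(imgBlob, config=custom_config, output_type=Output.DICT)
                df = pd.DataFrame(d)

                print(pageNum)
                print(df)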

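I wondered whether there is a horizontal limit or margin beyond which tesseract cannot read, so I changed the dpi to 400 (I am assuming the 500 is the dpi). Googling terms such as clipping, margin, or skipping did not turn up anything relevant.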
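One way to confirm that nothing was recognized near the right edge is to filter the `image_to_data` output by its `left` coordinate. The following is a minimal sketch, not part of the original question: the helper name `words_right_of` and the 80% cut-off used in the usage note are assumptions chosen for illustration.

```python
def words_right_of(data, x_min):
    """Return (left, top, text) for each non-empty word starting at or beyond x_min.

    `data` is the dict returned by pytesseract.image_to_data(..., output_type=Output.DICT),
    which holds parallel lists under keys such as "left", "top", and "text".
    """
    return [
        (left, top, text)
        for left, top, text in zip(data["left"], data["top"], data["text"])
        if text.strip() and left >= x_min
    ]
```

With the variables from the question, `words_right_of(d, int(imgBlob.width * 0.8))` would list any words that start in the right-hand fifth of the page. An empty list suggests that the region was never segmented as text, rather than being recognized badly.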

【Question discussion】:

Tags: python ocr tesseract python-tesseract

【Solution 1】:

Check whether a different page segmentation mode gives better results:

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 6 -l eng+ita'

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
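To see which mode suits a given page, the same image can be run through several modes and the number of recognized words compared. This is a rough sketch under stated assumptions: the helper names (`build_config`, `count_words`, `compare_psm`) and the default list of modes are hypothetical, and a higher word count is only a hint that more of the page was picked up, not proof of better accuracy.

```python
def build_config(psm):
    """Return the config string from the question with the given --psm value."""
    return f"-c preserve_interword_spaces=1 --oem 1 --psm {psm} -l eng+ita"


def count_words(data):
    """Count the non-empty entries in the "text" list of an image_to_data dict."""
    return sum(1 for text in data["text"] if str(text).strip())


def compare_psm(image, modes=(1, 6, 11, 12)):
    """Run OCR once per mode and return {psm: number of words found}."""
    # Imported here so the two helpers above can be used without tesseract installed
    import pytesseract
    from pytesseract import Output

    results = {}
    for psm in modes:
        data = pytesseract.image_to_data(
            image, config=build_config(psm), output_type=Output.DICT
        )
        results[psm] = count_words(data)
    return results
```

Calling `compare_psm(imgBlob)` on the problem page would return one word count per mode, which makes it easy to spot a mode that skips the right-hand column.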

【Discussion】:

【Solution 2】:

I ran into the same problem with tesseract4, and @K41F4rs's solution worked for me with a page segmentation mode of 12 (sparse text with OSD).

【Discussion】:

【Solution 3】:

The problem is the page segmentation mode. --psm 3 cannot detect sparse characters in the image. Use psm 6, 11, or 12.
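The sparse-text modes (11 and 12) are described as finding text "in no particular order", so the page number may no longer appear next to its table title in the output. One possible way to pair them up again is to group the `image_to_data` words by vertical position. This is only a sketch: the function name `group_into_lines` and the 10-pixel `tolerance` are assumptions, and the tolerance would need tuning to the image resolution.

```python
def group_into_lines(data, tolerance=10):
    """Rebuild text lines from an image_to_data dict using pixel positions.

    Words whose "top" value lies within `tolerance` pixels of the first word
    of the current line are treated as part of that line.
    """
    words = [
        (top, left, text.strip())
        for top, left, text in zip(data["top"], data["left"], data["text"])
        if text.strip()
    ]
    words.sort()  # top to bottom, then left to right

    lines = []  # each entry: [reference_top, [(left, text), ...]]
    for top, left, text in words:
        if lines and abs(top - lines[-1][0]) <= tolerance:
            lines[-1][1].append((left, text))
        else:
            lines.append([top, [(left, text)]])

    # Order the words of each line from left to right and join them
    return [" ".join(text for _, text in sorted(items)) for _, items in lines]
```

For the page in the question, each returned string would hold one table title followed by whatever was recognized at the right edge of the same row, provided the chosen mode detected it at all.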

【Discussion】: