Tweak tesseract for better detection of URLs in image

【Question title】: Tweak tesseract for better detection of URLs in image
【Posted】: 2016-05-30 20:51:34
【Question】:

I have images that I cannot get tesseract to recognize correctly as text. All of my input text consists of URLs.

As you can see, the image is about as clear as it can get.

Running tesseract test2.png stdout returns http:II11111111111111111111111111111111111 1111111111111111111.coml, which is close but not correct.

When I set the tessedit_char_whitelist parameter to htp:/1.com, the string is recognized correctly (but I would like URL recognition that works more generally).
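A broader whitelist is one way to generalize the htp:/1.com experiment. The sketch below is only an illustration: the URL_CHARS set and the build_tesseract_command helper are hypothetical names, and it assumes the URLs are lowercase, so the uppercase I that gets confused with / is simply not allowed. It passes the parameter through tesseract's -c configvar=value option.

```python
import shlex
import string

# Assumption: URLs in the images are lowercase, so uppercase letters
# (including the "I" that is misread in place of "/") are left out.
URL_CHARS = string.ascii_lowercase + string.digits + ":/.-_~?#&=%+"


def build_tesseract_command(image_path, whitelist=URL_CHARS):
    """Return the argv list for a tesseract run restricted to `whitelist`."""
    return [
        "tesseract",
        image_path,
        "stdout",
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


# Print a shell-ready command line; the whitelist is quoted because it
# contains characters such as "&" and "?" that the shell would interpret.
print(shlex.join(build_tesseract_command("test2.png")))
```

The returned list can be handed to subprocess.run as-is. If the URLs may contain uppercase letters, this approach no longer rules out I, and the confusion with / has to be addressed some other way.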

I also passed in a patterns file with the command line tesseract test2.png stdout --user-patterns ./patterns.txt, with the following contents:

\n\*://\n\*
http://\n\*
\n\*.com

This does not help recognition; tesseract still prefers I over /. (More details on the pattern file.)

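I also tried setting the parameter ok_repeated_ch_non_alphanum_wds to include / (and chs_trailing_punct{1,2} for the trailing /), but it does not seem to have any effect. Specifying --user-words does not help either (using http:// as the "word").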

Is there a way to specify character priority for tesseract?

Version info:

$ tesseract -v
tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

【Comments】:

Tags: ocr tesseract

【Solution 1】:

You can achieve this by adding the following line to your unicharambigs file:

3 : I I 3 : / / 1

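Extract the unicharambigs file with combine_tessdata -e eng.traineddata eng.unicharambigs
Edit the unicharambigs file, e.g. with nano eng.unicharambigs (make sure to use tabs after both 3s and after the second /).
Overwrite the unicharambigs file inside the traineddata file with the edited version using combine_tessdata -o eng.traineddata eng.unicharambigs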
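The three steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not a verified tool: format_ambig_rule, append_rule and combine_tessdata_commands are hypothetical helper names, and the layout (a tab between every field, spaces between the characters inside a field) is an assumption based on the note about tabs. The trailing 1 is copied from the rule shown in the answer.

```python
def format_ambig_rule(wrong, right, rule_type=1):
    """Build one unicharambigs line that rewrites `wrong` as `right`.

    `wrong` and `right` are lists of characters. Fields are joined with
    tabs and characters inside a field with spaces (assumed layout).
    `rule_type` defaults to 1, the value used in the answer's rule.
    """
    fields = [
        str(len(wrong)), " ".join(wrong),
        str(len(right)), " ".join(right),
        str(rule_type),
    ]
    return "\t".join(fields)


def append_rule(ambigs_path, rule):
    """Append `rule` on its own line to the extracted unicharambigs file."""
    with open(ambigs_path, encoding="utf-8") as handle:
        content = handle.read()
    with open(ambigs_path, "a", encoding="utf-8") as handle:
        # Make sure the new rule does not get glued to the previous line.
        if content and not content.endswith("\n"):
            handle.write("\n")
        handle.write(rule + "\n")


def combine_tessdata_commands(traineddata, ambigs_path):
    """Return the extract (-e) and overwrite (-o) commands as argv lists."""
    extract = ["combine_tessdata", "-e", traineddata, ambigs_path]
    overwrite = ["combine_tessdata", "-o", traineddata, ambigs_path]
    return extract, overwrite
```

A possible flow is to run the extract command with subprocess.run, call append_rule("eng.unicharambigs", format_ambig_rule([":", "I", "I"], [":", "/", "/"])), and then run the overwrite command. Keeping a backup copy of eng.traineddata first seems prudent, since the -o step modifies it in place.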

Output using the modified traineddata file:

$ tesseract test2.png stdout
http://11111111111111111111111111111111111
1111111111111111111.coml

【Comments】:

I added the line 4 c o m l 4 c o m / 1 to handle the trailing / being read as l, but your idea worked.