Training Tesseract for a new font

【Question title】: Training Tesseract for a new font
【Posted】: 2015-01-09 12:04:10
【Question】:

When I create the clustering data with

mftraining -F font_properties -U unicharset -O lan.unicharset *.tr

I get the following output:

C:\Users\ \AppData\Local\Tesseract-OCR>mftraining -F font_properties -U unicharset -O eng1.unicharset eng.lucidaconsole.box.tr

Warning: No shape table file present: shapetable
Failed to load unicharset from file unicharset
Building unicharset for training from scratch...
Failed to load unicharset from file unicharset
Building unicharset for boosting from scratch...
Failed to load unicharset from file unicharset
Building unicharset for boosting from scratch...
Failed to load unicharset from file unicharset
Building unicharset for boosting from scratch...
Reading eng.lucidaconsole.box.tr ...

Flat shape table summary: Number of shapes = 0 max unichars = 0 number with multiple unichars = 0

Done!

It rebuilt the unicharset I had already created and gave me a 1 KB file containing only this:

1
NULL 0 NULL 0

At this point I don't know what to do. I am a first-time user of this program, but this doesn't look right to me.
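To check whether an output file like the one above holds any real characters, the short Python sketch below may help. It assumes the usual unicharset layout: an entry count on the first line, then one entry per line with the character as the first field. The names `summarize_unicharset` and `looks_empty` are hypothetical helpers, not part of Tesseract.

```python
def summarize_unicharset(text):
    """Return (declared_count, unichars) for the text of a unicharset file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty unicharset")
    # Assumption: the first line is the number of entries that follow.
    declared = int(lines[0])
    # Assumption: the first whitespace-separated field of an entry is the character.
    unichars = [ln.split()[0] for ln in lines[1:]]
    return declared, unichars


def looks_empty(text):
    """True when the file lists nothing except the NULL placeholder entry."""
    _, unichars = summarize_unicharset(text)
    return all(u == "NULL" for u in unichars)


# The exact content quoted in the question.
sample = "1\nNULL 0 NULL 0\n"
print(summarize_unicharset(sample))
print(looks_empty(sample))
```

For the quoted file, the only entry is the `NULL` placeholder, which fits the `Number of shapes = 0` line in the output: no character data was picked up from the `.tr` file.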

【Comments】:

I cleaned up your question for you. Please try to make your posts look good, and welcome to StackOverflow.

Tags: tesseract

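【Answer 1】:

It looks like you need to cluster the character features of the training pages, as described here.

I believe the basic command for this is something like: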

shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...

This appears to be something that was added in version 3.02.
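As a rough sketch of where `shapeclustering` fits, the Python snippet below builds the argument lists for the clustering steps without running anything. The step order (`unicharset_extractor`, `shapeclustering`, `mftraining`, `cntraining`) is an assumption based on the Tesseract 3.0x training instructions as I recall them. Only `shapeclustering` and `mftraining` are named in this thread. `training_commands` and `expand` are hypothetical helpers, and the file names are placeholders.

```python
import glob


def expand(pattern):
    """Expand a wildcard in Python so the result does not depend on the shell."""
    return sorted(glob.glob(pattern))


def training_commands(lang, tr_files, box_files):
    """Build argv lists for the clustering steps; nothing is executed here."""
    if not tr_files:
        raise ValueError("no .tr files given; run the box.train step first")
    return [
        # Assumed first step: build the unicharset from the box files.
        ["unicharset_extractor", *box_files],
        # The step suggested in this answer.
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", *tr_files],
        # The command from the question, with explicit .tr file names.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", *tr_files],
        # Assumed final clustering step.
        ["cntraining", *tr_files],
    ]


# Placeholder file names; in a real run one might pass expand("*.tr") instead.
for argv in training_commands("lang",
                              ["lang.fontname.exp0.tr"],
                              ["lang.fontname.exp0.box"]):
    print(" ".join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` from the directory that holds `font_properties` and `unicharset`. The Windows command prompt does not expand `*.tr` itself, so passing explicit file names avoids relying on the tool to do it.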

【Comments】:

Do you know where the linked page moved to? I can't find a good match. Thanks.
Unfortunately not. The Google Code exodus has taken its toll.

【Answer 2】:

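If you are on Windows, I think this tool can help you simplify the training process. I had a lot of trouble learning how to train Tesseract before I started using it. Just download the latest version and read the user manual, and you will be able to train Tesseract without touching the keyboard!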

【Comments】: