Automatic language detection from images for OCR character extraction: answers

【Question title】: Automatic language detection from images for OCR character extraction
【Posted】: 2017-12-07 06:20:36
【Question description】:

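I am building software in Python to which users upload images. The software will use Tesseract OCR to extract the text.

However, I want the software to detect the language in the image automatically and then extract the text in that language.

Please suggest some ways to do this. I am also open to using machine learning, but I have not been able to work out a suitable pipeline for the process.

Thanks in advance.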

【Question discussion】:

Tags: python-3.x opencv image-processing tensorflow python-tesseract

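【Solution 1】:

The process is somewhat involved. What you need to do is:

1. Extract text from the image with lang=eng.
2. Pass that text to langdetect, a Python port of Google's language-detection library.
3. Run tesseract again with the detected language to extract the text accurately.

Or

You can use a switch case over each language and pass the sample text to langdetect to get the probability of each language being the correct one.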
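The switch-case alternative could be sketched roughly as follows: run OCR once per candidate language and keep the language whose output langdetect is most confident about. This is only an illustration, not the answerer's code. The `TESS_TO_ISO` table and the helper names `pick_best` and `score_languages` are assumptions made here, and every listed language must have its Tesseract traineddata installed.

```python
# Hypothetical table: Tesseract language code -> ISO 639-1 code reported by
# langdetect. Extend it with the languages you actually have installed.
TESS_TO_ISO = {"eng": "en", "fra": "fr", "deu": "de", "spa": "es"}


def pick_best(scores):
    """Return the language code with the highest score, or None if empty."""
    return max(scores, key=scores.get) if scores else None


def score_languages(image_path, candidates=tuple(TESS_TO_ISO)):
    """OCR the image once per candidate language and score each result."""
    # Imported lazily so pick_best() is usable without Tesseract installed.
    import pytesseract
    from PIL import Image
    from langdetect import detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    image = Image.open(image_path)
    scores = {}
    for tess_code in candidates:
        text = pytesseract.image_to_string(image, lang=tess_code)
        try:
            detected = {d.lang: d.prob for d in detect_langs(text)}
        except LangDetectException:
            detected = {}  # raised when the text has no usable features
        # Probability that the OCR output really is in the requested language.
        scores[tess_code] = detected.get(TESS_TO_ISO.get(tess_code), 0.0)
    return scores
```

A call such as `pick_best(score_languages('image.jpg'))` would then return the most plausible Tesseract language code. The cost grows with the number of candidates, because each one is a full OCR pass.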

from PIL import Image
import pytesseract
from langdetect import detect_langs

# Include the line below only if the tesseract executable is not on your PATH.
# Example tesseract_cmd: 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'
pytesseract.pytesseract.tesseract_cmd = '<full_path_to_your_tesseract_executable>'

print(pytesseract.image_to_string(Image.open('test.png')))
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='eng'))

# First pass: OCR with English to obtain sample text for language detection
sample_text = pytesseract.image_to_string(Image.open('image.jpg'), lang='eng')

# Candidate languages with their probabilities
print(detect_langs(sample_text))
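The snippet above stops after detection, so here is a minimal sketch of the remaining step: feeding the detected language back into Tesseract. langdetect reports ISO 639-1 codes such as `fr`, while Tesseract expects its own codes such as `fra`, so a translation table is needed. The `ISO_TO_TESS` table and the function names are assumptions for illustration, and the table covers only a few languages.

```python
# Assumed mapping from langdetect codes to Tesseract language codes.
ISO_TO_TESS = {
    "en": "eng", "fr": "fra", "de": "deu", "es": "spa", "it": "ita",
    "ru": "rus", "ja": "jpn", "ko": "kor",
    "zh-cn": "chi_sim", "zh-tw": "chi_tra",
}


def to_tesseract_lang(iso_code, default="eng"):
    """Translate a langdetect code to a Tesseract code, or fall back."""
    return ISO_TO_TESS.get(iso_code, default)


def ocr_with_detected_language(image_path, first_pass_lang="eng"):
    """Return (tesseract_lang, text) using a detect-then-re-OCR pipeline."""
    # Imported lazily so to_tesseract_lang() works without Tesseract installed.
    import pytesseract
    from PIL import Image
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    image = Image.open(image_path)
    # Step 1: rough first pass with a default language.
    sample_text = pytesseract.image_to_string(image, lang=first_pass_lang)
    # Step 2: guess the language of the rough text.
    try:
        iso_code = detect(sample_text)
    except LangDetectException:
        return first_pass_lang, sample_text  # nothing usable to detect from
    # Step 3: re-run OCR only if a different language was detected.
    tess_lang = to_tesseract_lang(iso_code, default=first_pass_lang)
    if tess_lang == first_pass_lang:
        return tess_lang, sample_text
    return tess_lang, pytesseract.image_to_string(image, lang=tess_lang)
```

An English first pass is likely to give usable sample text only for Latin-script languages. For other scripts the first pass tends to produce noise, and the detection step may then be unreliable.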

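【Discussion】:

The switch case is better, because my images will be in different languages.
Actually I can't, because I am a new user and Stack Exchange does not allow me to do that.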

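【Solution 2】:

Tesseract has script detection in its "OSD" mode, but no language detection. You cannot detect the language automatically; you have to specify it.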
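For reference, script detection through OSD can be reached from Python with `pytesseract.image_to_osd`, which returns a plain-text report. The sketch below pulls the script name out of that report. The helper names are made up for this example, the report layout is assumed to contain a `Script: <name>` line, and OSD needs `osd.traineddata` to be installed.

```python
import re


def parse_script(osd_text):
    """Extract the script name (e.g. 'Latin') from an OSD text report."""
    # Matches the 'Script: Latin' line but not 'Script confidence: ...'.
    match = re.search(r"^Script:\s*(\S+)", osd_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def detect_script(image_path):
    """Run Tesseract's OSD on an image and return the detected script."""
    # Imported lazily so parse_script() works without Tesseract installed.
    import pytesseract
    from PIL import Image

    return parse_script(pytesseract.image_to_osd(Image.open(image_path)))
```

The script only narrows the choice (Latin covers English, French, German and many others), so a language still has to be picked within that script, for example with langdetect as described in Solution 1.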

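【Discussion】:

If the language cannot be detected automatically, is there another workflow for tesseract that uses machine learning to detect the language in the image and then passes the detected language back to tesseract for OCR extraction?
@CyborgSuraj Look into Tika, in particular TikaOCR: cwiki.apache.org/confluence/display/TIKA/TikaOCR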