The text is not recognized from png using Tesseract

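【Question title】: The text is not recognized from png using Tesseract
【Posted】: 2020-04-06 07:13:46
【Question】:

I have to extract data from a PDF that is published at a URL. The PDF is a scanned image (image/.png format), so I am using the tesseract package, but a few lines are not recognized.

Code: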

library(rvest)
library(dplyr)
library(stringr)  # provides str_replace(), which the pipeline below needs
library(pdftools)
library(tesseract)

url <- "https://www.hindustancopper.com/Page/PriceCircular"
links <- url %>%
  # read the HTML of the page
  read_html() %>%
  # select the first link in the table and take its href attribute
  html_nodes("#viewTable li:nth-child(1) a") %>% html_attr("href") %>%
  # strip the leading relative-path prefix (fixed() matches "../.." literally)
  str_replace(fixed("../.."), "")
str(links)

#using pdftools to read the pdf
base_url <- 'https://www.hindustancopper.com'
# combine the base url with the event url
event_url <- paste0(base_url, links)
event_url

# the link points to a scanned copy rather than a text PDF, so render page 1
# to an image and run it through tesseract
pdf_convert(event_url,
            pages = 1,
            dpi = 850,
            filenames = "page1.png")
# what does the data look like
text <- ocr("page1.png")
cat(text)

The actual output, which should list the products and their prices, is:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567 
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.

The expected output is:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc

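I have tried changing the dpi value several times, but it did not help much. Thanks in advance!

【Comments】:

Have you tried a different PSM?
The PSM is already built into this function. I don't think any of the functions used offer an option for declaring the psm. See: rdrr.io/github/hansthompson/pdfHarvester/src/R/Tesseract.R
You need to be able to try another page segmentation mode, because it may capture the regions that the current PSM misses. I don't understand why it is fixed at -psm 7, which treats the image as a single text line; that works poorly for a multi-line text image. github.com/tesseract-ocr/tesseract/blob/master/doc/…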
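
As a rough illustration of the suggestion in the comments, the sketch below passes a page segmentation mode to the tesseract R package used in the question (not the pdfHarvester function linked above). It assumes the installed package accepts the Tesseract control parameter tessedit_pageseg_mode through the options argument of tesseract(); the helper name ocr_with_psm and the default of 6 (a single uniform block of text) are illustrative choices, not something taken from the thread.

```r
# Sketch: run OCR with an explicit page segmentation mode (PSM).
# Assumption: the tesseract R package forwards "tessedit_pageseg_mode"
# to the engine when it is supplied through `options`.
ocr_with_psm <- function(image_path, psm = 6, language = "eng") {
  engine <- tesseract::tesseract(
    language = language,
    options = list(tessedit_pageseg_mode = psm)
  )
  tesseract::ocr(image_path, engine = engine)
}

# Usage: only runs when the package and the image from the question exist.
if (requireNamespace("tesseract", quietly = TRUE) && file.exists("page1.png")) {
  cat(ocr_with_psm("page1.png", psm = 6))
}
```

Trying a few values (for example 4, 6 and 11) on the same image is a cheap way to see whether a missing row is a segmentation problem rather than a resolution problem.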

Tags: image-processing ocr tesseract pdftools propensity-score-matching

【Solution 1】:

I am using Ubuntu 18.04 and tesseract 5.0.0-alpha-647-g4a00 to run the commands below.

I downloaded a sample PDF referenced by your code.

https://www.hindustancopper.com/Upload/Reports/0-637189269505122500-AnnualReport.pdf

I then converted it to png with this command:

pdftoppm 0-637189269505122500-AnnualReport.pdf report.png -png

Then, using GIMP, I rotated the document so that the text is horizontal.

Then I used this tesseract command to convert the document to text:

tesseract report.png stdout -l eng --oem 3 --psm 6 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:.-/ "

The result is as follows:

HINDUSTAN COPPER LIMITED
A GOVT. OF INDIA ENTERPRISE
kK
Registered Head Office
Tamra Bhavan
1 Ashutosh Chowdhury Avenue
Kolkata - 700019
Ref: HCL/HO/MKTG/Cu-P/ 2019-2020
Date : 02-MAR-20
Sub: Basic Price of Cathodes and CC Rods for the month of MAR 2020.
The Basic Price of Copper Cathodes and CC Copper Rods for the month of MAR 2020 are as follows:
Basic Price Ex-Works /
Ex.Godown basis Rs. / MT
CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056
COPPER CATHODE CUT 437856
CONTINUOUS CAST COPPER WIRE ROD 8 MM 440078
CONTINUOUS CAST COPPER WIRE ROD 19.6 MM 444546
CONTINUOUS CAST COPPER WIRE ROD 12.5 MM 441567
Note : Monthly LME CSP Avg. : 5686.45 Monthly Avg. Exchange Rate : 71.59
The price ruling on the date of delivery will be applicable. irrespective of the date of making financial arrangements i.e.
advance payment/opening of letter of credit. GST other statutory levies will be extra as applicable.
For purchase against usance Letter of Credit the interest rate chargeable shall be 10 per annum for the credit
period up to 90/60/30 days.
Customers may note that the price and interest rate is subject to change without prior notice. The price and interest rate
ruling on the date of delivery will be applicable irrespective of the date of their making financial arrangements. All bank
charges of negotiating bank will be borne by us.
ADD YAS
Zl Bl rTeri68
S Parashar
DGM Commercial
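
Once the OCR text looks like the result above, the price rows can be pulled out of it. The following is a minimal base R sketch, not part of the original answer: the helper name parse_price_rows and the assumed row shape (an upper-case product name followed by a 5 to 7 digit price at the end of the line) are assumptions based only on the sample output shown here.

```r
# Sketch: extract "PRODUCT NAME <price>" rows from OCR text.
# Assumption: a price row is an upper-case name (letters, digits, spaces,
# dots) followed by one space and a 5-7 digit number at the end of the line.
parse_price_rows <- function(text) {
  lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
  pattern <- "^([A-Z][A-Z0-9 .]*[A-Z]) ([0-9]{5,7})$"
  hits <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  hits <- hits[lengths(hits) == 3]  # keep only lines that matched fully
  data.frame(
    product = vapply(hits, function(m) m[2], character(1)),
    price = as.integer(vapply(hits, function(m) m[3], character(1))),
    stringsAsFactors = FALSE
  )
}

# Usage with two rows copied from the result above, plus a non-price line.
sample_text <- paste(
  "Ex.Godown basis Rs. / MT",
  "CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567",
  "CATHODE FULL 434122",
  sep = "\n"
)
print(parse_price_rows(sample_text))
```

Header and address lines are skipped because they contain lower-case letters or do not end in a standalone number.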

【Comments】:

Thank you for the reply. I used another package, magick, to rotate and read the image, and it worked. Thanks again.
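
The commenter did not post the magick code, so the following is only a guess at what such a pipeline might look like. It assumes page1.png is the image written by pdf_convert() in the question; the helper name straighten_and_ocr, the rotation angle and the deskew threshold are placeholders that would need to be chosen by looking at the actual scan.

```r
# Sketch: straighten a scanned page with magick, then OCR it.
# Assumptions: the angle and threshold below are placeholders, and
# magick::image_ocr() is available (it relies on the tesseract package).
straighten_and_ocr <- function(image_path, degrees = 0, deskew_threshold = 40) {
  img <- magick::image_read(image_path)
  img <- magick::image_rotate(img, degrees)                       # coarse rotation
  img <- magick::image_deskew(img, threshold = deskew_threshold)  # fine skew correction
  magick::image_ocr(img, language = "eng")
}

# Usage: only runs when both packages and the image from the question exist.
if (requireNamespace("magick", quietly = TRUE) &&
    requireNamespace("tesseract", quietly = TRUE) &&
    file.exists("page1.png")) {
  cat(straighten_and_ocr("page1.png", degrees = 0))
}
```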