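- I want to use PyTesseract and OpenCV to read (hundreds of) pages of information like the one below into JSON or CSV.
You have several options, such as xlsxwriter or pandas. For example, you can look at the tutorial for xlsxwriter.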
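Since the question asks for JSON or CSV, here is a minimal sketch that uses only the Python standard library instead of xlsxwriter. The helper names `rows_to_csv` and `rows_to_json` and the column names are hypothetical, and the two sample strings are lines copied from the OCR output shown further down.

```python
import csv
import io
import json


def rows_to_csv(rows, header):
    """Serialise rows (a list of equal-length sequences) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_json(rows, header):
    """Serialise rows to a JSON array of objects keyed by the header names."""
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, indent=2)


# Hypothetical column names; one OCR line from each half of the page
header = ["left_column", "right_column"]
rows = [
    (
        "Oneonta 511 nd Ave E Govinddhai Patel (205) 625.5847",
        "Troy 1420 US 231 South Dehua Patel (334) 670-6390",
    )
]

csv_text = rows_to_csv(rows, header)
json_text = rows_to_json(rows, header)
```

Writing to a file instead of a string only requires passing an open file object to `csv.writer`, or using `json.dump`.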
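- How do I make tesseract aware that the solid line in the middle separates two columns of information?
You can't. You need to split the image into two parts by width manually, giving a first part and a second part.
How do you split the image by width manually?
First get the size of the image, then set the indexes.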
# Get the size
(h, w) = img.shape[:2]
# First part
first_part = img[0:h, 0:int(w/2)]
# Second part
second_part = img[0:h, int(w/2):w]
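The same idea can be generalised to any number of columns. The sketch below is an assumption-laden illustration, not part of the original answer: `column_bounds` is a hypothetical helper that only computes the slice indexes, and it lets the last part absorb the remainder so that no pixel column is lost when the width is odd.

```python
def column_bounds(width, parts=2):
    """Return (start, end) pixel ranges that split `width` into `parts` columns.

    The last range always ends at `width`, so no pixel column is dropped
    when the width is not evenly divisible.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    step = width // parts
    bounds = []
    for i in range(parts):
        start = i * step
        end = width if i == parts - 1 else start + step
        bounds.append((start, end))
    return bounds


# With an OpenCV image `img` (a NumPy array), each part would be:
#   h, w = img.shape[:2]
#   crops = [img[0:h, start:end] for start, end in column_bounds(w)]
```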
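- Also, some rows of data span 2 lines instead of 1. What is the best way to account for this?
Tesseract will account for this, but you need to be aware of the following:
The input image contains no artifacts, so at first glance image preprocessing seems unnecessary. You can still apply binarisation to get the best possible accuracy.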
Part 1 (OCR output):
Northpor. 13620 Highway 43 North Hardikkumar Patel (205) 339-1188 Northport 1836 McFaland Bld Harikkumar Patel {205} 339-1782 Northport 5550 McFarland Bld Sharmishta Patel (205) 200-7822 Odenvile 130 Council Orive Gratton Curbow (205) 629-7827 Oneonta 511 nd Ave E Govinddhai Patel (205) 625.5847
velxa 1017 Columbus Parkway Luis Cribb (934) 749-3628
alka 2300 Gateway Or Donna Cribb (234) 749-2308 pp 101 Stewart Ave ‘Utpa! Patel {334} 433-7325, Orange Beach 25755 Perdido Beach Bhd. Patrick Shedd (251) 981-6881 Orange Beach 25814 Canal Rd Patrick Shed (251) 91-4184 Owens Crossroads 6707 Hwy 43) South Richard Hyde (256) 519-2425 Owns Gross Road 330 Sutton Road Richard Hyde (256) 518-2004 . . . . . . . . .

Part 2 (OCR output):
Talladega 244 Haynes SI tus Crisp (258} 315-0191 Talladega 608 East Batle Street Luis Cribb (256) 362-0781 Tallassee 454 Gimere Ave Donna Cribb (334) 283-2067 Tanner 5956 Hwy 31 N Mike Nadesi (256) 352-9808 Torani 1806 Pingon Valley Re Sanjayknat Patel (205) 849-0112 Theodore 5827 Hwy SOW Mukeshkumar Soparwala (251) 854-0048, Theodore 6860 Theodore Dawes Rd Anthony Laf enier (251) 853-2010 Thamasvilie 33202 Hwy 43 Ranjeev Acharya (334) 636-0333 Thomasville 3430S Huy 43 Ranjeey Acharya (334) 636-0830 Tius 80 Tus Road Garret Gray (G34) 514-9930 Town Creek 2795 Hwy 20 Madhav Maina (256) 686-3900 Troy 1003 Highway 231 South Luis Cribb (339) 568-7944 Troy 1420 US 231 South Dehua Patel (334) 670-6390 . . . . . . . . .
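The images are rescaled to fit. As we can see, we get the output by assuming each image is a single uniform block of text (page segmentation mode 6).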
How do you write the results?
- First, store the OCR results in lists.
- Second, pair up the items of the two lists:

for txt1, txt2 in zip(part1, part2):
    worksheet.write(row, col, txt1)
    worksheet.write(row, col + 1, txt2)
    row += 1
The zip function gives us one item from each list at a time, as a pair. We then write the two values into adjacent columns of the same row.
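One thing to keep in mind is that `zip` stops at the shorter input, so if the two halves produce a different number of lines, the extra lines of the longer half are silently dropped. A possible workaround, sketched here with a hypothetical helper `pair_columns`, is `itertools.zip_longest` with an empty string as padding. Dropping blank lines first is an assumption that suits this page, where the left and right halves are independent columns.

```python
from itertools import zip_longest


def pair_columns(part1, part2):
    """Pair lines from two columns, dropping blank lines and padding with ""."""
    left = [line.strip() for line in part1 if line.strip()]
    right = [line.strip() for line in part2 if line.strip()]
    return list(zip_longest(left, right, fillvalue=""))


# The left list has three non-blank lines, the right list only two
pairs = pair_columns(["a", "", "b", "c"], ["x", "y"])
```

The returned pairs can be written with the same `worksheet.write` loop as above.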
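Some of the data in the Excel file may be inaccurate. If that is the case, you need to try different preprocessing methods combined with different page segmentation modes.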
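A rough way to compare page segmentation modes is to run the OCR once per mode and inspect the results side by side. In the sketch below, `ocr_with_modes` is a hypothetical helper and the default modes (4, 6, 11) are only a starting guess. The OCR function is passed in as a parameter, so it works with any callable that accepts `config=`, such as `pytesseract.image_to_string`.

```python
def ocr_with_modes(image, ocr, modes=(4, 6, 11)):
    """Run `ocr` once per page segmentation mode and collect non-empty lines.

    `ocr` is any callable with the shape ocr(image, config=...) -> str,
    for example pytesseract.image_to_string.
    """
    results = {}
    for psm in modes:
        text = ocr(image, config=f"--psm {psm}")
        results[psm] = [line for line in text.split("\n") if line.strip()]
    return results


# Usage with pytesseract (assumes `thr` is a thresholded image):
#   import pytesseract
#   results = ocr_with_modes(thr, pytesseract.image_to_string)
#   for psm, lines in results.items():
#       print(psm, len(lines))
```

Comparing the line counts and a few known entries per mode is usually enough to see which mode fits the page layout best.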
Code:
# Load the libraries
import cv2
import pytesseract
import xlsxwriter

# Load the image in BGR format
img = cv2.imread("WFJO2.jpg")

# Initialize the workbook
workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()
row = 0
col = 0
part1 = []
part2 = []

# Get the size
(h, w) = img.shape[:2]

# Initialize indexes
increase = int(w / 2)
start = 0
end = start + increase

# For each part
for i in range(0, 2):
    # Get the current part
    cropped = img[0:h, start:end]

    # Convert to the gray-scale
    gry = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)

    # Threshold
    thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # OCR
    txt = pytesseract.image_to_string(thr, config="--psm 6")

    # Add ocr to the corresponding part
    txt = txt.split("\n")
    if i == 0:
        part1.extend(txt)
    else:
        part2.extend(txt)

    # Set indexes
    start = end
    end = start + increase

# Write the paired lines side by side (zip stops at the shorter list)
for txt1, txt2 in zip(part1, part2):
    worksheet.write(row, col, txt1)
    worksheet.write(row, col + 1, txt2)
    row += 1

workbook.close()