【Question title】: Tesseract Function to Split Into 2 Columns
【Posted】: 2021-03-01 01:34:17
【Question】:

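I want to use PyTesseract and OpenCV to read (hundreds of) pages of information like the following into JSON or CSV. How can I let Tesseract know that the solid line in the middle separates two columns of information? Also, some rows of data span 2 lines rather than 1. What is the best way to account for this?

I'm very new to Tesseract, so any help would be appreciated!

Edit:

This is what I have now: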

# OCR
txt = pytesseract.image_to_string(thr, config="--psm 11")

# Add ocr to the corresponding part
txt = txt.split("\n")

row = 0
col = 0

for txt1 in txt:

    # Skip over OCR strings that are just spaces or ''
    if txt1.isspace() or txt1 == '':
        continue

    # Hard code detection in...let's just place it into the last column for now
    # Theoretically, the state ("Alaska" in this case) will be in column 0 in the same row
    if re.match(r"\d*\sOpen\sRestaurants", txt1):
        col = 3
        
    worksheet.write(row//4, col%4, txt1)
    col += 1
    row += 1

workbook.close()

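Everything above this code block is unchanged.

However, there is still a lot of misalignment, especially when an address or name spans more than one line. Also, why is the text in the first row read in a different order from the remaining rows?

I was thinking that maybe I could require every fourth txt to be in alphabetical order and use that to detect misalignment. But even then the first row would be wrong, and I'm not sure how much hard-coding I want to do for corrections. In addition, the multi-line entries sometimes come from the address column and other times from the name column (e.g., 258 Interstate Commercial Park Loop on the left side of the page).

Here are some screenshots of the confusion on the left side:

And on the right: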

【Question comments】:

Tags: ocr tesseract python-tesseract

【Answer 1】:

I want to use PyTesseract and OpenCV to read (hundreds of) pages of information like the following into JSON or CSV.

You have several options: xlsxwriter, pandas, etc. For example, you can look at the tutorial for xlsxwriter.
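
Since the question mentions JSON or CSV, the standard library alone can also do the job. A minimal sketch, assuming the OCR lines have already been turned into dictionaries; the field names and the helpers save_csv / save_json are illustrative and not part of the original answer:

```python
import csv
import json

# Hypothetical field names for one restaurant row.
FIELDS = ["city", "address", "name", "phone"]


def save_csv(records, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)


def save_json(records, path):
    """Write the same list of dicts as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


# One row taken from the OCR output in this answer, split by hand.
records = [
    {
        "city": "Troy",
        "address": "1420 US 231 South",
        "name": "Dehua Patel",
        "phone": "(334) 670-6390",
    }
]
```

Calling save_csv(records, "result.csv") or save_json(records, "result.json") then produces the file; the file names are placeholders.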

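How can I let Tesseract know that the solid line in the middle separates two columns of information?

You can't. You need to split the image manually into two parts along its width: a first part (the left half) and a second part (the right half).

How do I split the image by width manually?

First get the image size, then set the indices.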

# Get the size
(h, w) = img.shape[:2]

# First part
first_part = img[0:h, 0:int(w/2)]

# Second part
second_part = img[0:h, int(w/2):w]
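
Splitting at exactly w/2 assumes the divider sits in the middle of the page. If it is slightly off-center, one option is to look for the darkest pixel column near the middle. The sketch below is not from the original answer: find_divider_x is a hypothetical helper, it assumes a grayscale image with a dark line on a light background, and the search band and darkness threshold are guesses you would need to tune.

```python
def find_divider_x(gray, band=0.2, dark=128):
    """Return the x index of the column with the most dark pixels,
    searching only a band around the horizontal center.

    gray: 2-D grayscale image (NumPy array or list of rows), 0 = black.
    band: fraction of the image width, centered on the middle, to search.
    dark: pixel values below this are counted as dark.
    """
    h = len(gray)
    w = len(gray[0])
    half = max(1, int(w * band / 2))
    lo = max(0, w // 2 - half)
    hi = min(w, w // 2 + half + 1)

    best_x, best_count = w // 2, -1
    for x in range(lo, hi):
        # Count dark pixels in this column; a full-height line scores highest.
        count = sum(1 for y in range(h) if gray[y][x] < dark)
        if count > best_count:
            best_x, best_count = x, count
    return best_x
```

With x = find_divider_x(gry), the two parts would be img[0:h, 0:x] and img[0:h, x:w] instead of the w/2 split.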

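Also, some rows of data span 2 lines rather than 1. What is the best way to account for this?

Tesseract will account for this, but you need to know the following:

The input image contains no artifacts, so at first glance image preprocessing seems unnecessary. You can still apply binarisation to make sure you get the best accuracy.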

Part-1:

Northpor. 13620 Highway 43 North Hardikkumar Patel (205) 339-1188
Northport 1836 McFaland Bld Harikkumar Patel {205} 339-1782
Northport 5550 McFarland Bld Sharmishta Patel (205) 200-7822
Odenvile 130 Council Orive Gratton Curbow (205) 629-7827
Oneonta 511 nd Ave E Govinddhai Patel (205) 625.5847

velxa 1017 Columbus Parkway Luis Cribb (934) 749-3628

alka 2300 Gateway Or Donna Cribb (234) 749-2308
pp 101 Stewart Ave ‘Utpa! Patel {334} 433-7325,
Orange Beach 25755 Perdido Beach Bhd. Patrick Shedd (251) 981-6881
Orange Beach 25814 Canal Rd Patrick Shed (251) 91-4184
Owens Crossroads 6707 Hwy 43) South Richard Hyde (256) 519-2425
Owns Gross Road 330 Sutton Road Richard Hyde (256) 518-2004
. . .
. . .
. . .

Part-2:

Talladega 244 Haynes SI tus Crisp (258} 315-0191
Talladega 608 East Batle Street Luis Cribb (256) 362-0781
Tallassee 454 Gimere Ave Donna Cribb (334) 283-2067
Tanner 5956 Hwy 31 N Mike Nadesi (256) 352-9808
Torani 1806 Pingon Valley Re Sanjayknat Patel (205) 849-0112
Theodore 5827 Hwy SOW Mukeshkumar Soparwala (251) 854-0048,
Theodore 6860 Theodore Dawes Rd Anthony Laf enier (251) 853-2010
Thamasvilie 33202 Hwy 43 Ranjeev Acharya (334) 636-0333
Thomasville 3430S Huy 43 Ranjeey Acharya (334) 636-0830
Tius 80 Tus Road Garret Gray (G34) 514-9930
Town Creek 2795 Hwy 20 Madhav Maina (256) 686-3900
Troy 1003 Highway 231 South Luis Cribb (339) 568-7944
Troy 1420 US 231 South Dehua Patel (334) 670-6390
. . .
. . .
. . .

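The images are rescaled to fit. As we can see, we get this output by treating the image as a single uniform block of text (--psm 6).

How do I write the results?

First, you need to store the OCR results in lists.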

if i == 0:
    for sentence in txt:
        part1.append(sentence)
else:
    for sentence in txt:
        part2.append(sentence)

Second, you need to pair up the elements of the two lists.

for txt1, txt2 in zip(part1, part2):
    worksheet.write(row, col, txt1)
    worksheet.write(row, col + 1, txt2)
    row += 1

The zip function lets us take one item from each list at a time as a pair. We then write the two values into the corresponding columns.
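
One caveat: zip stops at the shorter list, so if the two halves yield a different number of OCR lines, the extra lines of the longer half are silently dropped. A small sketch using itertools.zip_longest from the standard library; pair_columns is a hypothetical helper name, and the sample strings are taken from the OCR output above:

```python
from itertools import zip_longest


def pair_columns(part1, part2):
    """Pair lines from both halves, padding the shorter one with ''."""
    return list(zip_longest(part1, part2, fillvalue=""))


part1 = [
    "Northport 5550 McFarland Bld Sharmishta Patel (205) 200-7822",
    "Odenvile 130 Council Orive Gratton Curbow (205) 629-7827",
]
part2 = ["Troy 1420 US 231 South Dehua Patel (334) 670-6390"]

# The second row keeps the left-hand text and gets an empty right-hand cell.
for row, (txt1, txt2) in enumerate(pair_columns(part1, part2)):
    print(row, txt1, "|", txt2)
```

In the worksheet loop, the same pairs can be passed to worksheet.write exactly as with zip.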

Some of the data in Excel may be inaccurate. If so, you need to try different processing methods with different page segmentation modes.
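
For entries that wrap onto a second line, one possible post-processing step is to treat every line containing a phone number as the start of a record and attach any following line without one to it. This is only a sketch under assumptions not stated in the original answer: that each record's first line carries the phone number, that OCR may render the parentheses as braces, and that group_wrapped_lines and PHONE_RE are illustrative names. Header lines without a phone number (such as a state heading) would also get attached to the previous record unless they are filtered out first.

```python
import re

# Loose phone pattern: tolerates OCR noise such as "{205}" or "625.5847".
PHONE_RE = re.compile(r"[({]\w{3}[)}]\s*\d{2,3}[-.]\d{4}")


def group_wrapped_lines(lines):
    """Group OCR lines into records.

    A line containing a phone number starts a new record; a line without
    one is treated as a continuation of the record before it.
    Returns a list of lists: [first_line, continuation, ...].
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank OCR lines
        if PHONE_RE.search(line) or not records:
            records.append([line])
        else:
            records[-1].append(line)
    return records
```

The continuation pieces are kept separate on purpose, because the text alone does not say whether they belong to the address or to the name.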

Code:

# Load the libraries
import cv2
import pytesseract
import xlsxwriter

# Load the image in BGR format
img = cv2.imread("WFJO2.jpg")

# Initialize the workbook
workbook = xlsxwriter.Workbook('result.xlsx')
worksheet = workbook.add_worksheet()

row = 0
col = 0

part1 = []
part2 = []

# Get the size
(h, w) = img.shape[:2]

# Initialize indexes
increase = int(w / 2)
start = 0
end = start + increase

# For each part
for i in range(0, 2):

    # Get the current part
    cropped = img[0:h, start:end]

    # Convert to the gray-scale
    gry = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)

    # Threshold
    thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # OCR
    txt = pytesseract.image_to_string(thr, config="--psm 6")

    # Add ocr to the corresponding part
    txt = txt.split("\n")

    if i == 0:
        for sentence in txt:
            part1.append(sentence)
    else:
        for sentence in txt:
            part2.append(sentence)

    # Set indexes
    start = end
    end = start + increase

for txt1, txt2 in zip(part1, part2):
    worksheet.write(row, col, txt1)
    worksheet.write(row, col + 1, txt2)
    row += 1

workbook.close()

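【Comments】:

Is there a way to put the (implied) columns of city, address, name, and phone number into 4 different columns of the workbook? Likewise, is there a way to let pytesseract know about these implied columns? It might improve the transcription.
Also, can we distinguish the rows that mark a new state from the other rows, which each represent an individual restaurant? E.g., the image on the right has "Alaska 61 Open Restaurants".
I think you can do it. Where are you stuck?
I've updated my original question to include my modified code and some screenshots/descriptions of the problems I'm running into. Thank you very much in advance!
Any help with this would be greatly appreciated!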
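
Regarding the four implied columns and the state header rows raised in these comments, a rough regex-based split might look like the sketch below. None of it comes from the original answer: the patterns, the parse_line helper, and the assumption that the name is the last two words before the phone number are all guesses that would need checking against the real pages.

```python
import re

# "Alaska 61 Open Restaurants" style header (format taken from the comment above).
STATE_RE = re.compile(r"^(?P<state>.+?)\s+(?P<count>\d+)\s+Open\s+Restaurants\b")
# Loose phone pattern that tolerates OCR noise such as "{205}" or "625.5847".
PHONE_RE = re.compile(r"[({]\w{3}[)}]\s*\d{2,3}[-.]\d{4}")


def parse_line(line):
    """Classify one OCR line as a state header, a restaurant row, or unknown."""
    line = line.strip()
    m = STATE_RE.match(line)
    if m:
        return {"type": "state", "state": m.group("state"),
                "count": int(m.group("count"))}

    phone = PHONE_RE.search(line)
    if not phone:
        return {"type": "unknown", "text": line}

    before = line[:phone.start()].strip()
    # City = everything before the first digit (the street number).
    digit = re.search(r"\d", before)
    if not digit:
        return {"type": "unknown", "text": line}
    city = before[:digit.start()].strip()
    rest = before[digit.start():].split()

    # ASSUMPTION: the name is the last two words; the rest is the address.
    address, name = " ".join(rest[:-2]), " ".join(rest[-2:])
    return {"type": "row", "city": city, "address": address,
            "name": name, "phone": phone.group()}
```

Each "row" result maps directly onto four worksheet columns, and a "state" result can be remembered and written next to the rows that follow it. Names with more or fewer than two words will be split wrongly, so the "unknown" and odd-looking rows are worth reviewing by hand.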