Unboxing opencv rectangles: answers

【Question title】: Unboxing opencv rectangles
【Posted】: 2021-09-10 05:06:29
【Question】:

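I am running OCR on a bunch of pdf files. This works fine, but parts of the pdfs are blacklined. Actually, they are not really blacklined; they are "rectangles with some text inside the rectangle". This text messes up my OCR, even when I use a word list to locate the various combinations of '(10)(2e)'.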

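I am working with .jpg files, converted from pdf files that contain both machine-readable text and images (which themselves contain text). Here is an example: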

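Since the many variants of '(10)(2e)' mess up my OCR, my goal is to find all rectangles (which most likely contain '(10)(2e)') and fill them. To find the rectangles I used this great answer from nathancy: How to detect all rectangular boxes python opencv without missing anything

However, as you can see in the green rectangles above, a green rectangle sometimes overlaps part of the data I need. In this case that is "@leiden.nl" and the "@" on the second line.

I have tried many combinations of (a) other image-processing settings (erode/dilate/blur/threshold) and (b) other settings suggested in Nathancy's answer (kernel settings / number of iterations).

What is the best practice for finding the smaller rectangles?

FYI: my code for finding the rectangles is more or less the same as Nathancy's answer: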

# https://stackoverflow.com/questions/59979760/how-to-detect-all-rectangular-boxes-python-opencv-without-missing-anything
import cv2

import os
path = os.getcwd()
print(path+'/test_ocr3/_stuff_IN/')

# Load image, grayscale, adaptive threshold
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'1.png')
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1.jpg')
image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1_opt.jpg')
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1_A_erode_551.jpg')
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1_B_dilate_551.jpg')
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1_D_threshold_177255.jpg')
result = image.copy()
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
thresh = cv2.adaptiveThreshold(gray,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV,51,9)

# Fill rectangular contours
# CHECK OTHER CONTOUR SETTINGS ? TO EXCLUDE OUTER ?
# https://docs.opencv.org/master/d9/d8b/tutorial_py_contours_hierarchy.html
# https://medium.com/analytics-vidhya/opencv-findcontours-detailed-guide-692ee19eeb18
# cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cv2.findContours(thresh, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(thresh, [c], -1, (255,255,255), -1)

# Morph open
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (30,4))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=4)
# opening = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=4)

# Draw rectangles
# cnts = cv2.findContours(opening, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cv2.findContours(opening, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 3)
    # filled
    # cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), -1)
    
# cv2.imwrite(path+'/test_ocr3/_stuff_OUT/'+'1_OUT.png', image)
cv2.imwrite(path+'/test_ocr3/_stuff_OUT/'+'page_1_0_TST_OUT.jpg', image)

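【Comments】:

It is (10)(2e). You need higher-resolution data.
It is indeed (10)(2e) and not (10x2e). Thank you. Fiddling with dilate/erode and the like does not help me get rid of '(10)(2e)'... that is why I started using cv's findContours. Hence my question is how to deal with the large rectangles.
I suggest working with the pdf itself. It may contain this information.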
@ChristophRackwitz ...unfortunately, some pages have "email images" embedded in them. I do not want to lose that information. That is why I convert all pdfs to jpg.
I suggest at least using a higher resolution. 600 dpi minimum, if you want decent OCR.

Tags: python opencv image-processing contour hierarchical

【Solution 1】:

# https://stackoverflow.com/questions/59979760/how-to-detect-all-rectangular-boxes-python-opencv-without-missing-anything
import cv2
import os

path = os.getcwd()
print(path + '/test_ocr3/_stuff_IN/')

# Load image, grayscale, adaptive threshold
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'1.png')
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1.jpg')
image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1_opt.jpg')
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1_A_erode_551.jpg')
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1_B_dilate_551.jpg')
# image = cv2.imread(path+'/test_ocr3/_stuff_OUT/'+'page_1_D_threshold_177255.jpg')
result = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 51, 9)

# Fill rectangular contours
# CHECK OTHER CONTOUR SETTINGS ? TO EXCLUDE OUTER ?
# https://docs.opencv.org/master/d9/d8b/tutorial_py_contours_hierarchy.html
# https://medium.com/analytics-vidhya/opencv-findcontours-detailed-guide-692ee19eeb18
# cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cv2.findContours(thresh, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
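for c in cnts:
    # Fill the contour with white ...
    cv2.drawContours(thresh, [c], -1, (255, 255, 255), -1)
    # ... then redraw its outline as a 1 px black line, so that
    # boxes that touch each other are no longer connected
    cv2.drawContours(thresh, [c], -1, (0, 0, 0), 1)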

# Morph open
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (7, 4))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=4)
# opening = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=4)

# Draw rectangles
# cnts = cv2.findContours(opening, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cv2.findContours(opening, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    x, y, w, h = cv2.boundingRect(c)
    cv2.rectangle(image, (x, y), (x + w, y + h), (36, 255, 12), 3)
    # filled
    # cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), -1)

# cv2.imwrite(path+'/test_ocr3/_stuff_OUT/'+'1_OUT.png', image)
cv2.imwrite(path+'/test_ocr3/_stuff_OUT/'+'page_1_0_TST_OUT.jpg', image)

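Modified binary image: because I do not have a higher-resolution image, I modified the image. I erased the large box by hand and sharpened the edges to 1 px. (If this image does not match your original, please upload a higher-resolution version and correct me.)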

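The key is cv2.drawContours(thresh, [c], -1, (0, 0, 0), 1). It splits the one large box (the one you want to get rid of) into small boxes. Without it, the connected regions are recognized as a single large box, and filling that box removes information you did not intend to remove.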

Image 2 compares the large box from your question with the result of my answer. Image 3 shows the output of my answer.

【Comments】:

Thanks @mason-ji-ming ... I only adjusted the cv2.MORPH_RECT kernel to a value of around (3,20), to include as many squares and as little text as possible.