【Posted】: 2021-10-16 13:34:05
【Question】:
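I am using the Document OCR API to extract text from a PDF file, but part of the output is inaccurate. I suspect the cause is the presence of certain Chinese characters.
Below is a made-up example: I cropped part of the region where the extracted text was wrong and added some Chinese characters to reproduce the problem.
When I use the website version, the Chinese characters are not recognized, but the remaining characters are correct.
When I extract the text with Python, the Chinese characters are recognized correctly, but some of the remaining characters are wrong.
The actual string I got.
Do the website and the API use different versions of Document AI? How can I get all of the characters correctly?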
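One way to narrow down where the two results diverge is to diff the text from the website against the text from the API, character by character. The sketch below uses only Python's standard difflib module; diff_ocr_outputs is a hypothetical helper name, and the two sample strings are made up for illustration, not real Document AI output.

```python
import difflib


def diff_ocr_outputs(reference: str, candidate: str):
    """Return (tag, reference_span, candidate_span) for every region that differs."""
    # autojunk=False keeps frequent characters from being ignored in long texts.
    matcher = difflib.SequenceMatcher(None, reference, candidate, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, reference[i1:i2], candidate[j1:j2]))
    return differences


if __name__ == "__main__":
    # Hypothetical strings, NOT real OCR output.
    website_text = "Invoice No. 12345 total"
    api_text = "Invoice No. 12845 合计 total"
    for tag, ref_span, cand_span in diff_ocr_outputs(website_text, api_text):
        print(f"{tag}: {ref_span!r} -> {cand_span!r}")
```

Listing only the non-equal regions makes it easier to see whether the mismatches cluster around the Chinese characters or are spread across the line.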
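Update:
When I print detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines both lines have an empty detected_languages list, so I first had to switch to page.blocks or page.paragraphs.)
Code: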
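To see at which layout level the language information is actually populated, the levels can be compared side by side. The following is a minimal sketch, assuming the page object exposes blocks, paragraphs, lines and tokens, each element carrying a detected_languages list whose entries have a language_code, as in the Document AI Document.Page message. summarize_detected_languages is a hypothetical helper, and the page in the demo is a hand-built stand-in, not a real API response.

```python
from types import SimpleNamespace

LAYOUT_LEVELS = ("blocks", "paragraphs", "lines", "tokens")


def summarize_detected_languages(page):
    """Map each layout level to the sorted language codes reported at that level."""
    summary = {}
    for level in LAYOUT_LEVELS:
        codes = set()
        # A missing level is treated as empty rather than raising.
        for element in getattr(page, level, []):
            for language in element.detected_languages:
                codes.add(language.language_code)
        summary[level] = sorted(codes)
    return summary


if __name__ == "__main__":
    # Hypothetical stand-in shaped like a Document AI page.
    zh = SimpleNamespace(language_code="zh", confidence=0.9)
    fake_page = SimpleNamespace(
        blocks=[SimpleNamespace(detected_languages=[zh])],
        paragraphs=[SimpleNamespace(detected_languages=[zh])],
        lines=[SimpleNamespace(detected_languages=[])],
        tokens=[],
    )
    print(summarize_detected_languages(fake_page))
```

With a real response, passing each element of document.pages to this helper would show whether lines are the only level with empty lists.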
from google.cloud import documentai_v1beta3 as documentai

project_id = 'secret-medium-xxxxxx'
location = 'us'  # Format is 'us' or 'eu'
processor_id = 'abcdefg123456'  # Create processor in Cloud Console

# You must set the api_endpoint if you use a location other than 'us'
opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}

client = documentai.DocumentProcessorServiceClient(client_options=opts)

def get_text(doc_element: documentai.Document.Page.Layout, document: documentai.Document) -> str:
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # An unset start_index reads as 0, so no special case is needed
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):
    # Note: the module-level client's api_endpoint must match `location`
    # (e.g. "eu-documentai.googleapis.com" for 'eu').
    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()
    raw_document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": raw_document}
    result = client.process_document(request=request)
    document = result.document

    response_text = []
    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document
    # Read the text recognition output from the processor
    print("The document contains the following blocks:")
    for page in document.pages:
        # page.lines returned empty detected_languages, so iterate over blocks
        for block in page.blocks:
            block_text = get_text(block.layout, document)
            confidence = block.layout.confidence
            # Drop a single trailing newline, if present
            response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", block.detected_languages)
    return response_text

if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))
The language codes look wrong. Could that affect the result?
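Whether a wrong language code actually changes the recognized characters is not settled in this thread. A sanity check like the one below can at least flag blocks where the reported codes disagree with the characters in the text. It is a rough sketch: both function names are hypothetical, it only looks at the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), and it treats zh, ja and ko as the codes that would plausibly accompany Han characters.

```python
def count_cjk_characters(text: str) -> int:
    """Count characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def languages_look_consistent(text: str, language_codes) -> bool:
    """Return False only when the text has Han characters but no zh/ja/ko code is reported."""
    has_han = count_cjk_characters(text) > 0
    # Compare on the primary subtag so that "zh-Hans" counts as "zh".
    reports_cjk = any(code.split("-")[0] in {"zh", "ja", "ko"} for code in language_codes)
    return (not has_han) or reports_cjk


if __name__ == "__main__":
    # Hypothetical block text and language codes, NOT real OCR output.
    print(languages_look_consistent("Total 合计 100", ["en"]))
    print(languages_look_consistent("Total 合计 100", ["en", "zh-Hans"]))
```

Applied to the (text, detected_languages) pairs printed by the code above, this would show how many blocks containing Chinese characters were labeled without a matching code.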
【Comments】:
-
You should embed the images in the question itself so that it is self-contained. External links break over time.
-
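Could you provide more details about your scenario? You can get text from a PDF with either Document AI OCR or Vision OCR. How many PDF files are you working with, and how many pages do they have? Can you share your Python code and all of your steps?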
-
@PjoterS I'm just using the code here to get the text. The other details shouldn't help improve OCR accuracy.
-
I changed paragraphs = page.paragraphs to lines = page.lines.
-
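Thanks for the code. I also get different output from your code and from the demo, even though both use v1beta3, which is strange. It may be related to different endpoints, language or alphabet detection, or something random. Is there a reason you are using DAI OCR? Have you tried the Vision API with DOCUMENT_TEXT_DETECTION or TEXT_DETECTION, as described in Detect text in files (PDF/TIFF)? If you must use DAI OCR, you can file a report on the Issue Tracker for Google engineers to verify.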
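For reference, the Vision API route suggested above might look roughly like this for a small local PDF. It is a sketch under several assumptions: the google-cloud-vision package is installed and credentials are configured, the synchronous batch_annotate_files call is used (which, per the Vision documentation, accepts only a handful of pages per request), and the language hints "zh" and "en" are a guess at what suits the document. ocr_pdf_with_vision and extract_page_texts are hypothetical helper names.

```python
def extract_page_texts(file_response):
    """Collect full_text_annotation.text from each per-page response."""
    return [page.full_text_annotation.text for page in file_response.responses]


def ocr_pdf_with_vision(pdf_path, pages=(1,), language_hints=("zh", "en")):
    """Run DOCUMENT_TEXT_DETECTION on selected pages of a local PDF."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(pdf_path, "rb") as pdf_file:
        content = pdf_file.read()

    request = vision.AnnotateFileRequest(
        input_config=vision.InputConfig(content=content, mime_type="application/pdf"),
        features=[vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)],
        # Language hints are optional; these values are an assumption.
        image_context=vision.ImageContext(language_hints=list(language_hints)),
        pages=list(pages),  # Page numbers start at 1
    )
    response = client.batch_annotate_files(requests=[request])
    # One file response per input file; exactly one file was sent.
    return extract_page_texts(response.responses[0])
```

Running the same PDF through both services and comparing the returned strings would show whether the wrong characters are specific to the Document AI processor.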
Tags: python google-cloud-platform ocr google-api-python-client cloud-document-ai