Google Document AI gives different outputs for the same file

【Question title】: Google Document AI giving different outputs for the same file
【Posted】: 2021-10-16 13:34:05
【Question description】:

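I am using the Document OCR API to extract text from a PDF file, but part of the result is inaccurate. I found that the cause may be the presence of some Chinese characters.

Below is a made-up example: I cropped part of the region where the extracted text is wrong and added some Chinese characters to reproduce the problem.

When I use the website version, the Chinese characters are not returned, but the rest of the characters are correct.

When I use Python to extract the text, I get the Chinese characters correctly, but some of the remaining characters are wrong.

The actual string I get.

Are the versions of Document AI behind the website and the API different? How can I get all the characters correctly?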

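Update:

When I print detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines the detected_languages of both lines are empty lists, so I first had to switch to page.blocks or page.paragraphs.)

Code: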

from google.cloud import documentai_v1beta3 as documentai

project_id= 'secret-medium-xxxxxx'
location = 'us' # Format is 'us' or 'eu'
processor_id = 'abcdefg123456' #  Create processor in Cloud Console

opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}
client = documentai.DocumentProcessorServiceClient(client_options=opts)

def get_text(doc_element: documentai.Document.Page.Layout, document: documentai.Document) -> str:
    """
    Document AI identifies layout elements by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # An unset start_index reads as 0, so no special case is needed
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):

    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    # opts = {}
    # if location == "eu":
    #     opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    result = client.process_document(request=request)
    document = result.document

    document_pages = document.pages

    response_text = []
    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    print("The document contains the following blocks:")
    for page in document_pages:
        # page.lines returned empty detected_languages, so iterate over blocks
        for block in page.blocks:
            block_text = get_text(block.layout, document)
            confidence = block.layout.confidence
            # Store the text without its trailing newline, plus the confidence
            response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", block.detected_languages)
    return response_text

if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))

The language codes look wrong. Does that affect the result?
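For readers who want to see what the offset lookup in get_text does without calling the API, here is a minimal offline sketch. The dataclasses `TextSegment`, `TextAnchor` and `Layout` and the function `anchor_to_text` are simplified stand-ins invented for illustration, not the real Document AI types, and the sample text is made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextSegment:
    # Hypothetical stand-in for a text segment: a [start, end) slice of the full text
    start_index: int = 0
    end_index: int = 0


@dataclass
class TextAnchor:
    # Hypothetical stand-in: one layout element may point at several segments
    text_segments: List[TextSegment] = field(default_factory=list)


@dataclass
class Layout:
    text_anchor: TextAnchor


def anchor_to_text(layout: Layout, full_text: str) -> str:
    """Concatenate the slices of full_text that the layout's segments point at."""
    return "".join(
        full_text[seg.start_index:seg.end_index]
        for seg in layout.text_anchor.text_segments
    )


# Made-up document text mixing Latin and Chinese characters
full_text = "Invoice 发票\nTotal: 42\n"

# One element covering the whole first line, newline included
first_line = Layout(TextAnchor([TextSegment(0, 11)]))
# One element whose text is stored in two separate segments
split_block = Layout(TextAnchor([TextSegment(0, 7), TextSegment(11, 17)]))

print(repr(anchor_to_text(first_line, full_text)))
print(repr(anchor_to_text(split_block, full_text)))
```

The point of the sketch is only that every block, paragraph or line is a set of offsets into one shared text string, so a wrong character in that string shows up in every element that points at it.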

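【Question comments】:

You should embed the images in the question itself so that it is a complete question. External links break after a while.
Could you provide more details about your scenario, since you can use both Document AI OCR and Vision OCR to get text from a PDF? How many PDF files do you want to process, and how many pages do they have? Can you share your Python code and all of your steps?
@PjoterS I am just using the code here to get the text. The other details should not help improve the OCR accuracy.
I changed paragraphs = page.paragraphs to lines = page.lines
Thanks for the code. I also got different outputs from your code and from the demo, even though both use v1beta3, which is odd. It might be related to different endpoints, language or alphabet recognition, or something random. Is there a reason you are using DAI OCR? Have you tried the Vision API with DOCUMENT_TEXT_DETECTION or TEXT_DETECTION, as mentioned in Detect text in files (PDF/TIFF)? If you have to use DAI OCR, you can file a report in the Issue Tracker so that Google engineers can verify it.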

Tags: python google-cloud-platform ocr google-api-python-client cloud-document-ai

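【Solution 1】:

Posting this as a Community Wiki for better visibility.

One of the features of Document AI is OCR (Optical Character Recognition), which recognizes text in various kinds of files.

In this scenario, the OP gets different outputs from the Try it function and from the Python client library.

Why is there a discrepancy between Try it and the Python library? It is hard to say, since both methods use the same API, documentai_v1beta3. It may be related to some modification of the file when the PDF is uploaded to the Try it demo, to different endpoints, to language or alphabet recognition, or to something random.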
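To pin down exactly which characters differ between two outputs, a character-level diff can help when filing a report. The sketch below uses Python's standard `difflib`; the function name `diff_ocr_outputs` and both sample strings are hypothetical and are not the OP's actual output.

```python
import difflib
from typing import List, Tuple


def diff_ocr_outputs(a: str, b: str) -> List[Tuple[str, str, str]]:
    """Return (operation, part_of_a, part_of_b) for each span where a and b differ."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical outputs: the first drops the Chinese characters,
# the second keeps them but misreads a digit as a letter.
web_output = "No. A10"
api_output = "No. 编号 A1O"

for tag, web_part, api_part in diff_ocr_outputs(web_output, api_output):
    print(f"{tag}: web={web_part!r} api={api_part!r}")
```

Attaching a diff like this to an Issue Tracker report makes it clear that the two paths disagree in different places, rather than one being uniformly worse.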

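When you use the Python client, you also get a confidence score for the text recognition. Below is an example from my tests:

However, the OP's confidence is only about 0.73, so wrong results are to be expected, and in this case the problem is clearly visible. I guess it cannot be improved with code anyway. Perhaps it could be with a PDF of different quality (in the OP's example there are some dots that may affect recognition).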
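Code cannot raise the confidence, but it can at least flag the doubtful blocks for manual review. Here is a minimal sketch that works on a list of (text, confidence) tuples like the one get_lines_of_text in the question returns. The function name `flag_low_confidence`, the 0.9 threshold and the sample values are assumptions chosen for illustration, not recommended settings.

```python
from typing import List, Tuple


def flag_low_confidence(
    blocks: List[Tuple[str, float]], threshold: float = 0.9
) -> List[Tuple[str, float]]:
    """Return the (text, confidence) pairs whose confidence is below threshold."""
    return [(text, conf) for text, conf in blocks if conf < threshold]


# Made-up results in the same shape as the question's response_text list
blocks = [("Invoice 发票", 0.98), ("T0tal: 4Z", 0.73)]

for text, conf in flag_low_confidence(blocks):
    print(f"Review needed ({conf:.2f}): {text}")
```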

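【Comments】:

Hello @iter07, welcome to StackOverflow! Please remember to react to answers for your questions. That way we know whether the answer was helpful, and other community members can benefit from it too. Try to accept the answer that is the final solution to your question, upvote answers that are helpful, and comment on answers that could be improved or need extra attention. Enjoy your stay!
Can you explain what a community wiki is? I don't know what that is...
Hi @Wytrzymały Wiktor, thanks for the welcome. I am just looking for a solution that improves the accuracy. I have already filed an issue with Google but have not received any reply.
A community wiki is a post that the community can maintain with less effort and that earns its author no reputation. In short, it is used when there is no solution to a problem, but the post offers some possible root causes or information that could help other community members with similar issues. Other users can edit it, so it may be changed once something is fixed in the future. More details can be found here
@PjoterS OK, now I understand. Thanks for the community wiki.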