【Title】: Azure Databricks - Cannot export results from Databricks to blob
【Posted】: 2022-01-21 15:08:30
【Question】:

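I want to export my data from Databricks to Azure blob storage. My Databricks commands select some PDFs from my blob container, run Form Recognizer on them, and export the output to my blob container.

Here is my code: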

    %pip install azure.storage.blob
    %pip install azure.ai.formrecognizer
    
  
    from azure.storage.blob import ContainerClient
    
    container_url = "https://mystorageaccount.blob.core.windows.net/pdf-raw"
    container = ContainerClient.from_container_url(container_url)
    
    for blob in container.list_blobs():
        blob_url = container_url + "/" + blob.name
        print(blob_url)


    import requests
    from azure.ai.formrecognizer import FormRecognizerClient
    from azure.core.credentials import AzureKeyCredential
    
    endpoint = "https://myendpoint.cognitiveservices.azure.com/"
    key = "mykeynumber"
    
    form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))

   
    import pandas as pd
    
    field_list = ["InvoiceDate","InvoiceID","Items","VendorName"]
    df = pd.DataFrame(columns=field_list)
    
    for blob in container.list_blobs():
        blob_url = container_url + "/" + blob.name
        poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
        invoices = poller.result()
        print("Scanning " + blob.name + "...")
    
        for idx, invoice in enumerate(invoices):
            single_df = pd.DataFrame(columns=field_list)
            
            for field in field_list:
                entry = invoice.fields.get(field)
                
                if entry:
                    single_df[field] = [entry.value]
                    
                single_df['FileName'] = blob.name
                df = df.append(single_df)
                
    df = df.reset_index(drop=True)
    df
    

    account_name = "mystorageaccount"
    account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
    
    try:
        dbutils.fs.mount(
            source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
            mount_point = "/mnt/pdf-recognized",
            extra_configs = {account_key: dbutils.secrets.get(scope ="formrec", key="formreckey")} )
        
    except:
        print('Directory already mounted or error')
    
    df.to_csv(r"/dbfs/mnt/pdf-recognized/output.csv", index=False)

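The code runs fine until the last line. I get the following error message: Directory already mounted or error. FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/pdf-recognized/output.csv'.

I tried using /dbfs:/ instead of /dbfs/, but I don't know what I'm doing wrong.

How can I export my Databricks results to the blob?

Thanks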

【Question comments】:

    Tags: azure export blob databricks azure-databricks


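    【Solution 1】:

    It looks like you are trying to mount storage that is already mounted. A mount operation should be performed only once, not dynamically on each run. You have several options to implement it correctly:

    • Unmount with dbutils.fs.unmount("/mnt/pdf-recognized") before mounting

    • Check whether the storage is already mounted, and run mount only if it is not. Something like this (not tested):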

    mounts = [mount for mount in dbutils.fs.mounts() 
          if mount.mountPoint == "/mnt/pdf-recognized"]
    if len(mounts) == 0:
      dbutils.fs.mount(....)
    
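    • You don't actually need a mount. It has the "bad" property that anyone in the workspace can use it with the permissions that were used for mounting. It may be simpler to write the result to the local disk and then copy the file to the required location using dbutils.fs.cp with the wasbs protocol. Like this: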
    df.to_csv(r"/tmp/my-output.csv", index=False)
    spark.conf.set(account_key, dbutils.secrets.get(scope ="formrec", key="formreckey"))
    dbutils.fs.cp("file:///tmp/my-output.csv",
       "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net/output.csv")
    
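The first two options above can be combined into one small helper. This is only a sketch: the function name `ensure_mounted` and its `remount` flag are made up for illustration, and `dbutils` is passed in explicitly so the function can be exercised outside a notebook. It relies on `dbutils.fs.mounts()`, `dbutils.fs.mount()` and `dbutils.fs.unmount()` behaving as in the snippets above, and it deliberately has no bare `except`, so a real mount failure is reported instead of being printed as "Directory already mounted or error".

```python
def ensure_mounted(dbutils, source, mount_point, extra_configs, remount=False):
    """Mount `source` at `mount_point` unless it is already mounted.

    Hypothetical helper (not part of dbutils). Returns True if a mount
    was performed, False if an existing mount was left in place.
    """
    # dbutils.fs.mounts() lists current mounts; each entry has a mountPoint.
    already_mounted = any(
        m.mountPoint == mount_point for m in dbutils.fs.mounts()
    )
    if already_mounted and not remount:
        return False
    if already_mounted:
        # Option 1 from the answer: unmount before mounting again.
        dbutils.fs.unmount(mount_point)
    # Any exception raised here propagates, so the real cause stays visible.
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    return True
```

In a notebook this would be called with the values from the question, for example `ensure_mounted(dbutils, "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net", "/mnt/pdf-recognized", {account_key: dbutils.secrets.get(scope="formrec", key="formreckey")})`.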

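    【Comments】:

    • Thanks for the quick answer, Alex. I tried the third approach with the wasbs protocol. For the line spark.conf.set..., I keep getting the following error message: IllegalArgumentException: Secret does not exist with scope: formrec and key: formreckey. Do you know what could cause this error?
    • You simply don't have a secret with that key. I copied it from your code.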
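One way to confirm the missing-secret diagnosis before calling `dbutils.secrets.get` is to list what the workspace actually contains. The sketch below is an assumption-laden illustration: `secret_exists` is a hypothetical helper name, `dbutils` is passed in explicitly, and it presumes the caller has permission to list the scope via `dbutils.secrets.listScopes()` and `dbutils.secrets.list(scope)`.

```python
def secret_exists(dbutils, scope, key):
    """Return True if `key` is listed in the secret scope `scope`.

    Hypothetical helper (not part of dbutils). It only inspects names;
    it never reads the secret value itself.
    """
    # listScopes() returns objects with a `name` attribute.
    scope_names = {s.name for s in dbutils.secrets.listScopes()}
    if scope not in scope_names:
        return False
    # list(scope) returns metadata objects with a `key` attribute.
    return any(meta.key == key for meta in dbutils.secrets.list(scope))
```

With the names from the question, `secret_exists(dbutils, "formrec", "formreckey")` returning False would match the IllegalArgumentException above, meaning the scope or key still has to be created.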