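【Posted】: 2022-01-21 15:08:30
【Question】:
I want to export my data from Databricks to Azure Blob Storage. My Databricks commands select some PDFs from my blob container, run Form Recognizer on them, and export the results back to my blob storage.
Here is my code: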
%pip install azure.storage.blob
%pip install azure.ai.formrecognizer
from azure.storage.blob import ContainerClient
container_url = "https://mystorageaccount.blob.core.windows.net/pdf-raw"
container = ContainerClient.from_container_url(container_url)
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    print(blob_url)
import requests
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential
endpoint = "https://myendpoint.cognitiveservices.azure.com/"
key = "mykeynumber"
form_recognizer_client = FormRecognizerClient(endpoint, credential=AzureKeyCredential(key))
import pandas as pd
field_list = ["InvoiceDate","InvoiceID","Items","VendorName"]
df = pd.DataFrame(columns=field_list)
for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
    invoices = poller.result()
    print("Scanning " + blob.name + "...")
    for idx, invoice in enumerate(invoices):
        single_df = pd.DataFrame(columns=field_list)
        for field in field_list:
            entry = invoice.fields.get(field)
            if entry:
                single_df[field] = [entry.value]
        single_df['FileName'] = blob.name
        df = df.append(single_df)
df = df.reset_index(drop=True)
df
account_name = "mystorageaccount"
account_key = "fs.azure.account.key." + account_name + ".blob.core.windows.net"
try:
    dbutils.fs.mount(
        source = "wasbs://pdf-recognized@mystorageaccount.blob.core.windows.net",
        mount_point = "/mnt/pdf-recognized",
        extra_configs = {account_key: dbutils.secrets.get(scope = "formrec", key = "formreckey")})
except:
    print('Directory already mounted or error')
df.to_csv(r"/dbfs/mnt/pdf-recognized/output.csv", index=False)
The code runs fine until the last line, where I get the following error message:
Directory already mounted or error. FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/pdf-recognized/output.csv'.
I tried using /dbfs:/ instead of /dbfs/, but I can't tell what I am doing wrong.
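Note that the bare except prints the same message whether the container was already mounted or the mount call itself failed, so the real cause stays hidden. Below is a minimal sketch of a stricter mount step. mount_or_raise is a hypothetical helper, not part of Databricks. It assumes the object passed as fs behaves like dbutils.fs: mounts() returns entries with a mountPoint attribute, and mount() takes the same keyword arguments used in the code above.

```python
def mount_or_raise(fs, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless it is already mounted.

    `fs` is assumed to behave like `dbutils.fs` (hypothetical wrapper).
    Returns True if a new mount was created, False if one already existed.
    Any exception raised by `fs.mount` propagates to the caller, so a bad
    key or a wrong container name is reported instead of being swallowed.
    """
    # Collect the mount points that already exist.
    existing = {m.mountPoint for m in fs.mounts()}
    if mount_point in existing:
        return False

    # No try/except here on purpose: let the real error surface.
    fs.mount(source=source, mount_point=mount_point, extra_configs=extra_configs)
    return True
```

In a notebook this might be called with dbutils.fs as the first argument and the same source, mount point and extra_configs as above. If the mount fails, the original exception is shown, which usually points at the actual problem (for example the secret or the container name).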
How can I export my Databricks results to the blob?
Thanks
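For reference, one alternative that avoids the mount entirely is to upload the CSV through the same azure.storage.blob SDK already used for reading. This is only a sketch under assumptions: upload_csv_text is a hypothetical helper, and the container client passed in must be created with write permission (for example with a SAS token or account key), which the anonymous from_container_url call above does not provide.

```python
def upload_csv_text(container_client, blob_name, csv_text):
    """Upload CSV text as a blob and return the number of bytes sent.

    `container_client` is assumed to be an azure.storage.blob.ContainerClient
    with write access to the target container (an assumption, not shown in
    the question). An existing blob with the same name is overwritten.
    """
    data = csv_text.encode("utf-8")
    container_client.upload_blob(name=blob_name, data=data, overwrite=True)
    return len(data)
```

With the DataFrame from the question, the call might look like upload_csv_text(out_container, "output.csv", df.to_csv(index=False)), where out_container is a hypothetical ContainerClient for the pdf-recognized container.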
【Comments】:
Tags: azure export blob databricks azure-databricks