【Question Title】: Writing parquet file throws... An HTTP header that's mandatory for this request is not specified
【Posted】: 2021-01-27 06:50:36
【Question】:

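I have two ADLS Gen2 storage accounts, both with hierarchical namespace enabled. In my Python notebook, I read a CSV file from one storage account and, after some enrichment, write it as a parquet file to the other storage account.

I get an error when writing the parquet file...

StatusCode=400, An HTTP header that's mandatory for this request is not specified.

Any help is much appreciated.

Below is my notebook code snippet...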

# Databricks notebook source
# MAGIC %python
# MAGIC 
# MAGIC STAGING_MOUNTPOINT = "/mnt/inputfiles"
# MAGIC if STAGING_MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
# MAGIC   dbutils.fs.unmount(STAGING_MOUNTPOINT)
# MAGIC 
# MAGIC PERM_MOUNTPOINT = "/mnt/outputfiles"
# MAGIC if PERM_MOUNTPOINT in [mnt.mountPoint for mnt in dbutils.fs.mounts()]:
# MAGIC   dbutils.fs.unmount(PERM_MOUNTPOINT)

STAGING_STORAGE_ACCOUNT = "--------"
STAGING_CONTAINER = "--------"
STAGING_FOLDER = "--------"
PERM_STORAGE_ACCOUNT = "--------"
PERM_CONTAINER = "--------"

configs = {
 "fs.azure.account.auth.type": "OAuth",
 "fs.azure.account.oauth.provider.type": 
 "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
 "fs.azure.account.oauth2.client.id": "#####################",
 "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="DemoScope",key="DemoSecret"),
 "fs.azure.account.oauth2.client.endpoint": 
 "https://login.microsoftonline.com/**********************/oauth2/token"}

STAGING_SOURCE = "abfss://{container}@{storage_acct}.blob.core.windows.net/".format(
 container=STAGING_CONTAINER,
 storage_acct=STAGING_STORAGE_ACCOUNT)

try:
 dbutils.fs.mount(
  source=STAGING_SOURCE,
  mount_point=STAGING_MOUNTPOINT,
  extra_configs=configs)
except Exception as e:
 if "Directory already mounted" in str(e):
  pass # Ignore error if already mounted.
 else:
  raise e

print("Staging Storage mount Success.")

inputDemoFile = "{}/{}/demo.csv".format(STAGING_MOUNTPOINT, STAGING_FOLDER)
readDF = (spark
          .read.option("header", True)
          .schema(inputSchema)
          .option("inferSchema", True)
          .csv(inputDemoFile))

PERM_SOURCE = "abfss://{container}@{storage_acct}.blob.core.windows.net/".format(
 container=PERM_CONTAINER,
 storage_acct=PERM_STORAGE_ACCOUNT)

try:
 dbutils.fs.mount(
 source=PERM_SOURCE,
 mount_point=PERM_MOUNTPOINT,
 extra_configs=configs)
except Exception as e:
 if "Directory already mounted" in str(e):
  pass # Ignore error if already mounted.
 else:
  raise e

print("Landing Storage mount Success.")

outPatientsFile = "{}/patients.parquet".format(outPatientsFilePath)
print("Writing to parquet file: " + outPatientsFile)

# The call below fails with this error:
# StatusCode=400
# StatusDescription=An HTTP header that's mandatory for this request is not specified.
# ErrorCode=
# ErrorMessage=

(readDF
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .option("compression", "snappy")
 .parquet(outPatientsFile)
)

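【Comments】:

  • Hi. If you are using Azure Blob Storage in Databricks, you should access the blobs with the wasbs protocol: docs.microsoft.com/en-us/azure/databricks/data/data-sources/…
  • I am using abfss because hierarchical namespace is enabled.
  • Hi. Since you are using Azure Data Lake Storage Gen2, the URL should look like abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/. Please change it and try again.
  • That was it, thanks.
  • Hi. I have summarized my suggestion as an answer. Since it worked for you, could you please accept it as an answer?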

Tags: parquet azure-databricks azure-blob-storage spark-notebook


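【Solution 1】:

I have summarized the solution below.

To mount Azure Data Lake Storage Gen2 as an Azure Databricks file system, the URL should look like abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/. The source URLs in the question use the .blob.core.windows.net endpoint with abfss instead. For more details, see here.

For example:

  1. Create an Azure Data Lake Storage Gen2 account.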
az login
az storage account create \
    --name <account-name> \
    --resource-group <group name> \
    --location westus \
    --sku Standard_RAGRS \
    --kind StorageV2 \
    --enable-hierarchical-namespace true
  2. Create a service principal and assign it the Storage Blob Data Contributor role at the scope of the Data Lake Storage Gen2 storage account.
az login

az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
    --scopes /subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>
  3. Mount Azure Data Lake Storage Gen2 in Azure Databricks (Python).
configs = {"fs.azure.account.auth.type": "OAuth",
       "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
       "fs.azure.account.oauth2.client.id": "<appId>",
       "fs.azure.account.oauth2.client.secret": "<clientSecret>",
       "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
       "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
mount_point = "/mnt/flightdata",
extra_configs = configs)
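
To catch this kind of scheme/endpoint mismatch before calling dbutils.fs.mount, the source URL can be checked up front. The following is only a minimal sketch and not part of the original answer: the function name check_mount_source and the EXPECTED_SUFFIX table are assumptions made for illustration, and the table covers only the public Azure cloud endpoints mentioned in this thread.

```python
from urllib.parse import urlsplit

# Assumed mapping (public Azure cloud only): URI scheme -> expected host suffix.
EXPECTED_SUFFIX = {
    "abfs": ".dfs.core.windows.net",
    "abfss": ".dfs.core.windows.net",
    "wasb": ".blob.core.windows.net",
    "wasbs": ".blob.core.windows.net",
}


def check_mount_source(source: str) -> None:
    """Raise ValueError if the URI scheme and the storage endpoint do not match.

    Hypothetical helper; it only inspects the string and never contacts Azure.
    """
    parts = urlsplit(source)
    suffix = EXPECTED_SUFFIX.get(parts.scheme)
    if suffix is None:
        raise ValueError(f"Unsupported scheme {parts.scheme!r} in {source!r}")
    # hostname drops the "<container>@" part and lowercases the host.
    host = parts.hostname or ""
    if not host.endswith(suffix):
        raise ValueError(
            f"Scheme {parts.scheme!r} expects a host ending in {suffix!r}, got {host!r}"
        )
```

A URL shaped like the ones in the question (abfss combined with a .blob.core.windows.net host) raises ValueError here, while the .dfs.core.windows.net form passes silently.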

【Comments】:

【Solution 2】:

A few important points to note when mounting a storage account in Azure Databricks.

For Azure Blob Storage: source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>"

For Azure Data Lake Storage Gen2: source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/"

To mount an Azure Data Lake Storage Gen2 file system, or a folder inside it, as an Azure Databricks file system, the URL should look like abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/

Reference: Azure Databricks - Azure Data Lake Storage Gen2
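
The two URL patterns above could be wrapped in a small helper so the scheme and endpoint are always chosen together. This is a hedged sketch only: build_mount_source and the kind values "blob" and "adls2" are names invented here for illustration, and the endpoints assume the public Azure cloud.

```python
def build_mount_source(kind: str, container: str, account: str, directory: str = "") -> str:
    """Build a mount source URL for Azure Blob Storage or ADLS Gen2.

    Hypothetical helper: 'kind' is "blob" (wasbs + blob endpoint)
    or "adls2" (abfss + dfs endpoint).
    """
    if kind == "blob":
        scheme, endpoint = "wasbs", "blob.core.windows.net"
    elif kind == "adls2":
        scheme, endpoint = "abfss", "dfs.core.windows.net"
    else:
        raise ValueError(f"Unknown storage kind: {kind!r}")
    # Strip a leading slash so the path is not doubled after the host.
    return f"{scheme}://{container}@{account}.{endpoint}/{directory.lstrip('/')}"
```

The returned string is what would be passed as the source argument of dbutils.fs.mount; the helper itself does not call Databricks or Azure.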

【Comments】:
