【Question title】: How is it that I can read from an Azure Blob Storage and fail to write onto it?
【Posted】: 2019-12-07 07:59:25
【Question】:

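I've been banging my head against the wall because I can't write a parquet file to Azure Blob Storage. In my Azure Databricks notebook I basically: 1. read a CSV from the blob storage into a DataFrame, and 2. try to write the DataFrame back to the same storage.

I'm able to read the CSV, but I get this error when I try to write the parquet file.

Here is the stack trace:

Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 20, 10.139.64.5, executor 0): shaded.databricks.org.apache.hadoop.fs.azure.AzureException: java.io.IOException at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.storeEmptyFolder(AzureNativeFileSystemStore.java:1609) ... ... Caused by: com.microsoft.azure.storage.StorageException: The specified resource does not exist.

Here is my Python code: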

spark.conf.set("fs.azure.sas.my_container.my_storage.blob.core.windows.net", dbutils.secrets.get(scope = "my_scope", key = "my_key"))

# Read the CSV

df100 = spark.read.format("csv").option("header", "true").load("wasbs://my_container@my_storage.blob.core.windows.net/folder/revenue.csv")

# Write parquet

df100.write.parquet('wasbs://my_container@my_storage.blob.core.windows.net/f1/deh.parquet')

# End

【Comments】:

Tags: python azure-blob-storage azure-databricks

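【Solution 1】:

What worked for me was writing directly to the Azure blob storage container through its URL. With this approach you don't have to mount the container to DBFS.

Below is a code snippet for writing CSV data directly to an Azure blob storage container from an Azure Databricks notebook.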

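# Configure the storage account access key for this Spark session.
# Despite its name, sas_key must hold an account access key here; a SAS token
# would instead be set under fs.azure.sas.<container>.<account>.blob.core.windows.net
spark.conf.set(
  "fs.azure.account.key.%s.blob.core.windows.net" % storage_name,
  sas_key)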

output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path

# write the dataframe as a single file to blob storage
(dataframe
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .format("com.databricks.spark.csv")
 .save(output_blob_folder))

# Get the name of the wrangled-data CSV file that was just saved to Azure blob storage (it starts with 'part-')
files = dbutils.fs.ls(output_blob_folder)
output_file = [x for x in files if x.name.startswith("part-")]

# Move the wrangled-data CSV file from a sub-folder (wrangled_data_folder) to the root of the blob container
# While simultaneously changing the file name
dbutils.fs.mv(output_file[0].path, "%s/predict-transform-output.csv" % output_container_path)
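
The `output_file[0]` lookup above assumes that exactly one `part-` file exists in the folder. A minimal sketch of a stricter selection step follows. `single_part_file` is a hypothetical helper, and the `FileInfo` namedtuple and the sample listing are made-up stand-ins for what `dbutils.fs.ls` returns, so the snippet can run outside Databricks.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (assumption: only
# the path, name and size fields matter for this sketch).
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])


def single_part_file(files):
    """Return the only entry whose name starts with 'part-'; raise otherwise."""
    parts = [f for f in files if f.name.startswith("part-")]
    if len(parts) != 1:
        raise ValueError("expected exactly one part- file, found %d" % len(parts))
    return parts[0]


# Made-up listing that mimics a folder written with coalesce(1).
folder = "wasbs://my_container@my_storage.blob.core.windows.net/wrangled_data_folder"
listing = [
    FileInfo(folder + "/_SUCCESS", "_SUCCESS", 0),
    FileInfo(folder + "/part-00000-abc.csv", "part-00000-abc.csv", 1024),
]

chosen = single_part_file(listing)
print(chosen.path)
```

In a notebook, the listing would come from `dbutils.fs.ls(output_blob_folder)` and `chosen.path` would be passed to `dbutils.fs.mv`. Failing loudly when the folder holds zero or several part files avoids moving the wrong file.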

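Hope this helps.

【Comments】:

【Solution 2】:

You can do this by mounting the storage on Databricks; after that, the storage is accessible through a path such as /mnt/yourstoragepath/folder1

To do this, set the storage account name and the storage access key or SAS on Databricks: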

spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
  "<storage-account-access-key>")

spark.conf.set(
  "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net",
  "<complete-query-string-of-sas-for-the-container>")

After setting these, try reading the data as follows:

df = spark.read.parquet("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")

or

dbutils.fs.ls("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>")
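
Reading and listing only need the read and list permissions, so a SAS that works for the calls above can still be rejected on write. Whether that is the cause of the error in the question is not confirmed; the sketch below is just one way to rule it out. It assumes the token is a standard SAS query string whose `sp` field lists the permission letters. `sas_permissions` and `can_write` are hypothetical helpers, and the sample tokens are made up.

```python
from urllib.parse import parse_qs


def sas_permissions(sas_token):
    """Return the set of permission letters found in the token's `sp` field."""
    query = sas_token.lstrip("?")
    values = parse_qs(query).get("sp", [""])
    return set(values[0])


def can_write(sas_token):
    # Assumption for this sketch: a Spark write needs write (w) and
    # delete (d), because temporary output is created and later removed.
    return {"w", "d"} <= sas_permissions(sas_token)


# Made-up tokens, not real credentials.
read_only = "?sv=2019-02-02&sr=c&sp=rl&sig=FAKE"
read_write = "?sv=2019-02-02&sr=c&sp=rwdl&sig=FAKE"

print(can_write(read_only))
print(can_write(read_write))
```

If the check fails for the real token, generating a new SAS with write and delete permissions for the container would be the next thing to try.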

To write, use this syntax:

df.write.mode("overwrite").option("path","/mnt/mountName/folder1/tablename").saveAsTable("database.tablename")
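
The `/mnt/mountName` path above only exists once the container has been mounted, a step this answer does not show. A minimal sketch of how the arguments for `dbutils.fs.mount` could be assembled follows, assuming access through a container-level SAS. `build_mount_args` is a hypothetical helper, and the container, account and mount names are placeholders borrowed from this thread.

```python
def build_mount_args(container, account, mount_name, sas_token):
    """Build keyword arguments for dbutils.fs.mount using a container SAS."""
    host = "%s.blob.core.windows.net" % account
    return {
        "source": "wasbs://%s@%s" % (container, host),
        "mount_point": "/mnt/%s" % mount_name,
        # The config key has the same form as the spark.conf.set key used earlier.
        "extra_configs": {"fs.azure.sas.%s.%s" % (container, host): sas_token},
    }


# Placeholder values; in a notebook the token would come from dbutils.secrets.get.
mount_args = build_mount_args("my_container", "my_storage", "mountName", "<sas-token>")
print(mount_args["source"])
print(mount_args["mount_point"])

# Inside a Databricks notebook, where dbutils is predefined:
# dbutils.fs.mount(**mount_args)
```

Building the arguments in a plain function keeps the string formatting checkable outside Databricks; the actual mount call stays commented out because `dbutils` only exists inside a notebook.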

Please refer to this official link

【Comments】: