【Question title】: Databricks dbutils not displaying folder list under specific folder
【Posted】: 2021-12-24 05:59:29
【Question description】:

I have three folders under one container.

Folder structure:

 folder1
   |_ file1.json
   |_ file2.json
 folder2
   |_ sub-folder1
       |_ file1.json
   |_ sub_folder2
       |_ sub-folder01
       |_ file2.json
 folder3
    |_ sub-folder1
        |_ file1.json

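Note: folder2 contains only a list of folders, which may in turn contain files. I am trying to iterate over them and look for a specific file name in Python code.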

from pyspark.sql.functions import col, lit
from datetime import datetime

app_storage_acct_name = 'mystorageaccnt1'
app_storage_acct_scope = "{}-scope".format(app_storage_acct_name)
# Name of the secret that holds the storage account key. This variable is not
# defined in the original snippet; the value below is a placeholder.
app_storage_acct_key = '<secret-key-name>'

config_secret_set_url = "fs.azure.account.key.{}.blob.core.windows.net".format(app_storage_acct_name)
secret = dbutils.secrets.get(scope=app_storage_acct_scope, key=app_storage_acct_key)

# Mount the container; the account name in the URL must match app_storage_acct_name.
dbutils.fs.mount(
  source = "wasbs://mycontainer1@{}.blob.core.windows.net".format(app_storage_acct_name),
  mount_point = "/mnt/my-data-src",
  extra_configs = {config_secret_set_url: secret})

dbutils.fs.ls('/mnt/my-data-src/')

The code above prints the three folders that I also see in the blob storage explorer:

Out[29]: [FileInfo(path='dbfs:/mnt/my-data-src/folder1/', name='folder1/', size=0),
 FileInfo(path='dbfs:/mnt/my-data-src/folder2/', name='folder2/', size=0),
 FileInfo(path='dbfs:/mnt/my-data-src/folder3/', name='folder3/', size=0)]

When I run the following, the files are listed:

dbutils.fs.ls('/mnt/my-data-src/folder1/')
  • The output looks like this:
Out[30]: [FileInfo(path='dbfs:/mnt/my-data-src/folder1/file1.json', name='file1.json', size=1011),
 FileInfo(path='dbfs:/mnt/my-data-src....,

When I try to list the folders under folder2:

dbutils.fs.ls('/mnt/my-data-src/folder2/')
  • Output: java.io.FileNotFoundException: File /folder2 does not exist.
ExecutionError                            Traceback (most recent call last)
<command-2660727172978602> in <module>
----> 1 dbutils.fs.ls('/mnt/my-data-src/folder2/')

/databricks/python_shell/dbruntime/dbutils.py in f_with_exception_handling(*args, **kwargs)
    317                     exc.__context__ = None
    318                     exc.__cause__ = None
--> 319                     raise exc
    320 
    321             return f_with_exception_handling

ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: java.io.FileNotFoundException: File /folder2 does not exist.
    at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem.listStatus(NativeAzureFileSystem.java:2468)
    at com.databricks.backend.daemon.data.client.DBFSV2.$anonfun$listStatus$2(DatabricksFileSystemV2.scala:95)
    at com.databricks.s3a.S3AExceptionUtils$.convertAWSExceptionToJavaIOException(DatabricksStreamUtils.scala:66)
    at com.databricks.backend.daemon.data.client.DBFSV2.$anonfun$listStatus$1(DatabricksFileSystemV2.scala:92)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:395)
    at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:484)
    at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:504)
    at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
    at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionContext(DatabricksFileSystemV2.scala:510)
    at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:305)
    at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:297)
    at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionTags(DatabricksFileSystemV2.scala:510)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:479)
    at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:404)
    at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperationWithResultTags(DatabricksFileSystemV2.scala:510)
    at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:395)
    at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:367)
    at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperation(DatabricksFileSystemV2.scala:510)
    at com.databricks.backend.daemon.data.client.DBFSV2.listStatus(DatabricksFileSystemV2.scala:92)
    at com.databricks.backend.daemon.data.client.DatabricksFileSystem.listStatus(DatabricksFileSystem.scala:150)
    at com.databricks.backend.daemon.dbutils.FSUtils$.$anonfun$ls$1(DBUtilsCore.scala:154)
    at com.databricks.backend.daemon.dbutils.FSUtils$.withFsSafetyCheck(DBUtilsCore.scala:91)
    at com.databricks.backend.daemon.dbutils.FSUtils$.ls(DBUtilsCore.scala:153)
    at com.databricks.backend.daemon.dbutils.FSUtils.ls(DBUtilsCore.scala)
    at sun.reflect.GeneratedMethodAccessor223.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)

Is there a specific reason why dbutils.fs.ls() does not list a folder that contains only folders in this case?

Update: I tried to access a file directly and noticed that it is of blob type Append Blob. dbutils.fs.ls('/mnt/my-data-src/folder2/file.json') reports the following message.

shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

Is there a way to list blobs of type Append Blob in Databricks?

【Comments】:

  • Which DBR version is used?
  • 9.1 LTS with Spark 3.1.2, Scala 2.12

Tags: python databricks azure-databricks


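【Solution 1】:

Azure Databricks does support accessing append blobs using the Hadoop API, but only when appending to a file.

There is no workaround for this issue.

Use the Azure CLI or the Azure Storage SDK for Python to determine whether a directory contains append blobs or whether an object is an append blob.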
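
The SDK check described above could look roughly like the following. This is a minimal sketch, not code from the answer: the helper names `append_blob_names` and `list_append_blobs` are hypothetical, and it assumes the `azure-storage-blob` (v12) package, where each listed blob carries a string-valued `blob_type`. The demonstration at the bottom uses stand-in objects so it runs without a storage account.

```python
from types import SimpleNamespace


def append_blob_names(blobs):
    """Return the names of all blobs whose blob_type is "AppendBlob"."""
    # In azure-storage-blob, blob_type is a string-valued enum, so comparing
    # it against the plain string is assumed to work for real listings too.
    return [b.name for b in blobs if b.blob_type == "AppendBlob"]


def list_append_blobs(connection_string, container_name, prefix):
    """List append blobs under `prefix` (for example "folder2/")."""
    # Imported lazily so the pure helper above works without the SDK installed.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_connection_string(
        connection_string, container_name=container_name
    )
    return append_blob_names(container.list_blobs(name_starts_with=prefix))


# Offline demonstration with stand-in objects (hypothetical data).
sample = [
    SimpleNamespace(name="folder1/file1.json", blob_type="BlockBlob"),
    SimpleNamespace(name="folder2/sub-folder1/file1.json", blob_type="AppendBlob"),
]
print(append_blob_names(sample))
```

If the result for a prefix is non-empty, that would be consistent with `dbutils.fs.ls` failing on the same path, as in the question.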

You can implement a Spark SQL UDF, or a custom function using the RDD API, that loads, reads, or converts the blobs with the Azure Storage SDK for Python.
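
A custom read function of that kind might be sketched as below. The names `read_blob_text` and `read_json_blobs` are hypothetical; the sketch assumes a `ContainerClient`-like object whose `download_blob(name)` returns a downloader with `readall()`, as in `azure-storage-blob` v12. The `_FakeContainer` class is only a stand-in so the example runs offline.

```python
import json


def read_blob_text(container_client, blob_name, encoding="utf-8"):
    """Download one blob through the SDK client and decode it as text."""
    downloader = container_client.download_blob(blob_name)
    return downloader.readall().decode(encoding)


def read_json_blobs(container_client, blob_names):
    """Read several JSON blobs into a dict keyed by blob name."""
    return {
        name: json.loads(read_blob_text(container_client, name))
        for name in blob_names
    }


class _FakeDownloader:
    """Stand-in for the SDK's download stream (offline demo only)."""

    def __init__(self, data):
        self._data = data

    def readall(self):
        return self._data


class _FakeContainer:
    """Stand-in for ContainerClient, backed by an in-memory dict."""

    def __init__(self, blobs):
        self._blobs = blobs

    def download_blob(self, name):
        return _FakeDownloader(self._blobs[name])


fake = _FakeContainer({"folder2/sub-folder1/file1.json": b'{"id": 1}'})
records = read_json_blobs(fake, ["folder2/sub-folder1/file1.json"])
print(records)
```

To run this per partition from an RDD, one option is to create the real client inside the function passed to `mapPartitions`, since a client object created on the driver may not be serializable to the executors.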

Official documentation is available for this issue.

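【Comments】:

    【Solution 2】:

    After a little research, I found the link in the documentation: Official Doc

    • Finally, there is a way to list them as files from within a Databricks notebook. See the git sample link.

    Step 1. Install the azure-storage-blob module on the temporary cluster in the workspace.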

    %pip install azure-storage-blob
    

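    Step 2. Get the connection string of the Azure storage account and use it to list the blobs in the container.

    from azure.storage.blob import ContainerClient
    
    # Replace the placeholder with the storage account's connection string.
    CONNECTION_STRING_OF_AZURE_BLOB_STORAGE='<connection-string-blob-storage-of-Access (IAM)>'
    
    container = ContainerClient.from_connection_string(CONNECTION_STRING_OF_AZURE_BLOB_STORAGE, container_name="my-app-container")
    # list_blobs() returns every blob in the container with its full path as the name.
    blob_list = container.list_blobs()
    for blob in blob_list:
        print(blob.name + '\n')
    
    • With the code above, I was able to list all the files in every folder.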
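
Because the SDK returns a flat list of full blob paths rather than a folder tree, finding a specific file name (the goal stated in the question) is a matter of filtering those names. The sketch below is illustrative only: `group_by_folder` and `find_by_filename` are hypothetical helpers, and the sample names are taken from the folder structure shown in the question.

```python
import posixpath
from collections import defaultdict


def group_by_folder(blob_names):
    """Group flat blob paths into {folder: [file names]}."""
    folders = defaultdict(list)
    for name in blob_names:
        folder, filename = posixpath.split(name)
        folders[folder].append(filename)
    return dict(folders)


def find_by_filename(blob_names, filename):
    """Return every blob path whose last path segment equals `filename`."""
    return [n for n in blob_names if posixpath.basename(n) == filename]


# Blob names as list_blobs() would report them for the question's layout.
names = [
    "folder1/file1.json",
    "folder1/file2.json",
    "folder2/sub-folder1/file1.json",
    "folder2/sub_folder2/file2.json",
    "folder3/sub-folder1/file1.json",
]
print(find_by_filename(names, "file2.json"))
print(group_by_folder(names))
```

In a notebook, `names` could instead be built from the listing, for example `[blob.name for blob in container.list_blobs()]`.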

    【Comments】:
