Azure Databricks command stuck when processing large files. Pure Python. (2.5 GB+ file size)

【Question title】: Azure Databricks Command Stuck when Processing large files. Pure Python. (2.5gb + file size)
【Posted】: 2020-09-09 14:39:30
【Question description】:

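I am converting txt files to XML format using pure Python. I have a list of txt files ranging from 1 KB to 2.5 GB. The size grows about 5x during conversion.

The problem is that when processing the larger 2.5 GB files, the first file works, but subsequent processing hangs and stays stuck on "Running command...". Smaller files seem to work without issue.

I have edited the code to make sure it uses generators and does not keep large lists in memory.
I am processing from dbfs, so the connection should not be an issue.
Memory checks show it consistently uses only about 200 MB of memory, and that figure does not grow.
Large files take about 10 minutes to process.
There are no GC warnings or other errors in the logs.
Azure Databricks, pure Python.
The cluster is large enough and is only running Python, so that should not be the issue.
Restarting the cluster is the only thing that gets things working again.
The stuck command also causes other notebooks on the cluster to stop working.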

Basic outline of the code, edited for simplicity.

# list of files to convert that are in Azure Blob Storage
text_files = ['file1.txt','file2.txt','file3.txt']

# loop over files and convert them to xml
for file in text_files:
    
    xml_filename = file.replace('.txt','.xml')
    # copy files from blob storage to dbfs
    dbutils.fs.cp(f'dbfs:/mnt/storage_account/projects/xml_converter/input/{file}',f'dbfs:/tmp/temporary/{file}')
    
    # open files and convert to xml
    with open(f'/dbfs/tmp/temporary/{file}','r') as infile, open(f'/dbfs/tmp/temporary/{xml_filename}','a', encoding="utf-8") as outfile:

        # list of strings to join at write time
        to_write = []

        for line in infile:
            # convert to xml
            # code redacted for simplicity

            to_write.append(new_xml)

            # batch the write operations to avoid huge lists
            if len(to_write) > 10_000:

                outfile.write(''.join(to_write))
                to_write = [] # reset the batch

        # do a final write of anything that is in the list
        outfile.write(''.join(to_write))
    
    # move completed files from dbfs to blob storage
    dbutils.fs.cp(f'dbfs:/tmp/temporary/{xml_filename}',f"/mnt/storage_account/projects/xml_converter/output/{xml_filename}")

Azure cluster info

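I would expect this code to run without issue. Memory does not seem to be the problem. The data is in dbfs, so it is not a blob issue. It uses generators, so not much is held in memory. I am at a loss. Any suggestions would be greatly appreciated. Thanks for looking!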
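Since the conversion step is redacted in the outline above, here is a minimal self-contained sketch of the same batching pattern, for anyone who wants to reproduce it outside a notebook. The `line_to_xml` helper and the `<row>` element are hypothetical stand-ins for the redacted code, and the sketch opens the output with "w" rather than "a", which is an assumption about the intent and not a confirmed fix.

```python
from xml.sax.saxutils import escape


def line_to_xml(line):
    # Hypothetical stand-in for the redacted conversion:
    # wrap each input line in a <row> element, escaping XML special characters.
    text = line.rstrip("\n")
    return "<row>" + escape(text) + "</row>\n"


def convert_file(in_path, out_path, batch_size=10_000):
    """Stream in_path line by line and write XML to out_path in batches.

    Returns the number of lines written.
    """
    batch = []
    written = 0
    # "w" instead of "a" so a rerun starts from an empty output file.
    with open(in_path, "r", encoding="utf-8") as infile, \
         open(out_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            batch.append(line_to_xml(line))
            # Flush once the batch is full so the list never grows unbounded.
            if len(batch) >= batch_size:
                outfile.write("".join(batch))
                written += len(batch)
                batch.clear()
        # Final write of whatever is left in the batch.
        outfile.write("".join(batch))
        written += len(batch)
    return written
```

Memory use stays bounded by `batch_size` lines regardless of input size, which matches the roughly constant memory figure reported above.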

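【Question discussion】:

I am running into this problem right now as well, while trying to read from a mounted data lake. Apparently, files larger than 2 GB (the maximum signed 32-bit integer) cannot be read properly in Databricks. I tried mounting the folder as specified here, but to no avail. Maybe you will have better luck!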
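To make the 2 GB figure in the comment above concrete, here is a small sketch of the arithmetic. The limit itself is the commenter's claim, not something verified here; `exceeds_int32_limit` and `growth_factor` are illustrative names, and the 5x factor comes from the question.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647 bytes, just under 2 GiB


def exceeds_int32_limit(size_bytes):
    """Return True if a byte count does not fit in a signed 32-bit integer."""
    return size_bytes > INT32_MAX


def output_exceeds_limit(input_bytes, growth_factor=5):
    """Return True if the converted output would exceed the limit,
    assuming the output is growth_factor times the input size."""
    return exceeds_int32_limit(input_bytes * growth_factor)
```

If the limit does apply, a 5x growth factor means any input above roughly 410 MiB would already produce an output file past the threshold, not just the 2.5 GB inputs.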

Tags: python azure databricks azure-databricks

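【Solution 1】:

Have you tried copying the file from Azure Storage to the local Databricks /tmp/ folder instead of using dbfs? I had a similar problem when unzipping a large .zip file, and this solved it. Take a look here: https://docs.databricks.com/data/databricks-file-system.html

Side note: since you are using pure Python, the workers are not used to process the files. You could switch to a single-node setup.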
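A rough sketch of the copy-to-local-disk approach suggested above, in plain Python so it can be tried anywhere. The `process_on_local_disk` function and its `convert` callback are hypothetical names, not part of any Databricks API. In a notebook, the two copy steps could instead be `dbutils.fs.cp` calls using a `file:/tmp/...` path for the local side.

```python
import os
import shutil
import tempfile


def process_on_local_disk(src_path, dst_path, convert, work_dir=None):
    """Copy src_path to local scratch space, run convert there,
    then copy the result to dst_path.

    convert is any callable taking (input_path, output_path).
    """
    scratch = tempfile.mkdtemp(dir=work_dir)
    try:
        # Prefixes keep the two scratch files distinct even if names collide.
        local_in = os.path.join(scratch, "in_" + os.path.basename(src_path))
        local_out = os.path.join(scratch, "out_" + os.path.basename(dst_path))
        shutil.copyfile(src_path, local_in)    # mounted storage -> local disk
        convert(local_in, local_out)           # all reads and writes hit local disk
        shutil.copyfile(local_out, dst_path)   # local disk -> mounted storage
    finally:
        # Always clean up the scratch directory, even if convert fails.
        shutil.rmtree(scratch, ignore_errors=True)
    return dst_path
```

With the paths from the question, a call might look like `process_on_local_disk('/dbfs/mnt/storage_account/projects/xml_converter/input/file1.txt', '/dbfs/mnt/storage_account/projects/xml_converter/output/file1.xml', my_converter)`, where `my_converter` is whatever function performs the txt-to-XML conversion. Keep in mind that the driver's local disk needs room for both the input and the roughly 5x larger output.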

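【Discussion】:

【Solution 2】:

This is expected behavior for the environment. If the script is pure Python, it runs only on the driver node of the Databricks cluster, so you are paying for a cluster while doing single-node processing. On smaller datasets, pure Python will certainly perform better than PySpark, but with larger datasets you will see the difference.

【Discussion】: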