Where is my sparkDF.persist(DISK_ONLY) data stored?

【Question title】: Where is my sparkDF.persist(DISK_ONLY) data stored?
【Posted】: 2018-01-26 03:49:01
【Question description】:

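I want to understand more about Spark's persistence strategy when running on Hadoop.

When I persist a DataFrame with the DISK_ONLY strategy, where is my data stored (path/folder...)? And where do I specify this location?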

【Question comments】:

A small note: cache on a Dataset means persisting with level MEMORY_AND_DISK, so cached data can also be written to disk.

Tags: hadoop apache-spark persist

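【Solution 1】:

For the short answer, we can look at the documentation regarding spark.local.dir:

Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.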
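To make the setting concrete, here is a minimal sketch of how spark.local.dir could be set before persisting with DISK_ONLY. The application name, the temporary scratch directory and the row count are illustrative assumptions, not something from the question, and the setting is only honoured when none of the cluster-manager overrides mentioned in the quote apply (for example in local mode).

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DiskOnlyPersistExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway scratch directory, created so that the path is guaranteed to exist.
    val scratchDir = Files.createTempDirectory("spark-scratch").toString

    val spark = SparkSession.builder()
      .appName("disk-only-persist") // illustrative name
      .master("local[*]")
      // Ignored when SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) is set.
      .config("spark.local.dir", scratchDir)
      .getOrCreate()

    val df = spark.range(0L, 1000000L).toDF("id")
    df.persist(StorageLevel.DISK_ONLY)
    df.count() // persist is lazy; an action is needed to actually write the blocks

    println(s"Persisted blocks should now be somewhere under $scratchDir")

    df.unpersist()
    spark.stop()
  }
}
```

Running this requires Spark on the classpath. On a YARN cluster the same code would still persist to disk, but under the directories YARN hands to the container rather than under the configured path.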

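For a more in-depth understanding we can look at the code: a DataFrame (which is just a Dataset[Row]) is based on RDDs and leverages the same persistence mechanism. RDDs delegate this to SparkContext, which marks them for persistence. The task is then actually handled by several classes in the org.apache.spark.storage package: first, the BlockManager only manages the chunks of data to be persisted and the policy on how to do so, delegating the actual persistence to a DiskStore (when writing to disk, of course), which represents a high-level interface for writing and in turn uses a DiskBlockManager for the more low-level operations.

Hopefully you now have an idea of where to look, so we can move on and understand where the data is actually persisted and how we can configure it: the DiskBlockManager invokes the helper Utils.getConfiguredLocalDirs, which for practicality I am going to copy here (taken from the linked 2.2.1 version, the latest release at the time of writing):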

def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
    val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
    if (isRunningInYarnContainer(conf)) {
        // If we are in yarn mode, systems can have different disk layouts so we must set it
        // to what Yarn on this system said was available. Note this assumes that Yarn has
        // created the directories already, and that they are secured so that only the
        // user has access to them.
        getYarnLocalDirs(conf).split(",")
    } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
        conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
    } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
        conf.getenv("SPARK_LOCAL_DIRS").split(",")
    } else if (conf.getenv("MESOS_DIRECTORY") != null && !shuffleServiceEnabled) {
        // Mesos already creates a directory per Mesos task. Spark should use that directory
        // instead so all temporary files are automatically cleaned up when the Mesos task ends.
        // Note that we don't want this if the shuffle service is enabled because we want to
        // continue to serve shuffle files after the executors that wrote them have already exited.
        Array(conf.getenv("MESOS_DIRECTORY"))
    } else {
        if (conf.getenv("MESOS_DIRECTORY") != null && shuffleServiceEnabled) {
            logInfo("MESOS_DIRECTORY available but not using provided Mesos sandbox because " +
                "spark.shuffle.service.enabled is enabled.")
        }
        // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
        // configuration to point to a secure directory. So create a subdirectory with restricted
        // permissions under each listed directory.
        conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
    }
}

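I believe the code is fairly self-explanatory and well commented (and it matches the contents of the documentation exactly): when running on Yarn there is a specific policy that relies on the storage of the Yarn containers; on Mesos it uses the Mesos sandbox (unless the shuffle service is enabled); and in all other cases it goes to the location set under spark.local.dir or, failing that, java.io.tmpdir (which is likely to be /tmp/).

So, if you are just playing around, the data is most likely stored under /tmp/; otherwise it depends a lot on your environment and configuration.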
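The precedence described above can be sketched as a small standalone function. This is a simplified model for experimenting, not Spark's API: resolveLocalDirs, its parameters and the plain Map inputs are hypothetical stand-ins for SparkConf and the process environment, and the YARN branch only reads LOCAL_DIRS.

```scala
object LocalDirResolution {
  /** Simplified model of the lookup order in getConfiguredLocalDirs (hypothetical helper). */
  def resolveLocalDirs(
      env: Map[String, String],
      conf: Map[String, String],
      inYarnContainer: Boolean,
      tmpDir: String = System.getProperty("java.io.tmpdir")): Seq[String] = {
    val shuffleServiceEnabled =
      conf.get("spark.shuffle.service.enabled").exists(_.toBoolean)

    val dirs: Array[String] =
      if (inYarnContainer) {
        // On YARN the container's LOCAL_DIRS wins over everything else.
        env.getOrElse("LOCAL_DIRS", "").split(",")
      } else if (env.contains("SPARK_EXECUTOR_DIRS")) {
        env("SPARK_EXECUTOR_DIRS").split(java.io.File.pathSeparator)
      } else if (env.contains("SPARK_LOCAL_DIRS")) {
        env("SPARK_LOCAL_DIRS").split(",")
      } else if (env.contains("MESOS_DIRECTORY") && !shuffleServiceEnabled) {
        Array(env("MESOS_DIRECTORY"))
      } else {
        // Last resort: spark.local.dir, then the JVM temp directory.
        conf.getOrElse("spark.local.dir", tmpDir).split(",")
      }

    dirs.filter(_.nonEmpty).toSeq
  }
}
```

For example, with SPARK_LOCAL_DIRS set in the environment, a spark.local.dir entry in the configuration is ignored, which matches the note in the documentation quoted earlier.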

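【Comments】:

Thank you very much @stefanobaghino for the effort you put into this well-structured and detailed answer. The next step for me is to look into the YARN configuration loaded by getYarnLocalDirs(conf).split(",").
Nice answer. I would not say that "a DataFrame is based on RDDs", though; rather, it can generate an RDD lineage that Spark executes (see QueryExecution.toRDD).
@JacekLaskowski Thanks, if you think it is a good answer I am pretty sure it is. :) Thanks for the comment, I actually was not aware of this. I will try to edit the answer so that it conveys this more accurately. Just so that I do not write anything inaccurate, the main point still holds: the actual caching is delegated to the RDD, right?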

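【Solution 2】:

To sum it up for my YARN environment:

With the guidance of @stefanobaghino I was able to go one step further in the code that loads the YARN configuration.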

val localDirs = Option(conf.getenv("LOCAL_DIRS")).getOrElse("")

which is set through the yarn.nodemanager.local-dirs option in yarn-default.xml.

The background of my question is that, because of the error

2018-01-23 16:57:35,229 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /data/1/yarn/local error, used space above threshold of 98.5%, removing from list of valid directories

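my Spark job was sometimes killed, and I wanted to understand whether this disk is also used for the data I persist while the job is running (which is actually a huge amount).

So it turns out that this is exactly the folder where the data goes when it is persisted with a DISK strategy.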
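One way to check this on a node is to look for the block files themselves. The sketch below is plain Scala with no Spark dependency and rests on an assumption about Spark 2.x internals that is not stated in this thread: persisted RDD partitions are written as files named rdd_<rddId>_<partitionIndex> inside blockmgr-* subdirectories of the local dirs. The object and method names are hypothetical.

```scala
import java.io.IOException
import java.nio.file.{FileVisitResult, Files, Path, Paths, SimpleFileVisitor}
import java.nio.file.attribute.BasicFileAttributes
import scala.collection.mutable.ArrayBuffer

object FindPersistedBlocks {
  // Assumed naming scheme for persisted RDD partitions: rdd_<rddId>_<partitionIndex>
  private val RddBlockName = """rdd_\d+_\d+""".r.pattern

  /** Recursively collects files under `root` whose names look like RDD block files. */
  def findRddBlockFiles(root: Path): Seq[Path] = {
    val found = ArrayBuffer.empty[Path]
    if (Files.isDirectory(root)) {
      Files.walkFileTree(root, new SimpleFileVisitor[Path] {
        override def visitFile(file: Path, attrs: BasicFileAttributes): FileVisitResult = {
          if (RddBlockName.matcher(file.getFileName.toString).matches()) found += file
          FileVisitResult.CONTINUE
        }
        // Skip entries that cannot be read instead of aborting the whole walk.
        override def visitFileFailed(file: Path, exc: IOException): FileVisitResult =
          FileVisitResult.CONTINUE
      })
    }
    found.toList
  }

  def main(args: Array[String]): Unit = {
    // Pass one of the configured local dirs, e.g. an entry of yarn.nodemanager.local-dirs.
    val root = Paths.get(args.headOption.getOrElse(System.getProperty("java.io.tmpdir")))
    findRddBlockFiles(root).foreach(p => println(s"${Files.size(p)} bytes  $p"))
  }
}
```

Run it against a local dir while the job is active; block files normally disappear once the DataFrame is unpersisted or the application ends.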

Thank you very much for all the helpful guidance on this problem!
