【问题标题】:Spark: Error on writing DataFrameSpark:写入 DataFrame 时出错
【发布时间】:2020-06-04 10:51:40
【问题描述】:

我正在尝试将 DataFrame 编写为 json 格式,但是不断出现错误(无论我选择哪种格式):

我的代码:

var finalDF = spark_session.createDataFrame(d, schema)
finalDF.show(10, false)
finalDF.write.mode("overwrite").json("test/df.json")

show 方法打印了预期的结果,但是当它要写的时候抛出这个错误:

    ExitCodeException exitCode=-1073741515: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:575)
    at org.apache.hadoop.util.Shell.run(Shell.java:478)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
    at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
    at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
    at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
    at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JsonFileFormat.scala:140)
    at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anon$1.newInstance(JsonFileFormat.scala:80)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:305)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/05/16 17:09:48 WARN FileUtil: Failed to delete file or dir [C:\Users\jsolano\IdeaProjects\Test2\test\df.json\_temporary\0\_temporary\attempt_20180516170948_0005_m_000000_0\.part-00000-ff4d215c-00f2-4585-89bb-d53426315539-c000.json.crc]: it still exists.
18/05/16 17:09:48 WARN FileUtil: Failed to delete file or dir [C:\Users\jsolano\IdeaProjects\Test2\test\df.json\_temporary\0\_temporary\attempt_20180516170948_0005_m_000000_0\part-00000-ff4d215c-00f2-4585-89bb-d53426315539-c000.json]: it still exists.
18/05/16 17:09:48 WARN FileOutputCommitter: Could not delete file:/C:/Users/jsolano/IdeaProjects/Test2/test/df.json/_temporary/0/_temporary/attempt_20180516170948_0005_m_000000_0
18/05/16 17:09:48 ERROR FileFormatWriter: Job job_20180516170948_0005 aborted.
18/05/16 17:09:48 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 4)
org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: ExitCodeException exitCode=-1073741515: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:575)
    at org.apache.hadoop.util.Shell.run(Shell.java:478)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
    at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
    at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
    at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
    at org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JsonFileFormat.scala:140)
    at org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anon$1.newInstance(JsonFileFormat.scala:80)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:305)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:314)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
    ... 8 more

没有具体说明。

我正在使用带有 IntelliJ 和 Scala 的 Windows 10,我已经设置了 hadoop.home.dir 属性

【问题讨论】:

标签: scala apache-spark apache-spark-sql


【解决方案1】:

我从 hadoop home 中删除了 hadoop.dll,如下链接中所述,它对我有用。

java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileWithMode0(Ljava/lang/String;JJJI)Ljava/io/FileDescriptor

注意:写入 COS 位置时不会抛出 hadoop.dll 错误。

【讨论】:

    【解决方案2】:

    看起来在您之前的运行中,临时文件夹没有被清理。这是已知问题。见链接 - https://issues.apache.org/jira/browse/SPARK-12216 您可以在哪里手动删除临时文件夹 C:/Users/jsolano/IdeaProjects/Test2/test/df.json/_temporary?

    【讨论】:

    • 不,之前的文件夹是否被清理并不重要,我发现 Spark 中的写入操作在 Windows 10 中不起作用。我已经测试过相同的脚本并在 Windows 7 机器上运行完美
    【解决方案3】:

    实际上,我发现 Spark 写入操作在 Windows 10 中不起作用。我在 Windows 7 中一遍又一遍地运行该脚本并且运行良好。

    【讨论】:

      【解决方案4】:

      无论写入“C:\Users*”还是“C:\some_dir”,问题都不会解决

      我想我回答这个问题有点晚了,但我已经用另一种方法解决了这个问题,所以我想分享一下。

      1. 为hadoop下载windows实用程序winutils

      2. 将压缩文件hadoop-winutils-2.6.0.zip解压到C:\Users\user_name\winutils\bin

        李>
      3. 在您的代码中设置系统属性,如

      System.setProperty("hadoop.home.dir", "C:\\Users\\user_name\\winutils");

      就是这样。现在您可以将数据框写入任何 windows 目录。

      【讨论】:

        猜你喜欢
        • 2018-11-30
        • 1970-01-01
        • 2018-11-27
        • 2021-06-15
        • 1970-01-01
        • 1970-01-01
        • 2021-04-05
        • 2018-03-09
        相关资源
        最近更新 更多