【问题标题】:Spark AVRO S3 read not working for partitioned dataSpark AVRO S3 读取不适用于分区数据
【发布时间】:2017-04-01 11:32:00
【问题描述】:

当我读取特定文件时,它会起作用:

val filePath= "s3n://bucket_name/f1/f2/avro/dt=2016-10-19/hr=19/000000"        
val df = spark.read.avro(filePath)

但如果我指向一个文件夹来读取日期分区数据,它会失败:

val filePath="s3n://bucket_name/f1/f2/avro/dt=2016-10-19/"

我收到此错误:

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/f1%2Ff2%2Favro%2Fdt%3D2016-10-19' - ResponseCode=403, ResponseMessage=Forbidden
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleServiceException(Jets3tNativeFileSystemStore.java:245)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy7.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:374)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
at com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at BasicS3Avro$.main(BasicS3Avro.scala:55)
at BasicS3Avro.main(BasicS3Avro.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

我错过了什么吗?

【问题讨论】:

  • 它看起来像一个身份验证错误,服务响应为 403。也许你应该检查与存储桶关联的策略。
  • 但是当我访问特定文件时,相同的凭据有效。另外,我可以使用 aws 命令行列出文件夹的内容。
  • 好的,让我尝试在本地复制它...
  • 试试吧:val filePath="s3n://bucket_name/f1/f2/avro/dt=2016-10-19" - 没有结束斜线
  • @devsprint 同样的错误:|

标签: apache-spark amazon-s3 avro spark-avro


【解决方案1】:

更新、维护的 s3a 客户端报告了什么?

【讨论】:

  • 抱歉,您能重新表述一下您的问题吗? @Steve Loughran
猜你喜欢
  • 2019-04-14
  • 1970-01-01
  • 2018-01-29
  • 2018-11-24
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2015-10-31
  • 2021-12-27
相关资源
最近更新 更多