[Posted]: 2017-03-03 01:58:14
[Question]:
I am trying to solve this problem on Kaggle using Spark:
The input hierarchy looks like this:
drivers/{driver_id}/trip#.csv
e.g., drivers/1/1.csv
drivers/1/2.csv
drivers/2/1.csv
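I want to read the parent directory "drivers" and, for each subdirectory, create a pair RDD whose key is (sub_directory, file_name) and whose value is the content of the file.
I looked at this link and tried: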
val text = sc.wholeTextFiles("drivers")
text.collect()
This fails with the error:
java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:591)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:283)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:243)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:267)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1779)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:885)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.collect(RDD.scala:884)
However, when I run the code below, it works:
val text = sc.wholeTextFiles("drivers/1")
text.collect()
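But I don't want to do it this way, because I would then have to list the drivers directory myself, loop over its subdirectories, and call wholeTextFiles for each entry.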
[Comments]:
-
Have you tried val text = sc.wholeTextFiles("drivers/*") ?
-
Thanks. Yes, that worked.
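For reference, here is a minimal sketch of how the glob from the comment above could be combined with the (sub_directory, file_name) key the question asks for. It is an illustration under stated assumptions, not a tested solution. wholeTextFiles returns (path, content) pairs, so the key has to be derived from the path. The object name DriversByTrip, the helper keyFromPath, the app name, and the local[*] master are all hypothetical names chosen for this example. The helper assumes every path ends in .../{driver_id}/{file_name}.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DriversByTrip {
  // Hypothetical helper: turns a full file path such as
  // "file:/data/drivers/1/2.csv" into ("1", "2.csv"), i.e. the name of the
  // last directory and the file name. Assumes at least two path segments.
  def keyFromPath(path: String): (String, String) = {
    val parts = path.split('/').filter(_.nonEmpty)
    (parts(parts.length - 2), parts(parts.length - 1))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: running locally; in spark-shell an `sc` already exists.
    val conf = new SparkConf().setAppName("drivers-by-trip").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // The glob matches each subdirectory of "drivers", so the files inside
      // every driver directory are read in a single call.
      val trips: RDD[((String, String), String)] =
        sc.wholeTextFiles("drivers/*").map { case (path, content) =>
          (keyFromPath(path), content)
        }
      // Print a few keys to check the shape of the result.
      trips.keys.take(5).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

In spark-shell, only the keyFromPath definition and the wholeTextFiles(...).map(...) expression would be needed, since sc is already provided there.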
Tags: scala hadoop apache-spark kaggle