【Posted】:2020-07-17 15:38:57
【Question】:
I am trying to read JSON files under paths that are built from a regex-style pattern, as shown below.
paths.par.foreach { path =>
  val pathWithRegex = s"${path}/*/${dateRegex}/"
  val jsonDF = sqlContext.read.json(pathWithRegex)
}
paths could be - hdfs://servername/data/a, hdfs://servername/data/b, hdfs://servername/data/c
dateRegex could be - 2020-05-*
Directories present in hdfs
hdfs://servername/data/a/something/2020-05-11/file1
hdfs://servername/data/a/something/2020-05-12/file1
hdfs://servername/data/b/something/2020-05-11/file1
hdfs://servername/data/c/something/2020-06-11/file1
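When I pass 2020-05-* as dateRegex, it throws an error for hdfs://servername/data/c/*/2020-05-*/ because the path does not exist. Is there a way to avoid the error and continue? I tried the checkDirExist method below, but it does not seem to work for regexes/patterns.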
def checkDirExist(path: String, sc: SparkContext): Boolean = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val p = new Path(path)
  fs.exists(p)
}
paths.par.foreach { path =>
  val pathWithRegex = s"${path}/*/${dateRegex}/"
  if (checkDirExist(pathWithRegex, sc)) { // Doesn't work. Always false if pattern is in path string
    val jsonDF = sqlContext.read.json(pathWithRegex)
  }
}
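For reference, one possible direction is sketched here; it is not something the post confirms. `fs.exists` appears to treat the string as a literal path, whereas Hadoop's `FileSystem.globStatus` expands glob patterns such as `*`. The object name `GlobCheck` and the helper `globExists` are hypothetical names introduced only for illustration, and the sketch assumes the Hadoop client libraries are on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object GlobCheck {
  // Hypothetical helper (not from the original post): returns true when the
  // glob pattern matches at least one existing path.
  def globExists(pattern: String, conf: Configuration): Boolean = {
    val p = new Path(pattern)
    // Resolve the file system from the path itself so hdfs:// URIs are honored.
    val fs = p.getFileSystem(conf)
    // globStatus can return null or an empty array when nothing matches,
    // so both cases are treated as "does not exist".
    val matches = fs.globStatus(p)
    matches != null && matches.nonEmpty
  }
}

// Assumed usage inside the loop from the question:
//   if (GlobCheck.globExists(pathWithRegex, sc.hadoopConfiguration)) {
//     val jsonDF = sqlContext.read.json(pathWithRegex)
//   }
```

With the directories listed above, this check would be expected to skip hdfs://servername/data/c for 2020-05-* rather than fail, though that has not been verified against the asker's cluster.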
【Comments】:
- Can you share your input and output?
Tags: apache-spark hadoop hdfs