Reading CSV into a Spark Dataframe with timestamp and date types

[Question title]: Reading CSV into a Spark Dataframe with timestamp and date types
[Posted]: 2017-04-14 04:17:27
[Question]:

This is CDH with Spark 1.6.

I am trying to import this hypothetical CSV into an Apache Spark DataFrame:

$ hadoop fs -cat test.csv
a,b,c,2016-09-09,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10,a,2016-11-11 09:09:10.0,a

I use the databricks-csv jar.

val textData = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
    .option("inferSchema", "true")
    .option("nullValue", "null")
    .load("test.csv")

I use inferSchema to create the schema for the resulting DataFrame. For the code above, printSchema() gives me the following output:

scala> textData.printSchema()
root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: string (nullable = true)
 |-- C4: string (nullable = true)
 |-- C5: timestamp (nullable = true)
 |-- C6: string (nullable = true)

scala> textData.show()
+---+---+---+----------+---+--------------------+---+
| C0| C1| C2|        C3| C4|                  C5| C6|
+---+---+---+----------+---+--------------------+---+
|  a|  b|  c|2016-09-09|  a|2016-11-11 09:09:...|  a|
|  a|  b|  c|2016-09-10|  a|2016-11-11 09:09:...|  a|
+---+---+---+----------+---+--------------------+---+

Column C3 has String type. I want C3 to have date type. To get a date type, I tried the following code.

val textData = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd")
    .option("inferSchema", "true")
    .option("nullValue", "null")
    .load("test.csv")

scala> textData.printSchema
root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: timestamp (nullable = true)
 |-- C4: string (nullable = true)
 |-- C5: timestamp (nullable = true)
 |-- C6: string (nullable = true)

scala> textData.show()
+---+---+---+--------------------+---+--------------------+---+
| C0| C1| C2|                  C3| C4|                  C5| C6|
+---+---+---+--------------------+---+--------------------+---+
|  a|  b|  c|2016-09-09 00:00:...|  a|2016-11-11 00:00:...|  a|
|  a|  b|  c|2016-09-10 00:00:...|  a|2016-11-11 00:00:...|  a|
+---+---+---+--------------------+---+--------------------+---+

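The only difference between this code and the first block is the dateFormat option line (I use "yyyy-MM-dd" instead of "yyyy-MM-dd HH:mm:ss"). Now I get both C3 and C5 as timestamps (C3 is still not a date). But for C5, the HH:mm:ss part is ignored and shows up as zeros in the data.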
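
Both results can be reproduced outside Spark. The sketch below assumes that spark-csv passes the dateFormat pattern to java.text.SimpleDateFormat, as its README describes; the object and helper names (DateFormatDemo, parseAndRender) are made up for illustration, and the UTC time zone is set only to keep the output stable.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone
import scala.util.Try

object DateFormatDemo {
  private val utc = TimeZone.getTimeZone("UTC")

  // Parse `value` with `pattern`, then render the result with a full
  // timestamp pattern so that a dropped time-of-day becomes visible.
  // Returns None when the value does not match the pattern.
  def parseAndRender(pattern: String, value: String): Option[String] = {
    val parser = new SimpleDateFormat(pattern)
    parser.setTimeZone(utc)
    val renderer = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    renderer.setTimeZone(utc)
    Try(renderer.format(parser.parse(value))).toOption
  }

  def main(args: Array[String]): Unit = {
    // First attempt: the full pattern cannot parse a bare date,
    // so a column like C3 cannot be recognised as a timestamp.
    println(parseAndRender("yyyy-MM-dd HH:mm:ss", "2016-09-09"))            // None
    println(parseAndRender("yyyy-MM-dd HH:mm:ss", "2016-11-11 09:09:09.0")) // Some(2016-11-11 09:09:09)

    // Second attempt: the short pattern parses only the leading date
    // and silently ignores the rest of the string.
    println(parseAndRender("yyyy-MM-dd", "2016-09-09"))                     // Some(2016-09-09 00:00:00)
    println(parseAndRender("yyyy-MM-dd", "2016-11-11 09:09:09.0"))          // Some(2016-11-11 00:00:00)
  }
}
```

SimpleDateFormat.parse(String) reads from the start of the text and does not have to consume all of it, which would explain why "yyyy-MM-dd" accepts the C5 values but zeroes their time part.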

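Ideally, I want C3 to be of date type and C5 to be of timestamp type, with its HH:mm:ss part preserved. My current workaround looks like this. I generate the csv by extracting data from my database in parallel, and I make sure all dates are exported as timestamps (not ideal). So the test csv now looks like this: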

$ hadoop fs -cat new-test.csv
a,b,c,2016-09-09 00:00:00,a,2016-11-11 09:09:09.0,a
a,b,c,2016-09-10 00:00:00,a,2016-11-11 09:09:10.0,a

This is my final working code:

val textData = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd HH:mm:ss")
    .schema(finalSchema)
    .option("nullValue", "null")
    .load("new-test.csv")

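Here, I use the full timestamp format ("yyyy-MM-dd HH:mm:ss") in dateFormat. I manually created the finalSchema instance, in which C3 is a date and C5 is a timestamp (Spark SQL types). I apply this schema with the schema() function. The output looks like this: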

scala> finalSchema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(C0,StringType,true), StructField(C1,StringType,true), StructField(C2,StringType,true), StructField(C3,DateType,true), StructField(C4,StringType,true), StructField(C5,TimestampType,true), StructField(C6,StringType,true))

scala> textData.printSchema()
root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: date (nullable = true)
 |-- C4: string (nullable = true)
 |-- C5: timestamp (nullable = true)
 |-- C6: string (nullable = true)


scala> textData.show()
+---+---+---+----------+---+--------------------+---+
| C0| C1| C2|        C3| C4|                  C5| C6|
+---+---+---+----------+---+--------------------+---+
|  a|  b|  c|2016-09-09|  a|2016-11-11 09:09:...|  a|
|  a|  b|  c|2016-09-10|  a|2016-11-11 09:09:...|  a|
+---+---+---+----------+---+--------------------+---+
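
The code that builds finalSchema is not shown above. A schema that prints the same way could be written roughly as follows; this is a sketch for spark-shell on Spark 1.6, and the column names simply follow the C0..C6 defaults seen in the printSchema() output.

```scala
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType, TimestampType}

// One StructField per CSV column, in file order.
// Only C3 (date) and C5 (timestamp) differ from plain strings.
val finalSchema = StructType(Seq(
  StructField("C0", StringType, nullable = true),
  StructField("C1", StringType, nullable = true),
  StructField("C2", StringType, nullable = true),
  StructField("C3", DateType, nullable = true),
  StructField("C4", StringType, nullable = true),
  StructField("C5", TimestampType, nullable = true),
  StructField("C6", StringType, nullable = true)
))
```

Passing this value to .schema(finalSchema) in place of .option("inferSchema", "true") skips type inference altogether.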

Is there an easier or out-of-the-box way to parse a csv file (one that has both date and timestamp types) into a Spark DataFrame?

[Comments]:

Tags: apache-spark apache-spark-sql apache-spark-1.6

[Answer 1]:

For non-trivial cases, the inferSchema option may not return the expected result. As you can see in InferSchema.scala:

if (field == null || field.isEmpty || field == nullValue) {
  typeSoFar
} else {
  typeSoFar match {
    case NullType => tryParseInteger(field)
    case IntegerType => tryParseInteger(field)
    case LongType => tryParseLong(field)
    case DoubleType => tryParseDouble(field)
    case TimestampType => tryParseTimestamp(field)
    case BooleanType => tryParseBoolean(field)
    case StringType => StringType
    case other: DataType =>
      throw new UnsupportedOperationException(s"Unexpected data type $other")
  }
}

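It only tries to match each column against the timestamp type, never the date type, so an "out of the box" solution is not possible in this case. In my experience, the "easier" solution is to define the schema directly with the types you need. That also keeps the infer option from choosing a type that matches only the part of the RDD it evaluated rather than the entire data. Your final schema is a valid solution.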
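
To make the fall-through concrete, here is a simplified stand-in for that inference chain in plain Scala. It is not the spark-csv source: the ColType objects and the InferSketch name are invented for the sketch, nullValue handling is left out, and using java.sql.Timestamp.valueOf as the timestamp check (the case where no dateFormat is set) is an assumption.

```scala
import java.sql.Timestamp
import scala.util.Try

object InferSketch {
  sealed trait ColType
  case object NullT extends ColType
  case object IntegerT extends ColType
  case object LongT extends ColType
  case object DoubleT extends ColType
  case object TimestampT extends ColType
  case object BooleanT extends ColType
  case object StringT extends ColType

  // Each step falls through to the next candidate when parsing fails.
  // Note that there is no date step anywhere in the chain.
  def tryInteger(s: String): ColType =
    if (Try(s.toInt).isSuccess) IntegerT else tryLong(s)
  def tryLong(s: String): ColType =
    if (Try(s.toLong).isSuccess) LongT else tryDouble(s)
  def tryDouble(s: String): ColType =
    if (Try(s.toDouble).isSuccess) DoubleT else tryTimestamp(s)
  def tryTimestamp(s: String): ColType =
    if (Try(Timestamp.valueOf(s)).isSuccess) TimestampT else tryBoolean(s)
  def tryBoolean(s: String): ColType =
    if (Try(s.toBoolean).isSuccess) BooleanT else StringT

  // Mirrors the shape of the excerpt: resume from the type inferred so far.
  def inferField(typeSoFar: ColType, field: String): ColType =
    if (field == null || field.isEmpty) typeSoFar
    else typeSoFar match {
      case NullT | IntegerT => tryInteger(field)
      case LongT            => tryLong(field)
      case DoubleT          => tryDouble(field)
      case TimestampT       => tryTimestamp(field)
      case BooleanT         => tryBoolean(field)
      case StringT          => StringT
    }

  // Fold the per-field inference over all values of one column.
  def inferColumn(values: Seq[String]): ColType =
    values.foldLeft(NullT: ColType)(inferField)

  def main(args: Array[String]): Unit = {
    println(inferColumn(Seq("2016-09-09", "2016-09-10")))                       // StringT
    println(inferColumn(Seq("2016-11-11 09:09:09.0", "2016-11-11 09:09:10.0"))) // TimestampT
  }
}
```

In this model a bare date such as 2016-09-09 fails every check and ends up as a string, which matches the first printSchema() output in the question.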

[Comments]:

[Answer 2]:

It's not very elegant, but you can convert the timestamp to a date like this (check the last line):

import org.apache.spark.sql.functions.expr

val textData = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "false")
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd")
    .option("inferSchema", "true")
    .option("nullValue", "null")
    .load("test.csv")
    // C3 is inferred as a timestamp; to_date turns it into a date column
    .withColumn("C3", expr("""to_date(C3)"""))

[Comments]: