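To handle multiple date formats present in the data, you can use to_date to parse the column once per format, each into a new column, and then use coalesce to pick the first non-null value.
You can find more information here - Parse Date Format
Date parsing patterns available in Spark - https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
A typical example follows -
Data Preparation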
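from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F

# Assumption: `sql` is a SparkSession; the original snippet does not show how it was created
sql = SparkSession.builder.getOrCreate()

df = pd.read_csv(StringIO("""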
Date received,Date sent to company
11/13/2014,11/13/2014
11/13/2014,11/13/2014
11/13/2014,11/13/2014
11/13/2014,11/13/2014
12-11-2014,11/13/2014
12-11-2014,11/13/2014
12-11-2014,11/13/2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
12-11-2014,11-12-2014
"""),delimiter=",")
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+-------------+--------------------+
|Date received|Date sent to company|
+-------------+--------------------+
|   11/13/2014|          11/13/2014|
|   11/13/2014|          11/13/2014|
|   11/13/2014|          11/13/2014|
|   11/13/2014|          11/13/2014|
|   12-11-2014|          11/13/2014|
|   12-11-2014|          11/13/2014|
|   12-11-2014|          11/13/2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
|   12-11-2014|          11-12-2014|
+-------------+--------------------+
To Date
sparkDF = sparkDF.withColumn('p1',F.to_date(F.col('Date received'),'MM/dd/yyyy'))\
.withColumn('p2',F.to_date(F.col('Date received'),'MM-dd-yyyy'))
sparkDF.show()
+-------------+--------------------+----------+----------+
|Date received|Date sent to company|        p1|        p2|
+-------------+--------------------+----------+----------+
|   11/13/2014|          11/13/2014|2014-11-13|      null|
|   11/13/2014|          11/13/2014|2014-11-13|      null|
|   11/13/2014|          11/13/2014|2014-11-13|      null|
|   11/13/2014|          11/13/2014|2014-11-13|      null|
|   12-11-2014|          11/13/2014|      null|2014-12-11|
|   12-11-2014|          11/13/2014|      null|2014-12-11|
|   12-11-2014|          11/13/2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
|   12-11-2014|          11-12-2014|      null|2014-12-11|
+-------------+--------------------+----------+----------+
Coalesce
sparkDF = sparkDF.withColumn('date_received_parsed',F.coalesce(F.col('p1'),F.col('p2')))\
.drop(*['p1','p2'])
sparkDF.show()
+-------------+--------------------+--------------------+
|Date received|Date sent to company|date_received_parsed|
+-------------+--------------------+--------------------+
|   11/13/2014|          11/13/2014|          2014-11-13|
|   11/13/2014|          11/13/2014|          2014-11-13|
|   11/13/2014|          11/13/2014|          2014-11-13|
|   11/13/2014|          11/13/2014|          2014-11-13|
|   12-11-2014|          11/13/2014|          2014-12-11|
|   12-11-2014|          11/13/2014|          2014-12-11|
|   12-11-2014|          11/13/2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
|   12-11-2014|          11-12-2014|          2014-12-11|
+-------------+--------------------+--------------------+