【Posted】: 2018-07-15 22:24:45
【Question】:
I am trying to write the following DataFrame to a CSV file:
df:
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
| title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser| _id| author| description| genre|price|publish_date|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
|XML Developer's G...| _CONFIG_CONTEXT| #id13| qwe| 18|bk101|Gambardella, Matthew|An in-depth look ...|Computer|44.95| 2000-10-01|
| Midnight Rain| _CONFIG_CONTEXT| #id13| dfdfrt| 19|bk102| Ralls, Kim|A former architec...| Fantasy| 5.95| 2000-12-16|
| Maeve Ascendant| _CONFIG_CONTEXT| #id13| dfdf| 20|bk103| Corets, Eva|After the collaps...| Fantasy| 5.95| 2000-11-17|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
I am using this code to write the CSV file:
df.write.format("com.databricks.spark.csv").option("header", "true").save("hdfsOut")
This creates 3 different CSV files in the folder hdfsOut. When I try to read them back with
var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv("hdfsOut")
csvdf.show()
the DataFrame is displayed in an incorrect form, as shown here:
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
| title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser| _id| author| description|genre|price|publish_date|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
| Maeve Ascendant| _CONFIG_CONTEXT| #id13| dfdf| 20|bk103| Corets, Eva|After the collaps...| null| null| null|
| society in ...| the young surviv...| null| null| null| null| null| null| null| null| null|
| foundation ...| Fantasy| 5.95| 2000-11-17| null| null| null| null| null| null| null|
| Midnight Rain| _CONFIG_CONTEXT| #id13| dfdfrt| 19|bk102| Ralls, Kim|A former architec...| null| null| null|
| an evil sor...| and her own chil...| null| null| null| null| null| null| null| null| null|
| of the world."| Fantasy| 5.95| 2000-12-16| null| null| null| null| null| null| null|
|XML Developer's G...| _CONFIG_CONTEXT| #id13| qwe| 18|bk101|Gambardella, Matthew|An in-depth look ...| null| null| null|
| with XML...| Computer| 44.95| 2000-10-01| null| null| null| null| null| null| null|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
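I need this CSV file so that I can feed it to Amazon Athena. When I do that, Athena also renders the data in the same broken format shown in the second output. Ideally, reading the converted CSV back should show only 3 rows.
Any idea why this happens? How can I fix it so that the CSV data is rendered in the correct form, as in the first output?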
【Comments】:
-
Before you write to CSV, what are the characters immediately preceding "society in", "foundation", and so on?
-
Those are basically part of the text in the description column, like this:
After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.
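The broken rows start exactly where the description text wraps ("society in ...", "foundation ..."), which is consistent with the description values containing embedded line breaks. A line-based CSV reader then treats each break as the start of a new record. Below is a minimal sketch of two possible workarounds, assuming Spark 2.2 or later (where the built-in CSV reader has the multiLine option). The object name, the sample rows, and the temporary paths are hypothetical stand-ins, not the original data. Option 1 only helps when Spark itself reads the files back. For Athena, whose CSV SerDes do not, as far as I know, accept line breaks inside a field, Option 2 (flattening the text before writing) is probably the safer route.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

object MultilineCsvSketch {
  // Local session for illustration only; in a real job reuse the existing `spark`.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("multiline-csv-sketch")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical stand-in for the question's df: the first description
  // contains a line break, as the original data appears to.
  val df: DataFrame = Seq(
    ("bk103",
     "After the collapse of a nanotechnology society in England,\nthe young survivors lay the foundation for a new society.",
     "Fantasy"),
    ("bk102", "A former architect battles corporate zombies.", "Fantasy")
  ).toDF("_id", "description", "genre")

  // Fresh, not-yet-existing output directory under a temp folder.
  private def tmpPath(name: String): String =
    Files.createTempDirectory("csv-sketch").resolve(name).toString

  // Option 1: write as-is, then tell the reader that a quoted field
  // may span several physical lines.
  def readBackMultiLine(): DataFrame = {
    val out = tmpPath("raw")
    df.write.option("header", "true").csv(out)
    spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .csv(out)
  }

  // Option 2: replace line breaks with a space before writing, so every
  // record occupies exactly one line and any line-based reader can parse it.
  def writeFlattened(): DataFrame = {
    val out = tmpPath("flat")
    val flat = df.withColumn(
      "description",
      regexp_replace(col("description"), "[\\r\\n]+", " "))
    flat.write.option("header", "true").csv(out)
    spark.read.option("header", "true").csv(out)
  }
}
```

In both cases reading the output back should yield one row per original record. If several text columns can contain line breaks, the same regexp_replace call would need to be applied to each of them.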
Tags: scala apache-spark dataframe apache-spark-sql