【Question title】: Reading a dataframe after converting to csv file renders incorrect dataframe in Scala
【Posted】: 2018-07-15 22:24:45
【Question】:

I am trying to write the following dataframe to a csv file:

df:

+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
|               title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser|  _id|              author|         description|   genre|price|publish_date|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+
|XML Developer's G...|          _CONFIG_CONTEXT|                       #id13|                           qwe|              18|bk101|Gambardella, Matthew|An in-depth look ...|Computer|44.95|  2000-10-01|
|       Midnight Rain|          _CONFIG_CONTEXT|                       #id13|                        dfdfrt|              19|bk102|          Ralls, Kim|A former architec...| Fantasy| 5.95|  2000-12-16|
|     Maeve Ascendant|          _CONFIG_CONTEXT|                       #id13|                          dfdf|              20|bk103|         Corets, Eva|After the collaps...| Fantasy| 5.95|  2000-11-17|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+--------+-----+------------+

I am using this code to write it to a csv file:

df.write.format("com.databricks.spark.csv").option("header", "true").save("hdfsOut")

This creates 3 different csv files in the folder hdfsOut. When I try to read the dataframe back using

var csvdf = spark.read.format("org.apache.spark.csv").option("header", true).csv("hdfsOut")
csvdf.show()

it shows the dataframe in an incorrect form, as below:

+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
|               title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser|  _id|              author|         description|genre|price|publish_date|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+
|     Maeve Ascendant|          _CONFIG_CONTEXT|                       #id13|                          dfdf|              20|bk103|         Corets, Eva|After the collaps...| null| null|        null|
|      society in ...|      the young surviv...|                        null|                          null|            null| null|                null|                null| null| null|        null|
|      foundation ...|                  Fantasy|                        5.95|                    2000-11-17|            null| null|                null|                null| null| null|        null|
|       Midnight Rain|          _CONFIG_CONTEXT|                       #id13|                        dfdfrt|              19|bk102|          Ralls, Kim|A former architec...| null| null|        null|
|      an evil sor...|      and her own chil...|                        null|                          null|            null| null|                null|                null| null| null|        null|
|      of the world."|                  Fantasy|                        5.95|                    2000-12-16|            null| null|                null|                null| null| null|        null|
|XML Developer's G...|          _CONFIG_CONTEXT|                       #id13|                           qwe|              18|bk101|Gambardella, Matthew|An in-depth look ...| null| null|        null|
|         with XML...|                 Computer|                       44.95|                    2000-10-01|            null| null|                null|                null| null| null|        null|
+--------------------+-------------------------+----------------------------+------------------------------+----------------+-----+--------------------+--------------------+-----+-----+------------+

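I need this csv file in order to feed it to Amazon Athena. When I do that, Athena also renders the data in the same broken form shown in the second output. Ideally, reading the converted csv file back should show only 3 rows.

Any idea why this is happening, and how I can fix it so that the csv data is rendered in the correct form shown in the first output?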

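【Comments】:

What characters come immediately before "society in", "foundation", and so on, before you write to csv?
Those are basically the contents of the description column, like this: After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.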

Tags: scala apache-spark dataframe apache-spark-sql

【Solution 1】:

The data in the description column must contain newline characters and commas, as in

"After the collapse of a nanotechnology \nsociety in England, the young survivors lay the \nfoundation for a new society"

So, for testing purposes, I created a dataframe

// toDF needs the session's implicits; spark-shell imports them automatically.
import spark.implicits._

// A single row whose description holds embedded newlines and a comma.
val df = Seq(
  ("Maeve Ascendant", "_CONFIG_CONTEXT", "#id13", "dfdf", "20", "bk103", "Corets, Eva", "After the collapse of a nanotechnology \nsociety in England, the young survivors lay the \nfoundation for a new society", "Fantasy", "5.95", "2000-11-17")
).toDF("title", "UserData.UserValue._title", "UserData.UserValue._valueRef", "UserData.UserValue._valuegiven", "UserData._idUser", "_id", "author", "description", "genre", "price", "publish_date")

df.show() gives me the same dataframe format as in your question

+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+--------------------+-------+-----+------------+
|          title|UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser|  _id|     author|         description|  genre|price|publish_date|
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+--------------------+-------+-----+------------+
|Maeve Ascendant|          _CONFIG_CONTEXT|                       #id13|                          dfdf|              20|bk103|Corets, Eva|After the collaps...|Fantasy| 5.95|  2000-11-17|
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+--------------------+-------+-----+------------+

But df.show(false) shows the exact values

+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+---------------------------------------------------------------------------------------------------------------------+-------+-----+------------+
|title          |UserData.UserValue._title|UserData.UserValue._valueRef|UserData.UserValue._valuegiven|UserData._idUser|_id  |author     |description                                                                                                          |genre  |price|publish_date|
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+---------------------------------------------------------------------------------------------------------------------+-------+-----+------------+
|Maeve Ascendant|_CONFIG_CONTEXT          |#id13                       |dfdf                          |20              |bk103|Corets, Eva|After the collapse of a nanotechnology 
society in England, the young survivors lay the 
foundation for a new society|Fantasy|5.95 |2000-11-17  |
+---------------+-------------------------+----------------------------+------------------------------+----------------+-----+-----------+---------------------------------------------------------------------------------------------------------------------+-------+-----+------------+

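When you save this as csv, Spark writes it out as a plain text file with those newlines and commas still in the data. A csv reader, by default, treats each newline as the start of a new row and each comma as the start of a new field. That is the culprit in your data.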
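
To make that concrete, here is a minimal plain-Scala sketch with no Spark involved. The object NaiveCsvDemo and its helpers quote, toCsvRecord and naiveParse are hypothetical names made up for this illustration, and the reader is deliberately naive: it is not Spark's actual parser, only a model of "one physical line is one row".

```scala
// Illustration only: a deliberately naive, line-based csv reader.
object NaiveCsvDemo {
  // Hypothetical helper: quote a field when it contains a comma, quote, or newline.
  def quote(field: String): String =
    if (field.exists(c => c == ',' || c == '"' || c == '\n'))
      "\"" + field.replace("\"", "\"\"") + "\""
    else field

  // One logical record; the resulting text may span several physical lines.
  def toCsvRecord(fields: Seq[String]): String = fields.map(quote).mkString(",")

  // Naive reader: every physical line is a row, every comma separates fields.
  def naiveParse(text: String): Seq[Seq[String]] =
    text.split("\n").toSeq.map(_.split(",").toSeq)

  def main(args: Array[String]): Unit = {
    val record = Seq(
      "Maeve Ascendant",
      "After the collapse of a nanotechnology \nsociety in England, the young survivors lay the \nfoundation for a new society",
      "Fantasy")
    val rows = naiveParse(toCsvRecord(record))
    // One logical record comes back as several broken rows.
    println(s"rows seen by the naive reader: ${rows.size}")
    rows.foreach(r => println(r.mkString(" | ")))
  }
}
```

The single record is split at each embedded newline, and the comma inside the description splits the middle line again, which is the same pattern as the broken output in the question.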

Solution 1

You can save the dataframe in parquet format, which preserves the dataframe's properties, and read it back as parquet:

df.write.parquet("hdfsOut")
var csvdf = spark.read.parquet("hdfsOut")

Solution 2

Save it in csv format and use the multiLine option when reading:

df.write.format("com.databricks.spark.csv").option("header", "true").save("hdfsOut")
var csvdf = spark.read.format("org.apache.spark.csv").option("multiLine", "true").option("header", true).csv("hdfsOut")
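
If you want to check the effect yourself, the following is a rough sketch that writes one multi-line record and counts the rows read back with and without multiLine (the option is available in Spark 2.2 and later). The object name MultiLineRoundTrip, the temporary output directory, the simplified three-column schema and the local[*] master are all assumptions for the sake of a small example; they are not taken from the question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object MultiLineRoundTrip {
  // Writes one record whose description contains newlines, then counts the rows
  // read back. Returns (count without multiLine, count with multiLine).
  def rowCounts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    val df = Seq(
      ("Maeve Ascendant",
       "After the collapse of a nanotechnology \nsociety in England, the young survivors lay the \nfoundation for a new society",
       "Fantasy")
    ).toDF("title", "description", "genre")

    // Hypothetical scratch location; any path Spark can write to would do.
    val out = Files.createTempDirectory("csv-demo").resolve("out").toString
    df.write.option("header", "true").csv(out)

    // Default reader: each physical line is treated as a record.
    val plain = spark.read.option("header", "true").csv(out).count()
    // multiLine lets a quoted field span several physical lines.
    val multi = spark.read.option("header", "true").option("multiLine", "true").csv(out).count()
    (plain, multi)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running the sketch outside a cluster.
    val spark = SparkSession.builder().master("local[*]").appName("multiline-demo").getOrCreate()
    try {
      val (plain, multi) = rowCounts(spark)
      println(s"rows without multiLine: $plain, with multiLine: $multi")
    } finally spark.stop()
  }
}
```

With multiLine the record should come back as a single row, while the default reader should report more rows than were written. Keep in mind that this option only changes how Spark reads the files; the csv files on disk are the same either way.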

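I hope the answer is helpful

【Comments】:

Thanks for the answer. Yes, the data was badly formatted and contained newline characters. Marked as accepted.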