【Posted】: 2018-01-25 12:45:23
【Problem description】:
My input CSV file contains the following record:
"2017-11-01","2017-10-29","2017-11-04","4532491","","","","Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation","1000","Richard W. Judd"
When I read this CSV in pyspark, the field "Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation" is split into separate columns.
>>> df = spark.read.csv('file.csv')
>>> df.show(truncate=False)
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
|_c0       |_c1       |_c2       |_c3       |_c4 |_c5 |_c6 |_c7                                                      |_c8    |_c9             |_c10|_c11           |
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
|2017-11-01|2017-10-29|2017-11-04| 4532491 |null|null|null|Natural States: "The Environmental Imagination" in Maine | Oregon| and the Nation |1000|Richard W. Judd|
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
Is there any workaround other than changing the delimiter in the input file? We cannot modify the input file.
【Discussion】:
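One possible workaround, sketched below in plain Python, relies on the fact that every field in the sample record is wrapped in double quotes. Under that assumption the three-character sequence `","` can be treated as the only field boundary, so commas and unescaped quotes inside a field are left alone. The helper name `split_fully_quoted` is hypothetical and is not part of pyspark. The sketch would break if a field itself contained `","`, or if a record spanned several lines.

```python
def split_fully_quoted(line):
    """Split one CSV record in which every field is wrapped in double quotes.

    Hypothetical helper for illustration; assumes no field contains the
    literal sequence "," and that each record sits on a single line.
    """
    line = line.rstrip("\r\n")
    # Drop the opening quote of the first field and the closing quote of the last.
    if len(line) >= 2 and line.startswith('"') and line.endswith('"'):
        line = line[1:-1]
    # Use the sequence "," (quote, comma, quote) as the only field boundary.
    return line.split('","')


# The record from the question, copied verbatim.
record = (
    '"2017-11-01","2017-10-29","2017-11-04","4532491","","","",'
    '"Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation",'
    '"1000","Richard W. Judd"'
)

fields = split_fully_quoted(record)
print(len(fields))  # 10
print(fields[7])    # the full title, kept as a single field
```

In pyspark the same function could presumably be applied to the raw lines, for example `spark.read.text('file.csv').rdd.map(lambda row: split_fully_quoted(row.value))`, and the result converted to a DataFrame afterwards. That part is untested here. The `quote` and `escape` options of `spark.read.csv` may also be worth trying, although whether they cope with unescaped inner quotes like these is not verified.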
Tags: python apache-spark pyspark