【Question Title】: To read a field with comma and quotes in csv where comma is delimiter - pyspark
【Posted】: 2018-01-25 12:45:23
【Question】:

My input csv file contains the following record:

"2017-11-01","2017-10-29","2017-11-04","4532491","","","","Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation","1000","Richard W. Judd"

When I read this csv in pyspark, the field "Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation" is split into separate columns.

>>> df = spark.read.csv('file.csv')
>>> df.show(truncate=False)
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
|_c0       |_c1       |_c2       |_c3       |_c4 |_c5 |_c6 |_c7                                                      |_c8    |_c9             |_c10|_c11           |
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+
|2017-11-01|2017-10-29|2017-11-04| 4532491  |null|null|null|Natural States: "The Environmental Imagination" in Maine | Oregon| and the Nation |1000|Richard W. Judd|
+----------+----------+----------+----------+----+----+----+---------------------------------------------------------+-------+----------------+----+---------------+

Is there any workaround other than changing the delimiter in the input file? We cannot change the input file.

【Discussion】:

    Tags: python apache-spark pyspark


    【Solution 1】:

    You can read the file with sparkContext, split each line on the multi-character sequence "," and then convert the rdd to a dataframe, as follows

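    # Assumes `sc` is the active SparkContext (e.g. spark.sparkContext)
    rdd = sc.textFile("file.csv")

    def replaceFunc(line):
        # Split on the three-character sequence "," that separates quoted fields
        result = []
        for word in line.split("\",\""):
            # Strip every remaining double quote, including quotes inside a field
            result.append(word.replace("\"", ""))
        return result

    rdd.map(replaceFunc).toDF().show(1, False)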
    

    You should get the following output

    +----------+----------+----------+-------+---+---+---+------------------------------------------------------------------------------+----+---------------+
    |_1        |_2        |_3        |_4     |_5 |_6 |_7 |_8                                                                            |_9  |_10            |
    +----------+----------+----------+-------+---+---+---+------------------------------------------------------------------------------+----+---------------+
    |2017-11-01|2017-10-29|2017-11-04|4532491|   |   |   |Natural States: The Environmental Imagination in Maine, Oregon, and the Nation|1000|Richard W. Judd|
    +----------+----------+----------+-------+---+---+---+------------------------------------------------------------------------------+----+---------------+
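
    Note that the function above removes every double quote, so the inner quotes around The Environmental Imagination are lost in the output. If they need to be kept, one possible variation is to strip only the outermost quote pair before splitting. The sketch below is plain Python with no Spark dependency; the helper name `split_quoted_record` is hypothetical, and it assumes every field is wrapped in double quotes and that no field itself contains the sequence ",".

```python
def split_quoted_record(line):
    """Split a fully quoted CSV line on the sequence '","', keeping inner quotes."""
    # Drop the line terminator, if any.
    body = line.rstrip("\r\n")
    # Remove only the opening quote of the first field and the
    # closing quote of the last field.
    if len(body) >= 2 and body.startswith('"') and body.endswith('"'):
        body = body[1:-1]
    # Split on quote-comma-quote; quotes inside a field are left untouched.
    return body.split('","')


sample = (
    '"2017-11-01","2017-10-29","2017-11-04","4532491","","","",'
    '"Natural States: "The Environmental Imagination" in Maine, Oregon, and the Nation",'
    '"1000","Richard W. Judd"'
)
fields = split_quoted_record(sample)
print(len(fields))
print(fields[7])
```

    With Spark, this helper could be passed to `rdd.map(...)` in place of `replaceFunc`, under the same assumption that every line yields the same number of fields.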
    

    【Discussion】:

      【Solution 2】:

      This might work with sep='","', for example:

      spark.read.csv('file.csv', sep='","')
      

      【Discussion】:

      • This throws an exception: the delimiter cannot be more than one character.