PySpark variable has datatype decimal(6,-12). df.dtypes and df.columns give error ValueError: Could not parse datatype: decimal(6,-12)

【Question title】: PySpark variable has datatype decimal(6,-12). df.dtypes and df.columns give error ValueError: Could not parse datatype: decimal(6,-12)
【Posted】: 2022-01-19 14:52:56
【Question description】:

I have a Spark DataFrame, and I get the error ValueError: Could not parse datatype: decimal(6,-12) whenever I run df.dtypes or df.columns, because one particular variable has the datatype decimal(6,-12).


    df = spark.read.csv("data.csv",inferSchema=True,header=True)  
    df.columns

Running df.columns or df.dtypes produces the following error:


    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-26-0581cf80a9b2> in <module>
    ----> 1 df.columns
    
    /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4623.11628701/lib/spark/python/pyspark/sql/dataframe.py in columns(self)
        934         ['age', 'name']
        935         """
    --> 936         return [f.name for f in self.schema.fields]
        937 
        938     @since(2.3)
    
    /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4623.11628701/lib/spark/python/pyspark/sql/dataframe.py in schema(self)
        251         if self._schema is None:
        252             try:
    --> 253                 self._schema = _parse_datatype_json_string(self._jdf.schema().json())
        254             except AttributeError as e:
        255                 raise Exception(
    
    /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4623.11628701/lib/spark/python/pyspark/sql/types.py in _parse_datatype_json_string(json_string)
        867     >>> check_datatype(complex_maptype)
        868     """
    --> 869     return _parse_datatype_json_value(json.loads(json_string))
        870 
        871 
    
    /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4623.11628701/lib/spark/python/pyspark/sql/types.py in _parse_datatype_json_value(json_value)
        884         tpe = json_value["type"]
        885         if tpe in _all_complex_types:
    --> 886             return _all_complex_types[tpe].fromJson(json_value)
        887         elif tpe == 'udt':
        888             return UserDefinedType.fromJson(json_value)
    
    /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4623.11628701/lib/spark/python/pyspark/sql/types.py in fromJson(cls, json)
        575     @classmethod
        576     def fromJson(cls, json):
    --> 577         return StructType([StructField.fromJson(f) for f in json["fields"]])
        578 
        579     def fieldNames(self):
    
    /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4623.11628701/lib/spark/python/pyspark/sql/types.py in <listcomp>(.0)
        575     @classmethod
        576     def fromJson(cls, json):
    --> 577         return StructType([StructField.fromJson(f) for f in json["fields"]])
        578 
        579     def fieldNames(self):
    
    /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4623.11628701/lib/spark/python/pyspark/sql/types.py in fromJson(cls, json)
        432     def fromJson(cls, json):
        433         return StructField(json["name"],
    --> 434                            _parse_datatype_json_value(json["type"]),
        435                            json["nullable"],
        436                            json["metadata"])
    
    /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4623.11628701/lib/spark/python/pyspark/sql/types.py in _parse_datatype_json_value(json_value)
        880             return DecimalType(int(m.group(1)), int(m.group(2)))
        881         else:
    --> 882             raise ValueError("Could not parse datatype: %s" % json_value)
        883     else:
        884         tpe = json_value["type"]
    
    ValueError: Could not parse datatype: decimal(6,-12)

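If I change the column type to double or string, I can proceed. But I am developing an automation tool and need a solution that can handle all datasets.

I tried the solution given in "df.columns is giving ValueError: in pyspark", shown below.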


    from pyspark.sql import SparkSession
    from pyspark.sql.types import *
    
    spark = SparkSession.builder.appName("basics").getOrCreate()
    df = spark.read.csv("data.csv",inferSchema=True,header=True)  
    for column_type in df.dtypes:
        if 'string' in column_type[1]:
            df = df.withColumn(column_type[0], df[column_type[0]].cast(StringType()))
        elif 'double' in column_type[1]:
            df = df.withColumn(column_type[0],df[column_type[0]].cast(DoubleType()))
        elif 'int' in column_type[1]:
            df = df.withColumn(column_type[0],df[column_type[0]].cast(IntegerType()))
        elif 'bool' in column_type[1]:
            df = df.withColumn(column_type[0], df[column_type[0]].cast(BooleanType()))
        elif 'decimal' in column_type[1]:
            df = df.withColumn(column_type[0],df[column_type[0]].cast(DoubleType()))
        # add as many conditions as you need for types
    
    df.schema

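Unfortunately, the df.dtypes call in this code raises the same error.

The only piece of code that lets me check the data types is df.printSchema(). Is there a way to read the output of df.printSchema() and change the columns of type decimal to type double?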


    df.select('variable_name').printSchema()
    
    root
     |-- variable_name: decimal(6,-12) (nullable = true)

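【Question discussion】:

Could you provide the actual code that causes your problem? It is very confusing if the only code to review is working code from another answer.
@TilPiffl I have now updated the code in my question.
What is the Spark version?
@MohanaBC 2.4.0

Tags: python dataframe apache-spark pyspark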

【Solution 1】:

In PySpark version [...], see this jira page.
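
The traceback in the question ends inside `_parse_datatype_json_value`, which suggests the Python-side type parser does not accept a negative scale. Here is a minimal sketch of that failure mode. Both patterns below are assumptions written for illustration and are not copied from the PySpark source; `parse_decimal` is a hypothetical helper.

```python
import re

# Assumed pattern: accepts only unsigned digits for precision and scale.
# It illustrates the failure mode; it is not copied from the PySpark source.
UNSIGNED_DECIMAL = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")

# Hypothetical relaxed pattern that also accepts a negative scale.
SIGNED_DECIMAL = re.compile(r"decimal\(\s*(\d+)\s*,\s*(-?\d+)\s*\)")


def parse_decimal(type_name, pattern):
    """Return (precision, scale), or raise ValueError as in the traceback."""
    match = pattern.fullmatch(type_name)
    if match is None:
        raise ValueError("Could not parse datatype: %s" % type_name)
    return int(match.group(1)), int(match.group(2))


print(parse_decimal("decimal(10,2)", UNSIGNED_DECIMAL))
print(parse_decimal("decimal(6,-12)", SIGNED_DECIMAL))
try:
    parse_decimal("decimal(6,-12)", UNSIGNED_DECIMAL)
except ValueError as err:
    print(err)
```

If this reading is right, the type itself is valid on the JVM side (printSchema() displays it), and only the Python-side parsing step rejects it.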

I think you need to disable inferSchema, create a custom schema, and apply it when reading the CSV.
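
A rough sketch of the custom-schema idea, assuming the reader accepts a DDL-formatted schema string (this should be the case for `spark.read.csv` in Spark 2.3 and later, but check it on your cluster). The helpers `build_ddl_schema` and `read_csv_with_schema` and the `id` column are hypothetical; only `variable_name` and `data.csv` come from the question. Column names are assumed not to contain backticks.

```python
def build_ddl_schema(column_types):
    """Build a DDL schema string such as '`id` INT, `amount` DOUBLE'."""
    return ", ".join(
        "`{}` {}".format(name, sql_type) for name, sql_type in column_types
    )


def read_csv_with_schema(spark, path, column_types):
    """Read a CSV with an explicit schema instead of inferSchema=True."""
    ddl = build_ddl_schema(column_types)
    # Passing a schema means Spark does not infer decimal(6,-12) for the column.
    return spark.read.csv(path, schema=ddl, header=True)


# Hypothetical column list for data.csv; the problem column is read as DOUBLE.
columns = [("id", "INT"), ("variable_name", "DOUBLE")]
print(build_ddl_schema(columns))
```

With an existing session this would be called as `df = read_csv_with_schema(spark, "data.csv", columns)`, after which `df.columns` should no longer hit the decimal parser.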

【Discussion】:

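Thanks for your answer. Disabling inferSchema and creating a custom schema would work. But this code is part of an automation tool, so I am looking for a solution that works for all datasets.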
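
For the automated case, one possible workaround is to read the schema JSON straight from the JVM object and cast the decimal columns before PySpark ever parses the schema. This is only a sketch. It relies on the private attribute `df._jdf` (the same call that appears in the traceback as `self._jdf.schema().json()`), so it may break between Spark versions. The function names are hypothetical, only top-level columns are handled, and the sample JSON is hand-written to resemble what Spark emits.

```python
import json


def decimal_column_names(schema_json):
    """Return names of top-level columns whose type string starts with 'decimal'."""
    fields = json.loads(schema_json)["fields"]
    return [
        f["name"]
        for f in fields
        # Nested types are dicts in the JSON; simple types are plain strings.
        if isinstance(f["type"], str) and f["type"].startswith("decimal")
    ]


def cast_decimals_to_double(df):
    """Cast every top-level decimal column of a PySpark DataFrame to double.

    Uses the private attribute df._jdf, so treat it as a workaround only.
    """
    schema_json = df._jdf.schema().json()  # JVM-side schema, not parsed by PySpark
    for name in decimal_column_names(schema_json):
        df = df.withColumn(name, df[name].cast("double"))
    return df


# Hand-written schema JSON in the shape Spark emits, used here for illustration.
sample = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "variable_name", "type": "decimal(6,-12)", "nullable": True, "metadata": {}},
    ],
})
print(decimal_column_names(sample))
```

Usage would be `df = cast_decimals_to_double(spark.read.csv("data.csv", inferSchema=True, header=True))`. Column names containing dots would need extra quoting, which this sketch does not handle.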