【Question title】: pyspark error when creating df from RDD: TypeError: Can not infer schema for type: <type 'float'>
【Posted】: 2016-09-28 22:37:06
【Question description】:

I am using the following code to convert my RDD to a DataFrame:

time_df = time_rdd.toDF(['my_time'])

and I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-40-ab9e3025f679> in <module>()
----> 1 time_df = time_rdd.toDF(['my_time'])

/usr/local/spark-latest/python/pyspark/sql/session.py in toDF(self, schema, sampleRatio)
     55         [Row(name=u'Alice', age=1)]
     56         """
---> 57         return sparkSession.createDataFrame(self, schema, sampleRatio)
     58 
     59     RDD.toDF = toDF

/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
    518 
    519         if isinstance(data, RDD):
--> 520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
    522             rdd, schema = self._createFromLocal(map(prepare, data), schema)

/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromRDD(self, rdd, schema, samplingRatio)
    358         """
    359         if schema is None or isinstance(schema, (list, tuple)):
--> 360             struct = self._inferSchema(rdd, samplingRatio)
    361             converter = _create_converter(struct)
    362             rdd = rdd.map(converter)

/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchema(self, rdd, samplingRatio)
    338 
    339         if samplingRatio is None:
--> 340             schema = _infer_schema(first)
    341             if _has_nulltype(schema):
    342                 for row in rdd.take(100)[1:]:

/usr/local/spark-latest/python/pyspark/sql/types.py in _infer_schema(row)
    987 
    988     else:
--> 989         raise TypeError("Can not infer schema for type: %s" % type(row))
    990 
    991     fields = [StructField(k, _infer_type(v), True) for k, v in items]

TypeError: Can not infer schema for type: <type 'float'>

Does anyone know what I am missing? Thanks!

【Comments】:

    Tags: apache-spark pyspark rdd spark-dataframe


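    【Solution 1】:

    Schema inference needs every record in the RDD to be row-like (a tuple, list, dict, or Row), not a bare value, so you should wrap each float in a one-element tuple, for example: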

    time_rdd.map(lambda x: (x, )).toDF(['my_time'])
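
To show why the tuple wrapper helps, here is a minimal sketch in plain Python that needs no Spark installation. The names `is_row_like` and `as_row` are hypothetical helpers, and the check is only a rough simplification of what the `_infer_schema` frame in the traceback does, not PySpark's actual code.

```python
def is_row_like(record):
    """Rough approximation of what schema inference accepts as a row.

    Assumption: simplified from the behaviour visible in the traceback;
    this is not PySpark's actual implementation.
    """
    return isinstance(record, (dict, tuple, list)) or hasattr(record, "__dict__")


def as_row(value):
    """Wrap a bare scalar in a one-element tuple; leave row-like records alone."""
    return value if is_row_like(value) else (value,)


# Local stand-in for the contents of time_rdd (made-up sample values).
sample = [1.5, 2.0, 3.25]
rows = [as_row(v) for v in sample]
print(rows)  # [(1.5,), (2.0,), (3.25,)]
```

With a real RDD the same helper could presumably be passed to `map`, as in `time_rdd.map(as_row).toDF(['my_time'])`, which is equivalent to the lambda above for float records.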
    

    【Discussion】:

      【Solution 2】:

      Check whether your time_rdd is actually an RDD.

      What do you get from:

      >>> type(time_rdd)
      
      >>> dir(time_rdd)
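
The traceback in the question already goes through the `isinstance(data, RDD)` branch of `createDataFrame`, so the type of the individual records is probably the more telling thing to inspect. Below is a small sketch, assuming you pull a few records with something like `time_rdd.take(5)`. The helper name `summarize_record_types` is hypothetical, and the sample list only stands in for that result.

```python
from collections import Counter


def summarize_record_types(records):
    """Count the Python type names found in a sample of records."""
    return Counter(type(r).__name__ for r in records)


# Stand-in for time_rdd.take(5); these values are made up for illustration.
sample = [1.5, 2.0, 3.25]
print(summarize_record_types(sample))  # Counter({'float': 3})
```

If the summary shows only `float` (or another scalar type) instead of `tuple`, `list`, `dict`, or `Row`, the records need to be wrapped before calling `toDF`.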
      

      【Discussion】:
