【Question title】: Fillna PySpark Dataframe with numpy array Error
【Posted】: 2017-06-06 17:34:52
【Question description】:

Below is a sample of my Spark DataFrame, with the printSchema output beneath it.

+--------------------+---+------+------+--------------------+
|           device_id|age|gender| group|                apps|
+--------------------+---+------+------+--------------------+
|-9073325454084204615| 24|     M|M23-26|                null|
|-8965335561582270637| 28|     F|F27-28|[1.0,1.0,1.0,1.0,...|
|-8958861370644389191| 21|     M|  M22-|[4.0,0.0,0.0,0.0,...|
|-8956021912595401048| 21|     M|  M22-|                null|
|-8910497777165914301| 25|     F|F24-26|                null|
+--------------------+---+------+------+--------------------+
only showing top 5 rows

root
 |-- device_id: long (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- group: string (nullable = true)
 |-- apps: vector (nullable = true)

I am trying to fill the null values in the "apps" column with np.zeros(19237). But when I run

df.fillna({'apps': np.zeros(19237)})

I get an error:

Py4JJavaError: An error occurred while calling o562.fill.
: java.lang.IllegalArgumentException: Unsupported value type java.util.ArrayList

Or, if I try

df.fillna({'apps': DenseVector(np.zeros(19237))})

I get an error:

AttributeError: 'numpy.ndarray' object has no attribute '_get_object_id'

Any ideas?

【Comments】:

    Tags: python apache-spark pyspark


    【Solution 1】:

    DataFrameNaFunctions supports only a subset of native (non-UDT) types, so fillna cannot fill a vector column; you need a UDF here.

    from pyspark.sql.functions import coalesce, col, udf
    from pyspark.ml.linalg import Vectors, VectorUDT
    
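    def zeros(n):
        # Zero-argument UDF that returns an all-zero sparse vector of size n.
        def zeros_():
            return Vectors.sparse(n, {})
        # Calling the UDF with no arguments yields a Column usable in coalesce.
        return udf(zeros_, VectorUDT())()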
    

    Example usage:

    df = spark.createDataFrame(
        [(1, Vectors.dense([1, 2, 3])), (2, None)],
        ("device_id", "apps"))
    
    df.withColumn("apps", coalesce(col("apps"), zeros(3))).show()
    
    +---------+-------------+
    |device_id|         apps|
    +---------+-------------+
    |        1|[1.0,2.0,3.0]|
    |        2|    (3,[],[])|
    +---------+-------------+
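
To make the row-wise behaviour of the coalesce call above easier to follow, here is a minimal plain-Python model of it. It does not use Spark. The names coalesce_rows and sparse_zeros are hypothetical helpers introduced only for illustration, and the (size, indices, values) tuple is an assumed stand-in for the sparse vector printed as (3,[],[]) in the output above.

```python
def sparse_zeros(n):
    # Hypothetical stand-in for Vectors.sparse(n, {}):
    # a (size, indices, values) triple with no non-zero entries.
    return (n, [], [])


def coalesce_rows(values, default_factory):
    # Model of coalesce(col, default) applied row by row:
    # keep each non-null value, replace None with a fresh default.
    return [v if v is not None else default_factory() for v in values]


# Same two rows as the example DataFrame: one dense vector, one null.
apps = [[1.0, 2.0, 3.0], None]
filled = coalesce_rows(apps, lambda: sparse_zeros(3))
print(filled)
```

Only the null row is replaced; existing vectors pass through unchanged, which is the behaviour the answer relies on. An empty sparse vector stores no entries, which is presumably why the answer uses it rather than a dense array of zeros.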
    

    【Comments】:
