【Posted】: 2017-06-06 17:34:52
【Question】:
Below is a sample of my Spark DataFrame, followed by its printSchema output:
+--------------------+---+------+------+--------------------+
|           device_id|age|gender| group|                apps|
+--------------------+---+------+------+--------------------+
|-9073325454084204615| 24|     M|M23-26|                null|
|-8965335561582270637| 28|     F|F27-28|[1.0,1.0,1.0,1.0,...|
|-8958861370644389191| 21|     M|  M22-|[4.0,0.0,0.0,0.0,...|
|-8956021912595401048| 21|     M|  M22-|                null|
|-8910497777165914301| 25|     F|F24-26|                null|
+--------------------+---+------+------+--------------------+
only showing top 5 rows
root
|-- device_id: long (nullable = true)
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- group: string (nullable = true)
|-- apps: vector (nullable = true)
I am trying to fill the null values in the 'apps' column with np.zeros(19237). But when I run
df.fillna({'apps': np.zeros(19237)})
I get an error:
Py4JJavaError: An error occurred while calling o562.fill.
: java.lang.IllegalArgumentException: Unsupported value type java.util.ArrayList
Or, if I try
df.fillna({'apps': DenseVector(np.zeros(19237))})
I get an error:
AttributeError: 'numpy.ndarray' object has no attribute '_get_object_id'
Any ideas?
【Comments】:
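One possible workaround, offered as a sketch rather than a confirmed fix: as far as I know, fillna only accepts simple scalar replacement values (numbers and strings, plus booleans in newer releases), which would explain why neither a NumPy array nor a DenseVector gets through. A UDF that substitutes a zero vector for null rows avoids fillna entirely. The names or_default and fill_null_vectors below are hypothetical helpers, and the code assumes the 'apps' column holds pyspark.ml vectors (the printSchema output does not say whether it is pyspark.ml or pyspark.mllib).

```python
def or_default(value, default):
    """Return `value`, or `default` when `value` is None (a SQL null)."""
    return default if value is None else value


def fill_null_vectors(df, column, size):
    """Replace nulls in a vector column with an all-zero vector of length `size`."""
    # Imported here so or_default can be exercised without a Spark installation.
    from pyspark.sql import functions as F
    # Assumption: the column holds pyspark.ml vectors. For the older RDD-based
    # API the equivalent classes live in pyspark.mllib.linalg.
    from pyspark.ml.linalg import Vectors, VectorUDT

    zeros = Vectors.dense([0.0] * size)
    # Python UDFs receive None for null inputs, so the check happens per row.
    fill = F.udf(lambda v: or_default(v, zeros), VectorUDT())
    return df.withColumn(column, fill(F.col(column)))


# Usage with the DataFrame from the question:
# df = fill_null_vectors(df, "apps", 19237)
```

If memory is a concern with 19237 dimensions, Vectors.sparse(size, {}) could be used in place of the dense zero vector, since it stores no explicit entries.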
Tags: python apache-spark pyspark