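【Posted】: 2015-10-12 14:45:52
【Question】:
I have a DataFrame in Apache Spark that holds arrays of integers; the source is a set of images. I ultimately want to run PCA on it, but I can't even create a matrix from my arrays. How do I create a matrix from an RDD?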
> imagerdd = traindf.map(lambda row: map(float, row.image))
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
Traceback (most recent call last):
  File "<ipython-input-21-6fdaa8cde069>", line 2, in <module>
    mat = DenseMatrix(numRows=206456, numCols=10, values=imagerdd)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
    values = self._convert_to_array(values, np.float64)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
    return np.asarray(array_like, dtype=dtype)
  File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
I get the same error with every variation I can think of:
imagerdd = traindf.map(lambda row: Vectors.dense(row.image))
imagerdd = traindf.map(lambda row: row.image)
imagerdd = traindf.map(lambda row: np.array(row.image))
If I pass the DataFrame column instead:
> imagedf = traindf.select("image")
> mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)
Traceback (most recent call last):
  File "<ipython-input-26-a8cbdad10291>", line 2, in <module>
    mat = DenseMatrix(numRows=206456, numCols=10, values=imagedf)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 815, in __init__
    values = self._convert_to_array(values, np.float64)
  File "/usr/local/spark/current/python/lib/pyspark.zip/pyspark/mllib/linalg.py", line 806, in _convert_to_array
    return np.asarray(array_like, dtype=dtype)
  File "/usr/local/python/conda/lib/python2.7/site-packages/numpy/core/numeric.py", line 462, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
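Both tracebacks end in the same `np.asarray(array_like, dtype=dtype)` call. `DenseMatrix` is a local matrix, so its `values` argument has to be a flat, local sequence of `numRows * numCols` numbers in column-major order, not an RDD or a DataFrame. Below is a minimal sketch of that flattening step in plain Python, assuming the rows have already been brought to the driver (for example with `imagerdd.collect()`, which is only reasonable if the data fits in memory). The names `to_column_major` and `local_rows` are hypothetical and not part of pyspark.

```python
def to_column_major(rows):
    """Flatten equal-length rows into (num_rows, num_cols, values),
    where values is a flat list of floats in column-major order."""
    rows = [[float(x) for x in row] for row in rows]
    if not rows:
        return 0, 0, []
    num_cols = len(rows[0])
    if any(len(row) != num_cols for row in rows):
        raise ValueError("all rows must have the same length")
    # Column-major: walk each column top to bottom before moving to the next.
    values = [row[j] for j in range(num_cols) for row in rows]
    return len(rows), num_cols, values


# Hypothetical stand-in for the result of imagerdd.collect()
local_rows = [[1, 2, 3], [4, 5, 6]]
num_rows, num_cols, values = to_column_major(local_rows)
print(num_rows, num_cols, values)
```

The result could then be passed as `DenseMatrix(num_rows, num_cols, values)`. For 206456 rows that should stay distributed, a distributed type such as `RowMatrix` from `pyspark.mllib.linalg.distributed`, built from an RDD of vectors in Spark versions that ship it for Python, is probably a better fit, though that is untested here.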
【Discussion】:
Tags: python matrix apache-spark pyspark apache-spark-mllib