Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

【Question title】: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
【Posted】: 2017-04-25 17:52:16
【Question】:

Given my pyspark Row object:

>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>

However, row.features fails the isinstance(row.features, Vector) test:

>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False

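This strange error is causing me a great deal of trouble. Unless isinstance(row.features, Vector) passes, I cannot use a map function to generate LabeledPoint objects. I would be grateful if anyone could solve this.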

【Comments】:

Tags: apache-spark pyspark apache-spark-sql apache-spark-mllib apache-spark-ml

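【Answer 1】:

If you only want to convert the SparseVectors from pyspark.ml into pyspark.mllib SparseVectors, you can use MLUtils. Say df is your DataFrame and the column holding the SparseVectors is named "features". The following lines then do the conversion: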

from pyspark.mllib.util import MLUtils
df = MLUtils.convertVectorColumnsFromML(df, "features")

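I ran into this problem because, when using CountVectorizer from pyspark.ml.feature, I could not create LabeledPoints: they are incompatible with the SparseVector from pyspark.ml.

I wondered why CountVectorizer in the latest documentation does not use the "new" SparseVector class. Since the classification algorithms need LabeledPoints, this made no sense to me...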

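UPDATE: I had misunderstood. The ml library is designed for DataFrame objects, while the mllib library is designed for RDD objects. Since Spark 2.0 the DataFrame data structure is recommended, because SparkSession is more compatible than SparkContext (though it still holds a SparkContext object) and delivers DataFrames instead of RDDs. I found this post, which gave me the "aha" effect: mllib and ml. Thanks Alberto Bonsanto :).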

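To use, for example, NaiveBayes from mllib, I had to transform my DataFrame into the LabeledPoint objects that the mllib NaiveBayes expects.

But it is much easier to use NaiveBayes from ml, because you do not need LabeledPoints at all: you simply specify the feature column and the class (label) column of your DataFrame.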

PS: I struggled with this problem for hours, so I felt I needed to post it here :)

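【Comments】:

pyspark feels like a good example of how not to design an API, tbh: it is somehow both unintuitive and unstable. Usually an API is unintuitive but at least stable, because fixes are deemed not worth it, or unstable but at least intuitive, because the fixes are what keep changing it. I guess this is the downside of being a second-class citizen behind the Java implementation, so it is left with language-translation packages.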

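【Answer 2】:

It is rather unlikely that this is a bug. You did not provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and you are comparing the wrong entities.

Let's illustrate this with an example. Simple data: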

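from pyspark.ml.feature import OneHotEncoder

# Assumes an existing SparkContext `sc` (e.g. the pyspark shell). Spark 2.x API:
# from Spark 3.0 on, OneHotEncoder is an Estimator and needs .fit() first.
row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0, )]).toDF(["x"])
).first()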

Now let's import the different vector classes:

from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint

and run the tests:

isinstance(row.features, MLLibVector)

False

isinstance(row.features, MLVector)

True

As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:

LabeledPoint(0.0, row.features)

TypeError                                 Traceback (most recent call last)
...
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

You can convert an ML object into an MLLib one:

from pyspark.ml import linalg as ml_linalg

def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))

LabeledPoint(0, as_mllib(row.features))

LabeledPoint(0.0, (1,[],[]))

or simply:

LabeledPoint(0, MLLibVectors.fromML(row.features))

LabeledPoint(0.0, (1,[],[]))

but generally speaking, you should avoid situations where this conversion is necessary.

【Comments】: