【Question title】: How to convert a DataFrame to an Array of dense vectors?
【Posted】: 2022-01-21 10:03:57
【Question】:

How would I convert the following DataFrame

val df = Seq(
  (5.0, 1.0, 1.0, 3.0, 7.0),
  (2.0, 0.0, 3.0, 4.0, 5.0),
  (4.0, 0.0, 0.0, 6.0, 7.0)).toDF("m1", "m2", "m3", "m4", "m5")
//df: res166: org.apache.spark.sql.DataFrame = [m1: int, m2: int ... 3 more fields]

into an Array of dense vectors?

val arrayDenseVectors = Array(
      Vectors.dense(5.0, 1.0, 1.0, 3.0, 7.0),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
//arrayDenseVectors: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,1.0,1.0,3.0,7.0], [2.0,0.0,3.0,4.0,5.0], [4.0,0.0,0.0,6.0,7.0])

To complicate matters further, the df columns are of type Int rather than Double.

【Comments】:

    Tags: dataframe scala apache-spark vector apache-spark-mllib


    【Solution 1】:

    Using map on the RDD, you can convert each row to a Vector and then collect the results into an array:

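    import org.apache.spark.mllib.linalg.Vectors
    
    // Read every column (0 until r.length, not just 0 to 3) as a Number,
    // so that both Int and Double columns are converted to Double.
    val arrayDenseVectors = df.rdd.map { r =>
      Vectors.dense((0 until r.length).map(i => r.getAs[Number](i).doubleValue).toArray)
    }.collect
    
    //arrayDenseVectors: Array[org.apache.spark.mllib.linalg.Vector] = Array([5.0,1.0,1.0,3.0,7.0], [2.0,0.0,3.0,4.0,5.0], [4.0,0.0,0.0,6.0,7.0])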
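
For the Int-column case mentioned in the question, one possible variant of the RDD map approach is to cast every column to double on the DataFrame side first, so that each field can be read with getDouble. The following is a minimal self-contained sketch, assuming Spark SQL and spark-mllib are on the classpath. The object name DenseVectorsSketch, the helper toDenseVectors, and the local[*] master are illustrative choices and not part of the original answer.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DenseVectorsSketch {

  // Hypothetical helper: casts all columns to double, then builds one
  // dense vector per row and collects the result to the driver.
  def toDenseVectors(df: DataFrame): Array[Vector] = {
    val doubles = df.select(df.columns.map(c => col(c).cast("double")): _*)
    doubles.rdd
      .map(r => Vectors.dense(Array.tabulate(r.length)(r.getDouble)))
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dense-vectors-sketch")
      .getOrCreate()
    import spark.implicits._

    // Int columns, as described in the question.
    val df = Seq(
      (5, 1, 1, 3, 7),
      (2, 0, 3, 4, 5),
      (4, 0, 0, 6, 7)).toDF("m1", "m2", "m3", "m4", "m5")

    toDenseVectors(df).foreach(println)
    spark.stop()
  }
}
```

Note that collect brings every row to the driver, so this is only suitable when the DataFrame is small enough to fit in driver memory. The cast assumes all columns are numeric; a non-numeric column would be cast to null and getDouble would then fail on it.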
    

    【Discussion】:
