[Posted]: 2020-09-18 05:33:46
[Question]:
I am using Spark with Scala to compute the cosine similarity between the rows of a DataFrame.
The DataFrame schema is as follows:
root
|-- id: long (nullable = true)
|-- features: vector (nullable = true)
A sample of the DataFrame:
+---+--------------------+
| id| features|
+---+--------------------+
| 65|(10000,[48,70,87,...|
|191|(10000,[1,73,77,1...|
+---+--------------------+
The code that produces this result is below:
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{collect_list, udf}
import spark.implicits._ // already in scope in spark-shell

val df = spark.read.json("articles_line.json")
// Split the "desc" text into words
val tokenizer = new Tokenizer().setInputCol("desc").setOutputCol("words")
val wordsDF = tokenizer.transform(df)
// Merge the word lists of all rows that share an id
val flattenWords = udf( (s: Seq[Seq[String]]) => s.flatMap(identity) )
val groupedDF = wordsDF.groupBy("id").
  agg(flattenWords(collect_list("words")).as("grouped_words"))
// Term frequencies hashed into 10000 buckets, then rescaled by IDF
val hashingTF = new HashingTF().
  setInputCol("grouped_words").setOutputCol("rawFeatures").setNumFeatures(10000)
val featurizedData = hashingTF.transform(groupedDF)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
val asDense = udf((v: Vector) => v.toDense) // convert to a dense vector
val newDf = rescaledData.select('id, 'features).
  withColumn("dense_features", asDense($"features"))
The final DataFrame looks like this:
+-----+--------------------+--------------------+
| id| features| dense_features|
+-----+--------------------+--------------------+
|21209|(10000,[128,288,2...|[0.0,0.0,0.0,0.0,...|
|21223|(10000,[8,18,32,4...|[0.0,0.0,0.0,0.0,...|
+-----+--------------------+--------------------+
I don't understand how to process "dense_features" to compute the cosine similarity. This article did not work for me. Any help is appreciated.
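One possible direction, offered as a rough sketch rather than a confirmed answer: L2-normalize each vector with Spark ML's Normalizer, self-join the DataFrame, and take the dot product of each pair, since the dot product of two unit vectors equals their cosine similarity. The object name CosineSketch, the method pairwiseCosine, the column names id_a, id_b, norm and cosine, and the three sample rows are all hypothetical. The sketch works directly on the sparse "features" column, so under this approach the dense conversion would not be needed. It assumes Spark 2.x or later on the classpath.

```scala
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object CosineSketch {
  // Dot product that only visits the active (stored) entries of `a`.
  private val dot = udf { (a: Vector, b: Vector) =>
    var sum = 0.0
    a.foreachActive((i, v) => sum += v * b(i))
    sum
  }

  /** Expects a long column "id" and an ML vector column "features" (assumed names). */
  def pairwiseCosine(df: DataFrame): DataFrame = {
    // Scale every vector to unit L2 length; all-zero vectors stay all-zero.
    val normalized = new Normalizer()
      .setInputCol("features").setOutputCol("norm").setP(2.0)
      .transform(df)

    val left  = normalized.select(col("id").as("id_a"), col("norm").as("norm_a"))
    val right = normalized.select(col("id").as("id_b"), col("norm").as("norm_b"))

    // id_a < id_b keeps each unordered pair once and drops self-pairs.
    left.join(right, col("id_a") < col("id_b"))
      .select(col("id_a"), col("id_b"),
        dot(col("norm_a"), col("norm_b")).as("cosine"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("cosine-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the id/features columns of rescaledData.
    val sample = Seq(
      (1L, Vectors.sparse(5, Array(0, 3), Array(3.0, 4.0))),
      (2L, Vectors.sparse(5, Array(0, 3), Array(6.0, 8.0))),
      (3L, Vectors.sparse(5, Array(1), Array(2.0)))
    ).toDF("id", "features")

    pairwiseCosine(sample).show(false)
    spark.stop()
  }
}
```

With the real data this would be called as CosineSketch.pairwiseCosine(rescaledData.select("id", "features")). The join compares every pair of rows, so the cost grows quadratically with the number of ids; for large inputs some form of blocking or approximate search would likely be needed.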
An example of one row of dense_features, truncated for brevity.
[[0.0,0.0,0.0,0.0,7.08,0.0,0.0,0.0,0.0,2.24,0.0,0.0,0.0,0.0,0.0,,9.59]]
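For reference, the cosine similarity of two dense rows like the one above is their dot product divided by the product of their Euclidean norms. The following is a minimal plain-Scala sketch of that definition with no Spark dependency; the object name Cosine, the function name cosineSimilarity, and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```scala
object Cosine {
  /** Cosine similarity of two equal-length vectors.
    * Returns 0.0 when either vector is all zeros (an assumed convention).
    */
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    var i = 0
    while (i < a.length) {
      dot += a(i) * b(i)     // accumulate the dot product
      normA += a(i) * a(i)   // accumulate squared norm of a
      normB += b(i) * b(i)   // accumulate squared norm of b
      i += 1
    }
    if (normA == 0.0 || normB == 0.0) 0.0
    else dot / (math.sqrt(normA) * math.sqrt(normB))
  }
}
```

Given two Spark ML vectors u and v, this could be called as Cosine.cosineSimilarity(u.toArray, v.toArray). Because toArray materializes all 10000 entries, working on the sparse representation is likely cheaper for TF-IDF vectors that are mostly zeros.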
[Comments]:
-
Can we get a full row of data?
.show(false). I would say the dense version is probably correct; it just starts with all zeros. -
Added an example.
Tags: scala apache-spark-sql tf-idf cosine-similarity