【发布时间】:2016-07-22 05:23:09
【问题描述】:
到目前为止,我已经能够对我的所有文档进行标记,并使用 Spark 的 MLLib 中的 CountVectorizer 和 IDF。我正在尝试从每个文档中获取前 50 个单词,但我不确定如何对 IDF 的输出进行排序。
onePer 是文档 ID 和标记化文档的数据框。
val tf = new CountVectorizer()
.setInputCol("text")
.setOutputCol("features").fit(onePer)
.transform(onePer).select("features").rdd
.map{x:Row => x.getAs[Vector](0)}
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
这就是我的输出的样子(词汇中的单词数、单词的 id、单词分数)。我想按分数排序,得到前k:
(440,[0,2,3,4,5,6,7,8,9,10,12,15,17,18,19,22,23,24,25,26,27,28,30,31,32,33,34,35,39,41,43,45,47,49,51,52,53,55,57,63,66,69,70,71,74,76,79,80,83,84,85,88,94,95,96,97,99,102,106,107,109,111,117,120,121,124,127,128,129,138,142,145,146,149,154,156,164,166,167,170,171,176,187,189,199,203,204,217,218,219,232,234,236,237,238,240,248,250,251,254,259,263,265,267,280,291,296,302,304,309,319,322,328,333,347,361,364,371,375,384,388,393,395,401,403,433,438,439],[1.3559553712291716,3.9422868018213513,0.6369074622370692,7.795697904781566,3.153829441457081,0.0,5.519201522549892,0.3184537311185346,0.3184537311185346,1.3559553712291716,0.4519851237430572,0.4519851237430572,0.6061358035703155,1.0116009116784799,0.4519851237430572,0.7884573603642703,0.4519851237430572,2.0232018233569597,0.7884573603642703,8.523740461192126,0.6061358035703155,0.6061358035703155,0.6061358035703155,0.6061358035703155,0.7884573603642703,0.6061358035703155,0.6061358035703155,0.6061358035703155,0.7884573603642703,0.7884573603642703,1.0116009116784799,1.0116009116784799,2.0232018233569597,0.7884573603642703,0.7884573603642703,3.897848952390783,0.7884573603642703,0.7884573603642703,1.0116009116784799,5.114244276715276,1.0116009116784799,1.0116009116784799,2.5985659682605218,1.2992829841302609,1.2992829841302609,1.0116009116784799,1.0116009116784799,1.0116009116784799,1.0116009116784799,1.0116009116784799,2.5985659682605218,1.0116009116784799,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,3.4094961844768505,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,3.4094961844768505,1.2992829841302609,1.2992829841302609,1.2992829841302609,3.4094961844768505,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253,1.7047480922384253])
更新
我能够通过执行以下操作来完成这项工作:
tfidf.map(x => x.toSparse).map{x => x.indices.zip(x.values)
.sortBy(-_._2)
.take(10)
.map(_._1)
}
【问题讨论】:
-
为什么不使用 HashingTF 而不是 CountVectorizer 并设置要保留的特征数量?
-
我需要能够映射回原始单词,所以很遗憾,哈希对我来说不是一个选项。
-
如果您使用 spark-ml,您可以使用 DataFrame 保留这些信息。不幸的是,我现在无法访问计算机来编写示例代码。
-
@eliasah 我应该在 DataFrame 中保留哪些信息?如果您指的是单词到哈希的映射,我对没有冲突有严格的要求。不过,我关心的仍然是如何对 IDF 的值进行排序。
-
我理解您的担忧。 @tuxda 的回答应该可以解决你的问题。
标签: scala apache-spark apache-spark-mllib