【Posted】: 2016-01-19 07:00:32
【Question】:
So far, I have implemented the LDA algorithm with Apache Spark using the following Scala code:
// Filter out stopwords
val stopwords: Array[String] = sc.textFile("data/english_stops_words.txt").collect()
val filteredTokens = new StopWordsRemover()
.setStopWords(stopwords)
.setCaseSensitive(false)
.setInputCol("words")
.setOutputCol("filtered")
.transform(tokens)
// Limit to top `vocabSize` most common words and convert to word count vector features
val cvModel = new CountVectorizer()
.setInputCol("filtered")
.setOutputCol("features")
.setVocabSize(vocabSize)
.fit(filteredTokens)
val countVectors = cvModel.transform(filteredTokens)
.select("docId", "features")
.map { case Row(docId: Long, countVector: Vector) => (docId, countVector) }
.cache()
After that, I converted this code from Scala to the Java API:
// Filter out stopwords
List<String> stopwords = sc.textFile("data/english_stops_words.txt")
.collect();
DataFrame filteredTokens = new StopWordsRemover()
.setStopWords(stopwords.toArray(new String[0]))
.setCaseSensitive(false).setInputCol("words")
.setOutputCol("filtered").transform(tokens);
// Limit to top `vocabSize` most common words and convert to word count
// vector features
CountVectorizerModel cvModel = new CountVectorizer()
.setInputCol("filtered").setOutputCol("features")
.setVocabSize(vocabSize).fit(filteredTokens);
JavaRDD<TextId> countVectors = cvModel.transform(filteredTokens)
.select("docId", "features").toJavaRDD()
.map(new Function<Row, TextId>() {
private static final long serialVersionUID = 1L;
@Override
public TextId call(Row row) throws Exception {
return new TextId(row.get(0).toString(), Long.parseLong(row.get(1).toString()));
}
}).cache();
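However, the LDA model's run() function only accepts a JavaPairRDD argument. I got stuck trying to turn countVectors into a JavaPairRDD, which the Scala code does without any trouble. If you have another solution, please help me. Thank you very much.
Edit: I have changed my code: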
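JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens)
.select("docId", "features").toJavaRDD()
.mapToPair(new PairFunction<Row, Long, Vector>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<Long, Vector> call(Row row) throws Exception {
// "features" already holds a Vector, so cast it instead of calling getDouble();
// parsing docId via toString() works whether the column is a Long or a String.
return new Tuple2<Long, Vector>(Long.parseLong(row.get(0).toString()), (Vector) row.get(1));
}
}).cache();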
Many thanks to @Till Rohrmann. But after running the program, I get this exception:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.Column.as(Ljava/lang/String;Lorg/apache/spark/sql/types/Metadata;)Lorg/apache/spark/sql/Column;
at org.apache.spark.ml.feature.StopWordsRemover.transform(StopWordsRemover.scala:144)
Can you help me solve this problem?
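A NoSuchMethodError at runtime generally means that the class found on the classpath is a different version from the one the calling code was compiled against. Here that would suggest the spark-sql jar does not match the spark-mllib jar, though this is only a guess from the stack trace. A minimal sketch for checking this with plain reflection follows; the class name ClasspathCheck and its helpers hasMethod and locationOf are hypothetical names made up for this illustration, not part of Spark.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of Spark or any library.
class ClasspathCheck {

    // Returns the jar or directory a class was loaded from.
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap classpath or unknown)";
        }
        return src.getLocation().toString();
    }

    // Loads a class by name without running its static initializers.
    private static Class<?> load(String name) throws ClassNotFoundException {
        return Class.forName(name, false, ClasspathCheck.class.getClassLoader());
    }

    // True if className has a public method with the given name and parameter types.
    public static boolean hasMethod(String className, String methodName, String... paramTypeNames) {
        try {
            Class<?>[] params = new Class<?>[paramTypeNames.length];
            for (int i = 0; i < params.length; i++) {
                params[i] = load(paramTypeNames[i]);
            }
            load(className).getMethod(methodName, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The method named in the exception message above.
        String column = "org.apache.spark.sql.Column";
        boolean present = hasMethod(column, "as",
                "java.lang.String", "org.apache.spark.sql.types.Metadata");
        System.out.println("Column.as(String, Metadata) present: " + present);
        try {
            System.out.println(column + " loaded from: " + locationOf(load(column)));
        } catch (ClassNotFoundException | LinkageError e) {
            System.out.println(column + " is not on the classpath: " + e);
        }
    }
}
```

Run with the same classpath as the failing job. If the method is reported as missing, the printed jar location shows which spark-sql artifact is actually being picked up, and its version can then be compared with the spark-mllib version in the build file.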
【Comments】:
Tags: scala apache-spark lda