[Posted]: 2017-12-11 11:54:32
[Question]:
Is there any relationship between numFeatures in Spark MLlib's HashingTF and the actual number of terms in a document (sentence)?
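import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// `spark` is assumed to be an existing SparkSession.
List<Row> data = Arrays.asList(
    RowFactory.create(0.0, "Hi I heard about Spark"),
    RowFactory.create(0.0, "I wish Java could use case classes"),
    RowFactory.create(1.0, "Logistic regression models are neat")
);
StructType schema = new StructType(new StructField[]{
    new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceData = spark.createDataFrame(data, schema);
// Split each sentence into lowercase, whitespace-separated words.
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> wordsData = tokenizer.transform(sentenceData);
// Hash each word into one of numFeatures buckets and count occurrences.
int numFeatures = 20;
HashingTF hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("rawFeatures")
    .setNumFeatures(numFeatures);
Dataset<Row> featurizedData = hashingTF.transform(wordsData);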
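As the Spark MLlib documentation states, HashingTF converts each sentence into a feature vector of length numFeatures. What happens if each document (here, each sentence) contains thousands of terms? What should the value of numFeatures be, and how should it be calculated?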
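To make the question concrete, here is a minimal sketch of the hashing trick in plain Java, with no Spark dependency. It is not Spark's implementation: the class `HashingTrickSketch` and its methods are hypothetical, and `String.hashCode()` stands in for whatever hash function HashingTF uses internally. It illustrates that the output length always equals numFeatures regardless of how many terms a document has, and that distinct terms start sharing buckets (colliding) when numFeatures is small relative to the vocabulary.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashingTrickSketch {

    // Map a term to a bucket index in [0, numFeatures).
    // Assumption: String.hashCode() is only a stand-in for Spark's hash function.
    public static int bucketOf(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Build a term-frequency vector whose length is always numFeatures,
    // no matter how many terms the document contains.
    public static double[] termFrequencies(List<String> terms, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String term : terms) {
            tf[bucketOf(term, numFeatures)] += 1.0;
        }
        return tf;
    }

    // Count how many distinct terms were forced to share a bucket:
    // (number of distinct terms) - (number of buckets actually used).
    public static int collisions(List<String> terms, int numFeatures) {
        Set<String> distinctTerms = new HashSet<>(terms);
        Set<Integer> usedBuckets = new HashSet<>();
        for (String term : distinctTerms) {
            usedBuckets.add(bucketOf(term, numFeatures));
        }
        return distinctTerms.size() - usedBuckets.size();
    }

    public static void main(String[] args) {
        List<String> words =
            Arrays.asList("logistic regression models are neat".split(" "));
        // Compare a tiny, a small, and a large feature dimension.
        for (int numFeatures : new int[]{4, 20, 1 << 18}) {
            double[] tf = termFrequencies(words, numFeatures);
            System.out.println("numFeatures=" + numFeatures
                + " vectorLength=" + tf.length
                + " collisions=" + collisions(words, numFeatures));
        }
    }
}
```

With five distinct words and only four buckets, at least one collision is guaranteed. Under this sketch, numFeatures is better thought of in relation to the size of the whole vocabulary than to the length of any single document, since a larger value lowers the chance that unrelated terms are merged into the same feature.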
[Discussion]:
Tags: apache-spark machine-learning apache-spark-mllib tf-idf