Multilabel classification using the NaiveBayes classifier in Spark (answers)

【Question title】: Multilabel Classification using NaiveBayes Classifier in Spark
【Posted】: 2017-05-04 02:09:24
【Question】:

My data is in the following format:
blah sentence one --> label1, label2
blah sentence two --> label2, label4
blah sentence three --> label3

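How can I use OneVsRestClassifier with NaiveBayesClassifier in Spark (that is, how should my data be structured)? For multiclass classification with NaiveBayes, the LabeledPoint class holds a single label and a feature vector. In the case above, where one example can carry several labels, how should the data be constructed?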

Tags: scala apache-spark apache-spark-mllib naivebayes

【Solution 1】:

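Construct the data as usual (LabeledPoint), but use several classifiers (e.g. OneVsRest), one per label, and switch the data passed to each classifier according to each example's set of labels. Another solution is to get the probabilities of all classes instead of only the most probable one (predict(p.features())):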

Vector prediction = model.predictProbabilities(p.features());

Then filter by a threshold to keep the most probable predictions.
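
To make both ideas concrete, here is a minimal plain-Java sketch with no Spark dependency. It is only an illustration under stated assumptions, not Spark's API: the class MultiLabelSketch, the methods binaryTargets and labelsAboveThreshold, the probability values and the 0.3 threshold are all made up for this example. It assumes the probability vector is ordered the same way as the label names, which you would need to check against your model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class for illustration; nothing here comes from Spark.
final class MultiLabelSketch {

    // Idea 1: one binary problem per label.
    // Returns 1.0 where the example carries `label`, else 0.0. Each array would be
    // paired with the unchanged feature vectors to build that label's LabeledPoints.
    static double[] binaryTargets(List<Set<String>> labelSets, String label) {
        double[] targets = new double[labelSets.size()];
        for (int i = 0; i < targets.length; i++) {
            targets[i] = labelSets.get(i).contains(label) ? 1.0 : 0.0;
        }
        return targets;
    }

    // Idea 2: keep every label whose probability reaches the threshold.
    // Assumes probabilities[i] belongs to labels[i].
    static List<String> labelsAboveThreshold(double[] probabilities,
                                             String[] labels,
                                             double threshold) {
        if (probabilities.length != labels.length) {
            throw new IllegalArgumentException("probabilities and labels differ in length");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            if (probabilities[i] >= threshold) {
                kept.add(labels[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Label sets of the three sentences from the question.
        List<Set<String>> labelSets = new ArrayList<>();
        labelSets.add(new HashSet<>(Arrays.asList("label1", "label2")));
        labelSets.add(new HashSet<>(Arrays.asList("label2", "label4")));
        labelSets.add(new HashSet<>(Arrays.asList("label3")));

        // Targets for the "label2 vs. rest" classifier.
        System.out.println(Arrays.toString(binaryTargets(labelSets, "label2")));
        // prints [1.0, 1.0, 0.0]

        // Made-up class probabilities for one example (not real model output).
        double[] probabilities = {0.55, 0.35, 0.04, 0.06};
        String[] labels = {"label1", "label2", "label3", "label4"};
        System.out.println(labelsAboveThreshold(probabilities, labels, 0.3));
        // prints [label1, label2]
    }
}
```

With the first approach you would train one NaiveBayes model per label on its 0/1 targets and collect the labels whose model predicts 1.0. With the second you train a single model and apply something like labelsAboveThreshold to the array form of the vector returned by predictProbabilities. The threshold is a tuning choice, not a value taken from this thread.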
