【Question title】: CountVectorizerModel error with apache spark - Java API
【Posted】: 2016-01-21 23:56:36
【Question description】:

I am following the example code from the Apache Spark documentation: https://spark.apache.org/docs/latest/ml-features.html#countvectorizer

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.ml.feature.CountVectorizer;
    import org.apache.spark.ml.feature.CountVectorizerModel;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.*;

    public class CountVectorizer_Demo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("LDA Online").setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(conf);

            SQLContext sqlContext = new SQLContext(sc);

            // Input data: each row is a bag of words from a sentence or document.
            JavaRDD<Row> jrdd = sc.parallelize(Arrays.asList(
              RowFactory.create(Arrays.asList("a", "b", "c")),
              RowFactory.create(Arrays.asList("a", "b", "b", "c", "a"))
            ));
            StructType schema = new StructType(new StructField[] {
              new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
            });
            DataFrame df = sqlContext.createDataFrame(jrdd, schema);

            // Fit a CountVectorizerModel from the corpus.
            CountVectorizerModel cvModel = new CountVectorizer()
              .setInputCol("text")
              .setOutputCol("feature")
              .setVocabSize(3)
              .setMinDF(2) // a term must appear in at least 2 documents to be included in the vocabulary
              .fit(df);

            // Alternatively, define a CountVectorizerModel with an a-priori vocabulary.
            CountVectorizerModel cvm = new CountVectorizerModel(new String[]{"a", "b", "c"})
              .setInputCol("text")
              .setOutputCol("feature");

            cvModel.transform(df).show();
        }
    }

But I get this error message:

    15/10/22 23:04:20 INFO BlockManagerMasterActor: Registering block manager localhost:56882 with 703.6 MB RAM, BlockManagerId(, localhost, 56882)
    15/10/22 23:04:20 INFO BlockManagerMaster: Registered BlockManager
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/InternalRow
        at org.apache.spark.ml.feature.CountVectorizerParams$class.validateAndTransformSchema(CountVectorizer.scala:72)
        at org.apache.spark.ml.feature.CountVectorizer.validateAndTransformSchema(CountVectorizer.scala:107)
        at org.apache.spark.ml.feature.CountVectorizer.transformSchema(CountVectorizer.scala:168)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:62)
        at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:130)
        at main.CountVectorizer_Demo.main(CountVectorizer_Demo.java:39)
    Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.InternalRow
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 6 more

Thanks in advance.

【Question discussion】:

    Tags: java apache-spark apache-spark-mllib


    【Solution 1】:

    Thank you all very much. I solved my problem by adding this dependency:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-catalyst_2.10</artifactId>
        <version>1.5.1</version>
    </dependency>
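
A quick way to confirm that a dependency change like this took effect is to check whether the class named in the `ClassNotFoundException` is visible on the classpath before calling `fit`. The following is only a minimal sketch and not part of the original answer: the class name `ClasspathCheck` and the helper `isOnClasspath` are hypothetical names chosen for illustration, and it uses nothing beyond the standard `Class.forName` lookup.

```java
public class ClasspathCheck {

    // Hypothetical helper: returns true if the named class can be located
    // by the class loader that loaded this application.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class reported as missing in the stack trace above
        String missing = "org.apache.spark.sql.catalyst.InternalRow";
        System.out.println(missing + " on classpath: " + isOnClasspath(missing));
    }
}
```

If the check still prints `false` after the dependency is added, one thing worth looking at (an assumption, since the question does not show the full `pom.xml`) is whether all Spark artifacts in the build use the same Spark and Scala versions.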
    
