【Question title】: Mix Spark MLlib and Spark NLP in a pipeline
【Posted】: 2021-11-27 19:11:24
【Question】:

In a Spark MLlib pipeline, how can I chain a CountVectorizer (from Spark ML) after a Stemmer (from Spark NLP)?

When I try to use both in the same pipeline, I get:

myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.

Regards,

【Discussion】:

    Tags: scala apache-spark apache-spark-mllib johnsnowlabs-spark-nlp


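    【Solution 1】:

    You need to add a Finisher to your Spark NLP pipeline. With setOutputAsArray(true) it turns the annotation column into a plain array<string> column, which is the type CountVectorizer expects. Try this: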

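      import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
      import com.johnsnowlabs.nlp.annotator._ // SentenceDetector, Tokenizer, Stemmer
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.feature.CountVectorizer
      // Assumes `spark` is an active SparkSession (as in spark-shell); needed for toDF.
      import spark.implicits._

      // Spark NLP stages: raw text -> document -> sentences -> tokens -> stems
      val documentAssembler =
        new DocumentAssembler().setInputCol("text").setOutputCol("document")
      val sentenceDetector =
        new SentenceDetector().setInputCols("document").setOutputCol("sentences")
      val tokenizer =
        new Tokenizer().setInputCols("sentences").setOutputCol("token")
      val stemmer = new Stemmer()
        .setInputCols("token")
        .setOutputCol("stem")

      // Finisher: converts the "stem" annotations into an array<string> column
      val finisher = new Finisher()
        .setInputCols("stem")
        .setOutputCols("token_features")
        .setOutputAsArray(true)
        .setCleanAnnotations(false) // keep the intermediate annotation columns

      // Spark ML stage: consumes the array<string> column produced by the Finisher
      val cv = new CountVectorizer()
        .setInputCol("token_features")
        .setOutputCol("features")

      val pipeline = new Pipeline()
        .setStages(
          Array(
            documentAssembler,
            sentenceDetector,
            tokenizer,
            stemmer,
            finisher,
            cv
          ))

      val data =
        Seq("Peter Pipers employees are picking pecks of pickled peppers.")
          .toDF("text")

      val model = pipeline.fit(data)
      val df = model.transform(data)
      df.select("features").show(false)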
    

    Output:

    +--------------------------------------------------------------------+
    |features                                                            |
    +--------------------------------------------------------------------+
    |(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
    +--------------------------------------------------------------------+
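
    To see what the Finisher changes, one option is to compare the column types before and after it. The following is a minimal sketch, assuming Spark NLP and Spark ML are on the classpath and that a local SparkSession is acceptable. The app name and the names `out`, `stemType`, and `finishedType` are illustrative, and the sentence detector is left out for brevity.

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import com.johnsnowlabs.nlp.annotator._ // Tokenizer, Stemmer
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Assumption: a local session is fine for this check.
val spark = SparkSession.builder()
  .appName("finisher-schema-check") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val documentAssembler =
  new DocumentAssembler().setInputCol("text").setOutputCol("document")
val tokenizer =
  new Tokenizer().setInputCols("document").setOutputCol("token")
val stemmer =
  new Stemmer().setInputCols("token").setOutputCol("stem")
val finisher = new Finisher()
  .setInputCols("stem")
  .setOutputCols("token_features")
  .setOutputAsArray(true)
  .setCleanAnnotations(false) // keep "stem" so both columns can be compared

val data =
  Seq("Peter Pipers employees are picking pecks of pickled peppers.")
    .toDF("text")

val out = new Pipeline()
  .setStages(Array(documentAssembler, tokenizer, stemmer, finisher))
  .fit(data)
  .transform(data)

// "stem" holds Spark NLP annotations (an array of structs, as in the error
// message); "token_features" holds the plain strings produced by the Finisher.
val stemType = out.schema("stem").dataType
val finishedType = out.schema("token_features").dataType
out.select("stem", "token_features").printSchema()
```

    If `token_features` is reported as an array of strings, it can be passed to CountVectorizer or to any other Spark ML stage that takes array<string> input.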
    

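    【Comments】:

    • I had been looking at this solution, but I read the documentation and saw that the return type was a string. Maybe I missed something. Thanks.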