Mixing Spark MLlib and Spark NLP in a pipeline

【Question title】: Mix Spark MLlib and Spark NLP in a pipeline
【Posted】: 2021-11-27 19:11:24
【Question description】:

In an MLlib pipeline, how can I chain a CountVectorizer (from Spark ML) after a Stemmer (from Spark NLP)?

When I try to use both in the same pipeline, I get:

myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.

Regards,

【Question comments】:

Tags: scala apache-spark apache-spark-mllib johnsnowlabs-spark-nlp

【Solution 1】:

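You need to add a Finisher to your Spark NLP pipeline. It converts the annotation structs produced by Spark NLP annotators into plain string arrays, which is the input type CountVectorizer expects. Try this: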

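  // Spark NLP stages (DocumentAssembler, Finisher, and the annotators)
  import com.johnsnowlabs.nlp.base._
  import com.johnsnowlabs.nlp.annotator._
  // Spark ML stages
  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.feature.CountVectorizer
  // Needed for .toDF; assumes a SparkSession named `spark`, as in spark-shell
  import spark.implicits._

  val documentAssembler =
    new DocumentAssembler().setInputCol("text").setOutputCol("document")
  val sentenceDetector =
    new SentenceDetector().setInputCols("document").setOutputCol("sentences")
  val tokenizer =
    new Tokenizer().setInputCols("sentences").setOutputCol("token")
  val stemmer = new Stemmer()
    .setInputCols("token")
    .setOutputCol("stem")

  // Convert the "stem" annotations into a plain array<string> column
  val finisher = new Finisher()
    .setInputCols("stem")
    .setOutputCols("token_features")
    .setOutputAsArray(true)
    .setCleanAnnotations(false)

  // CountVectorizer reads the finished array<string> column
  val cv = new CountVectorizer()
    .setInputCol("token_features")
    .setOutputCol("features")

  val pipeline = new Pipeline()
    .setStages(
      Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        stemmer,
        finisher,
        cv
      ))

  val data =
    Seq("Peter Pipers employees are picking pecks of pickled peppers.")
      .toDF("text")

  val model = pipeline.fit(data)
  val df = model.transform(data)

  // Show the vectorized column without truncation
  df.select("features").show(false)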

Output:

+--------------------------------------------------------------------+
|features                                                            |
+--------------------------------------------------------------------+
|(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
+--------------------------------------------------------------------+
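To see why the Finisher removes the type error, here is a minimal plain-Scala sketch of the conversion involved. It is not the Spark NLP API: the `Annotation` case class simply mirrors the struct fields printed in the error message, `FinisherSketch.finish` is a hypothetical helper, and the sample values are made up for illustration.

```scala
// Plain-Scala model of the column types involved; NOT the Spark NLP API.
// The fields mirror the struct printed in the error message.
final case class Annotation(
    annotatorType: String,
    begin: Int,
    end: Int,
    result: String,
    metadata: Map[String, String],
    embeddings: Seq[Float]
)

object FinisherSketch {
  // Hypothetical helper: keep only each annotation's `result` string,
  // turning array<struct<...>> into array<string>.
  def finish(annotations: Seq[Annotation]): Seq[String] =
    annotations.map(_.result)

  def main(args: Array[String]): Unit = {
    // Illustrative values only, not real Spark NLP output.
    val stems = Seq(
      Annotation("token", 0, 4, "peter", Map("sentence" -> "0"), Seq.empty),
      Annotation("token", 6, 11, "piper", Map("sentence" -> "0"), Seq.empty)
    )
    println(finish(stems)) // only the plain strings remain
  }
}
```

In the real pipeline the Stemmer output column holds the struct form, so CountVectorizer rejects it; the column written by the Finisher holds plain strings, which it accepts.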

【Comments】:

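I had been looking for this solution, but when I read the documentation I saw that the return type was a string. Maybe I missed something. Thanks.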