【发布时间】:2021-01-06 02:34:54
【问题描述】:
我使用 pyspark 训练了一个 LDA 模型,以按主题对文本进行分类,尝试了不同的 K 值。但是,要验证选定的 K,我想使用这种方法 evaluate-topic-model-in-python-latent-dirichlet-allocation-lda
但是,对于 spark.ml,我不知道如何获得等效的 gensim CoherenceModel。
数据框如下所示:
tokenizedText.show(truncate=True, n=5)
+------------+--------------------+
| ID| Tokens|
+------------+--------------------+
|0000qaqdWUAQ|[limpieza, mala, ...|
|0000qaqe2UAA|[transporte, deja...|
|0000qasxUUAQ| [correcto]|
|0000qatEJUAY| [bien]|
|0000qaqwMUAQ|[experiencia, agr...|
+------------+--------------------+
基本模型是这样的:
from pyspark.ml.feature import IDF, HashingTF, Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA, LDAModel
counter = CountVectorizer(inputCol="Tokens", outputCol="term_frequency", minDF=5)
counterModel = counter.fit(tokenizedText)
vectorizedLaw = counterModel.transform(trainingData)
idf = IDF(inputCol="term_frequency", outputCol="tf_idf")
tfidfLaw = idf.fit(vectorizedLaw).transform(vectorizedLaw)
lda = LDA(k=7, maxIter=50, featuresCol="tf_idf", seed=1234)
model = lda.fit(tfidfLaw)
我得到:
model.logLikelihood(tfidfLaw)
Out[295]: -17745244.739330653
model.logPerplexity(tfidfLaw)
Out[296]: 7.63661972904619
使用 gensim 并按照evaluate-topic-model-in-python-latent-dirichlet-allocation-lda(计算模型的困惑度和连贯性分数以及超参数调整)示例,
由于数据量大,它不可行。经过长时间的执行,我得到了这个错误:
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.net.NoRouteToHostException: No route to host
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779)
at shaded.v9_4.org.eclipse.jetty.io.SelectorManager.doFinishConnect(SelectorManager.java:355)
at shaded.v9_4.org.eclipse.jetty.io.ManagedSelector.processConnect(ManagedSelector.java:232)
at shaded.v9_4.org.eclipse.jetty.io.ManagedSelector.access$1400(ManagedSelector.java:62)
at shaded.v9_4.org.eclipse.jetty.io.ManagedSelector$SelectorProducer.processSelected(ManagedSelector.java:543)
at shaded.v9_4.org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:401)
at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at shaded.v9_4.org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:367)
at shaded.v9_4.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)
at shaded.v9_4.org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:914)
at java.base/java.lang.Thread.run(Thread.java:834)
我在 Databricks 运行时版本 6.5 ML(包括 Apache Spark 2.4.5、Scala 2.11)上运行,驱动程序类型:15.3 GB 内存、2 个内核、1 个 DBU。
您是否知道使用 pyspark.ml @ @ Gensim 一致性,以获得适当的选择得分避免执行问题? .
【问题讨论】:
-
我也面临同样的问题。有人可以帮忙吗?谢谢
标签: python apache-spark pyspark nlp databricks