Latent Dirichlet allocation (LDA) in Spark

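【Question title】: Latent Dirichlet allocation (LDA) in Spark
【Posted】: 2017-02-05 10:54:18
【Question description】:

I am trying to write a program in Spark that performs latent Dirichlet allocation (LDA). This Spark documentation page provides a nice example of performing LDA on sample data. Below is the program: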

from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

# Load and parse the data
data = sc.textFile("data/mllib/sample_lda_data.txt")
parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)

# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize())
      + " words):")
topics = ldaModel.topicsMatrix()
for topic in range(3):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldaModel.vocabSize()):
        print(" " + str(topics[word][topic]))

# Save and load model
ldaModel.save(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")
sameModel = LDAModel\
    .load(sc, "target/org/apache/spark/PythonLatentDirichletAllocationExample/LDAModel")

The sample input used (sample_lda_data.txt) is as follows:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

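How do I modify the program so that it runs on a data file containing text rather than numbers? Let the sample file contain the following text:

Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm as follows:

Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset. Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (bag of words). Rather than estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated.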
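
The numeric sample file already holds one word count vector per line, so the step missing for raw text is the bag-of-words conversion that the passage above describes. Here is a minimal pure-Python sketch of that step. It assumes whitespace tokenization and lowercasing, both simplifications chosen for illustration. The names `build_vocabulary` and `to_count_vector` are hypothetical helpers, not Spark APIs.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in order of first appearance."""
    vocabulary = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def to_count_vector(document, vocabulary):
    """Return the word counts of one document as a dense list over the vocabulary."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


# Two short made-up documents, used only to show the shape of the result.
documents = [
    "topics correspond to cluster centers",
    "documents correspond to examples",
]
vocabulary = build_vocabulary(documents)
vectors = [to_count_vector(doc, vocabulary) for doc in documents]
```

Each resulting list has the same shape as one row of sample_lda_data.txt, so it could be wrapped with `Vectors.dense` in the same way as in the program above.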

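【Question discussion】:

Tags: python pyspark lda

【Solution 1】:

After doing some research, I am attempting to answer this question. Below is sample code that uses Spark to perform LDA on text documents containing real text data.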

from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vector, Vectors

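# sc (SparkContext) and spark (SparkSession) are assumed to exist already, as in the pyspark shell
path = "sample_text_LDA.txt"

# Pair each line with its index and split it into words
# (tuple-unpacking lambdas such as "lambda (words, idd): ..." are a syntax error in Python 3)
data = sc.textFile(path).zipWithIndex().map(lambda pair: Row(idd=pair[1], words=pair[0].split(" ")))
docDF = spark.createDataFrame(data)
# Build word count vectors; the estimator has its own name so it is not confused with the Vector class
countVectorizer = CountVectorizer(inputCol="words", outputCol="vectors")
model = countVectorizer.fit(docDF)
result = model.transform(docDF)

# Convert the ml vectors to mllib vectors, because LDA.train takes an RDD of [id, mllib vector]
corpus = result.select("idd", "vectors").rdd.map(lambda row: [row[0], Vectors.fromML(row[1])]).cache()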

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3,maxIterations=100,optimizer='online')
topics = ldaModel.topicsMatrix()
vocabArray = model.vocabulary

wordNumbers = 10  # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic = wordNumbers))

def topic_render(topic):  # map the term indices of a topic to the actual words
    terms = topic[0]  # each topic is a (term indices, term weights) pair
    # Iterate over the returned indices, so fewer than wordNumbers terms cannot raise an IndexError
    return [vocabArray[i] for i in terms]

topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()

for topic in range(len(topics_final)):
    print ("Topic" + str(topic) + ":")
    for term in topics_final[topic]:
        print (term)
    print ('\n')

The topics extracted from the text data given in the question are as follows:

【Discussion】:

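How can I print the probability value of each term?
It would help if you summarized the changes you made to the original LDA code from the Spark documentation.
Is there a way to get the documents in a topic?
@CpILL This might help: stackoverflow.com/questions/33072449/…
@prashanth What does optimizer='online' mean? Is it specific to the documents you are running it on?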
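
Two of the comments above ask how to print a weight for each term and how to find the documents in a topic. Here is a rough sketch of both, in plain Python so that it runs without a cluster. It assumes that `describeTopics` returns one (term indices, term weights) pair per topic, which is what the answer's `topic_render` relies on. It also assumes that a per-document topic distribution has already been obtained, a step the answer does not show. The helpers `topics_with_weights` and `documents_by_topic` and all the sample values are hypothetical.

```python
from collections import defaultdict


def topics_with_weights(described_topics, vocabulary):
    """Pair each term of each topic with its weight.

    described_topics: list of (term_indices, term_weights) pairs, one per topic.
    vocabulary: list mapping a term index to the actual word.
    """
    return [
        [(vocabulary[i], w) for i, w in zip(indices, weights)]
        for indices, weights in described_topics
    ]


def documents_by_topic(doc_topic_distributions):
    """Group document ids by their highest-weight topic.

    doc_topic_distributions: dict of document id -> list of topic weights.
    """
    groups = defaultdict(list)
    for doc_id, weights in doc_topic_distributions.items():
        dominant = max(range(len(weights)), key=lambda t: weights[t])
        groups[dominant].append(doc_id)
    return dict(groups)


# Made-up values for illustration only, not output from the model above.
vocabulary = ["lda", "topic", "cluster", "document"]
described = [([1, 0], [0.6, 0.3]), ([2, 3], [0.5, 0.4])]
for topic_id, terms in enumerate(topics_with_weights(described, vocabulary)):
    print("Topic " + str(topic_id) + ":")
    for word, weight in terms:
        print("  " + word + " " + str(weight))

distributions = {0: [0.8, 0.2], 1: [0.1, 0.9], 2: [0.7, 0.3]}
print(documents_by_topic(distributions))
```

Assigning each document to a single dominant topic is a simplification, since LDA gives every document a mixture over topics. A threshold on the weight may suit some uses better.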