Distribution of words per topic p(w|t) in Mallet

[Question title]: Distribution of words per topic p(w|t) in Mallet
[Posted]: 2016-06-10 23:49:06
[Question]:

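I need to get the word distribution for each topic found by Mallet, from Java (not from the CLI, as asked in "how to get a probability distribution for a topic in mallet?"). To illustrate what I mean, here is an example from "Introduction to Latent Dirichlet Allocation":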

Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

Mallet provides token "weights" for each topic, and in http://comments.gmane.org/gmane.comp.ai.mallet.devel/2064 someone tried to write a method that returns the word distribution per topic for Mallet.

I modified that method so that all weights are divided by their sum, as discussed in the mailing list thread above.
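
Written out, the normalization is meant to compute the following. The symbols are assumptions introduced here for illustration: n_{w,t} is the number of tokens of type w assigned to topic t, V is the vocabulary size (numTypes in the code), and \beta is the smoothing parameter (beta in the code).

$$
p(w \mid t) = \frac{n_{w,t} + \beta}{\sum_{w'=1}^{V} n_{w',t} + V\beta}
$$

Each numerator is the per-type weight before normalization, and the denominator is the sum of those weights over all types, so the values for one topic add up to 1.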

Does the following method (when added to ParallelTopicModel.java) correctly compute the word distribution per topic p(w|t) in Mallet?

/**
 * Get the normalized topic word weights (weights sum up to 1.0)
 * @param topic the topic
 * @return the normalized topic word weights (weights sum up to 1.0)
 */
public ArrayList<double[]> getNormalizedTopicWordWeights(int topic) {
    ArrayList<double[]> tokenWeights = new ArrayList<double[]>();
    for (int type = 0; type < numTypes; type++) {
        int[] topicCounts = typeTopicCounts[type];
        double weight = beta;
        int index = 0;
        while (index < topicCounts.length && topicCounts[index] > 0) {
            int currentTopic = topicCounts[index] & topicMask;
            if (currentTopic == topic) {
                weight += topicCounts[index] >> topicBits;
                break;
            }
            index++;
        }
        double[] tokenAndWeight = { (double) type, weight };
        tokenWeights.add(tokenAndWeight);
    }
    // normalize
    double sum = 0;
    // get the sum
    for (double[] tokenAndWeight : tokenWeights) {
        sum += tokenAndWeight[1];
    }
    // divide each element by the sum
    ArrayList<double[]> normalizedTokenWeights = new ArrayList<double[]>();
    for (double[] tokenAndWeight : tokenWeights) {
        tokenAndWeight[1] = tokenAndWeight[1]/sum;
        normalizedTokenWeights.add(tokenAndWeight);
    }
    return normalizedTokenWeights;
}

[Question comments]:

Tags: java nlp topic-modeling mallet

[Answer 1]:

This looks like it will work, but I have some comments on style.

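I'm not keen on using double arrays to represent type/weight pairs. If you are iterating over all types anyway, why not use a dense double[] array with the type as the index? An ArrayList might make sense if you need to sort the entries in another method outside this one, but the unnormalized intermediate ArrayList seems wasteful.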

The second summation loop seems unnecessary. You could initialize sum to numTypes * beta and then add weight - beta only when you encounter a type with a non-zero count.

If you define normalizer = 1.0/sum and then multiply rather than divide in the normalization loop, it often makes a noticeable difference in speed.
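
The three suggestions above can be put together in a minimal sketch. It is written as a standalone static method rather than as a member of ParallelTopicModel so that it compiles on its own; the class name TopicWordDistribution, the method name, and the sample counts are hypothetical. It assumes the same packed layout the question's code reads, where each entry is (count << topicBits) | topic and a zero entry ends the list, and that topicMask equals (1 << topicBits) - 1.

```java
class TopicWordDistribution {

    /**
     * Returns p(w|t) for one topic as a dense array indexed by type.
     * Assumes typeTopicCounts[type] holds packed ints of the form
     * (count << topicBits) | topic, with a zero entry ending the list.
     */
    static double[] normalizedTopicWordWeights(
            int[][] typeTopicCounts, int topic, int topicBits, double beta) {
        int numTypes = typeTopicCounts.length;
        int topicMask = (1 << topicBits) - 1; // assumption: mask covers exactly topicBits bits
        double[] weights = new double[numTypes];

        // Every type contributes at least beta, so the sum starts there
        // and only the non-zero counts are added during the main loop.
        double sum = numTypes * beta;

        for (int type = 0; type < numTypes; type++) {
            int[] topicCounts = typeTopicCounts[type];
            weights[type] = beta;
            for (int index = 0;
                 index < topicCounts.length && topicCounts[index] > 0;
                 index++) {
                if ((topicCounts[index] & topicMask) == topic) {
                    int count = topicCounts[index] >> topicBits;
                    weights[type] += count;
                    sum += count;
                    break;
                }
            }
        }

        // Multiply by the reciprocal once instead of dividing per element.
        double normalizer = 1.0 / sum;
        for (int type = 0; type < numTypes; type++) {
            weights[type] *= normalizer;
        }
        return weights;
    }

    public static void main(String[] args) {
        int topicBits = 2; // room for up to 4 topics in this toy example
        int[][] counts = {
            { (3 << topicBits) | 0, (1 << topicBits) | 1, 0 }, // type 0: 3x topic 0, 1x topic 1
            { (2 << topicBits) | 1, 0, 0 },                    // type 1: 2x topic 1
            { (1 << topicBits) | 0, 0, 0 },                    // type 2: 1x topic 0
        };
        double[] p = normalizedTopicWordWeights(counts, 0, topicBits, 0.01);
        for (int type = 0; type < p.length; type++) {
            System.out.printf("type %d: %.4f%n", type, p[type]);
        }
    }
}
```

Inside ParallelTopicModel the parameters would presumably be replaced by the existing fields (typeTopicCounts, topicBits, topicMask, beta, numTypes). If the words themselves are needed rather than type indices, the index into the returned array can be mapped back through the model's alphabet.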

[Comments]: