Problem understanding chi-squared feature selection: answers

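【Question title】: Problem understanding chi-squared feature selection
【Posted】: 2011-07-01 16:50:59
【Question description】:

I'm having trouble understanding chi-squared feature selection. I have two classes, positive and negative, each containing different terms and term counts. I need to perform chi-squared feature selection to extract the most representative terms for each class. The problem is that I end up with exactly the same terms for both the positive and the negative class. Here is the Python code I use to select features: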

#!/usr/bin/python

# import the necessary libraries
import math

class ChiFeatureSelector:
    def __init__(self, extCorpus, lookupCorpus):
        # store the extraction corpus and lookup corpus
        self.extCorpus = extCorpus
        self.lookupCorpus = lookupCorpus

    def select(self, outPath):
        # dictionary of chi-squared scores
        scores = {}

        # loop over the words in the extraction corpus
        for w in self.extCorpus.getTerms():
            # build the chi-squared table
            n11 = float(self.extCorpus.getTermCount(w))
            n10 = float(self.lookupCorpus.getTermCount(w))
            n01 = float(self.extCorpus.getTotalDocs() - n11)
            n00 = float(self.lookupCorpus.getTotalDocs() - n10)

            # perform the chi-squared calculation and store
            # the score in the dictionary
            a = n11 + n10 + n01 + n00
            b = ((n11 * n00) - (n10 * n01)) ** 2
            c = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
            chi = (a * b) / c
            scores[w] = chi

        # sort the scores in descending order
        scores = sorted([(v, k) for (k, v) in scores.items()], reverse = True)
        i = 0

        for (v, k) in scores:
            print(str(k) + " : " + str(v))
            i += 1

            if i == 10:
                break

This is how I use the class (some code is omitted for brevity, and yes, I have checked that the two corpora do not contain exactly the same data):

# perform positive ngram feature selection
print("positive:\n")
f = ChiFeatureSelector(posCorpus, negCorpus)
f.select(posOutputPath)

# perform negative ngram feature selection
print("\nnegative:\n")
f = ChiFeatureSelector(negCorpus, posCorpus)
f.select(negOutputPath)

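I suspect the error comes from how I compute the term/document table, but I'm not sure. Maybe I'm misunderstanding something. Can anyone point me in the right direction?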

【Comments on the question】:

Can you add some sample data from extCorpus and lookupCorpus? Just enough to see the structure...
Sorry, I meant negCorpus and posCorpus.

Tags: python statistics information-retrieval chi-squared

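【Solution 1】:

In the two-class case, the chi-squared ranking of the features is the same if the two data sets are exchanged. The top-ranked features are the ones that differ the most between the two classes.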
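The symmetry can be checked directly. Swapping the two corpora exchanges n11 with n10 and n01 with n00 in the question's table; the squared difference in the numerator and the product of four sums in the denominator are both unchanged by that exchange. Below is a minimal sketch, assuming Python 3 and made-up document counts; the helper names `chi2` and `scores` and the `doc_counts` data are hypothetical and are not part of the question's code.

```python
def chi2(n11, n10, n01, n00):
    """Chi-squared score of a 2x2 term/class table (same formula as the question)."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return float(num) / den if den else 0.0


# Hypothetical data: number of documents containing each term, per class.
pos_docs, neg_docs = 100, 100
doc_counts = {"great": (60, 5), "awful": (4, 55), "movie": (50, 48)}


def scores(swap):
    """Score every term; swap=True treats the negative class as the extraction corpus."""
    out = {}
    for term, (in_pos, in_neg) in doc_counts.items():
        if swap:
            ext, ext_total, look, look_total = in_neg, neg_docs, in_pos, pos_docs
        else:
            ext, ext_total, look, look_total = in_pos, pos_docs, in_neg, neg_docs
        out[term] = chi2(ext, look, ext_total - ext, look_total - look)
    return out


pos_scores = scores(swap=False)
neg_scores = scores(swap=True)
pos_ranking = sorted(pos_scores, key=pos_scores.get, reverse=True)
neg_ranking = sorted(neg_scores, key=neg_scores.get, reverse=True)
print(pos_ranking == neg_ranking)
```

With these counts both rankings put "great" and "awful" ahead of "movie", whichever class is treated as the extraction corpus.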

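【Comments】:

+1. Feature selection doesn't give you "strongly positive" and "strongly negative" features, but strongly discriminating ones. By the way, this is also true in the multi-class case.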
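If per-class term lists are still wanted, one possible approach (not something proposed in this thread) is to keep the chi-squared score as the measure of how discriminating a term is, and to use the sign of n11 * n00 - n10 * n01 to decide which class the term leans toward. A minimal sketch, assuming Python 3 and the same kind of made-up counts; `chi2_with_direction` and `doc_counts` are hypothetical names.

```python
def chi2_with_direction(n11, n10, n01, n00):
    """Return (chi-squared score, sign of the association with the first class)."""
    n = n11 + n10 + n01 + n00
    diff = n11 * n00 - n10 * n01  # > 0: term is more frequent in the first class
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    score = float(n * diff ** 2) / den if den else 0.0
    sign = (diff > 0) - (diff < 0)  # 1, 0 or -1
    return score, sign


# Hypothetical data: number of documents containing each term, per class.
pos_docs, neg_docs = 100, 100
doc_counts = {"great": (60, 5), "awful": (4, 55), "movie": (50, 48)}

positive, negative = [], []
for term, (in_pos, in_neg) in doc_counts.items():
    score, sign = chi2_with_direction(
        in_pos, in_neg, pos_docs - in_pos, neg_docs - in_neg
    )
    if sign > 0:
        positive.append((score, term))
    elif sign < 0:
        negative.append((score, term))

# Highest-scoring (most discriminating) terms first within each class.
positive.sort(reverse=True)
negative.sort(reverse=True)
print([t for _, t in positive])
print([t for _, t in negative])
```

A weakly discriminating term such as "movie" still lands in one of the lists, so in practice a minimum score threshold would probably be applied before splitting by sign.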