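【Posted】: 2017-06-27 17:30:45
【Question】:
I want to use the Stanford Classifier for text classification. My features are mostly textual, but there are also some numeric features (for example, the length of a sentence).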
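I started from ClassifierExample and replaced the current features with a single real-valued feature F, which has the value 100 if the stop light is BROKEN and 0.1 otherwise. This results in the following code (apart from the makeStopLights() function, lines 10-16 in the original post, this is just the code of the original ClassifierExample class):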
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.classify.LinearClassifier;
import edu.stanford.nlp.classify.LinearClassifierFactory;
import edu.stanford.nlp.ling.Datum;
import edu.stanford.nlp.ling.RVFDatum;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.stats.Counter;

public class ClassifierExample {

    protected static final String GREEN = "green";
    protected static final String RED = "red";
    protected static final String WORKING = "working";
    protected static final String BROKEN = "broken";

    private ClassifierExample() {} // not instantiable

    // the definition of this function was changed!!
    // The only feature is the real-valued "F": 100 for BROKEN, 0.1 for WORKING.
    protected static Datum<String,String> makeStopLights(String ns, String ew) {
        String label = (ns.equals(ew) ? BROKEN : WORKING);
        Counter<String> counter = new ClassicCounter<>();
        counter.setCount("F", (label.equals(BROKEN)) ? 100 : 0.1);
        return new RVFDatum<>(counter, label);
    }

    public static void main(String[] args) {
        // Create a training set
        List<Datum<String,String>> trainingData = new ArrayList<>();
        trainingData.add(makeStopLights(GREEN, RED));
        trainingData.add(makeStopLights(GREEN, RED));
        trainingData.add(makeStopLights(GREEN, RED));
        trainingData.add(makeStopLights(RED, GREEN));
        trainingData.add(makeStopLights(RED, GREEN));
        trainingData.add(makeStopLights(RED, GREEN));
        trainingData.add(makeStopLights(RED, RED));
        // Create a test set
        Datum<String,String> workingLights = makeStopLights(GREEN, RED);
        Datum<String,String> brokenLights = makeStopLights(RED, RED);
        // Build a classifier factory
        LinearClassifierFactory<String,String> factory = new LinearClassifierFactory<>();
        factory.useConjugateGradientAscent();
        // Turn on per-iteration convergence updates
        factory.setVerbose(true);
        // Small amount of smoothing
        factory.setSigma(10.0);
        // Build a classifier
        LinearClassifier<String,String> classifier = factory.trainClassifier(trainingData);
        // Check out the learned weights
        classifier.dump();
        // Test the classifier
        System.out.println("Working instance got: " + classifier.classOf(workingLights));
        classifier.justificationOf(workingLights);
        System.out.println("Broken instance got: " + classifier.classOf(brokenLights));
        classifier.justificationOf(brokenLights);
    }
}
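As I understand linear classifiers, the feature F should make the classification task fairly easy: after all, we only need to check whether the value of F is greater than some threshold. However, the classifier returns WORKING for every instance in the test set.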
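To make that intuition concrete, here is a minimal sketch in plain Java that uses no Stanford classes at all. The class name ThresholdSketch and the hand-picked WEIGHT and BIAS are hypothetical values chosen only for illustration; they are not weights learned by any library. The sketch only shows that a linear rule on a single feature can, in principle, separate the two values of F.

```java
public class ThresholdSketch {

    // Hypothetical, hand-picked parameters (an assumption for illustration,
    // not the output of LinearClassifierFactory or any other trainer).
    static final double WEIGHT = 1.0;
    static final double BIAS = -50.0;

    // Linear decision rule on one real-valued feature f:
    // predict "broken" when WEIGHT * f + BIAS > 0, which here means f > 50.
    static String classify(double f) {
        double score = WEIGHT * f + BIAS;
        return score > 0 ? "broken" : "working";
    }

    public static void main(String[] args) {
        System.out.println("F = 100 -> " + classify(100.0)); // prints "F = 100 -> broken"
        System.out.println("F = 0.1 -> " + classify(0.1));   // prints "F = 0.1 -> working"
    }
}
```

With the feature values from the code above (100 for BROKEN, 0.1 for WORKING), any threshold strictly between the two would do, which is why I expected the trained classifier to separate the classes.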
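So my question is: am I doing something wrong, do I need to change other parts of the code for real-valued features to work, or is my understanding of linear classifiers flawed?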
Tags: java machine-learning classification stanford-nlp text-classification