Weka - StringtoVector Filter Not Working

【Question title】: Weka - StringtoVector Filter Not working
【Posted】: 2015-04-29 10:28:20
【Question】:

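I am practicing Weka with the Reuters data. I use the StringToWordVector filter to convert my string data (shown below) so that I can analyze the articles and see which words predict the article type. The original dataset marks the article type as TRUE/FALSE, but I converted that to 0/1. However, Weka refuses to apply the StringToWordVector filter to the "review" string attribute of this ARFF file.

I used the following StringToWordVector filter with only the review attribute checked: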

weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""

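I get this error when I apply just this filter: "Problem filtering instances: Attribute names are not unique. Cause: sentiment".

Here is the header of my dataset, along with a few cases to show the format: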

@relation text_files
@attribute review string
@attribute sentiment {0, 1}
@data
"cocoa the the cocoa the early the levels its the the this the ended the mln against at the that cocoa the to crop cocoa to crop around mln sales at mln the to this cocoa export the their cocoa prices to to per to offer sales at to dlrs per to to crop sales to at dlrs at dlrs at dlrs per sales at at at at to dlrs at at dlrs the currency sales at to dlrs dlrs dlrs the currency sales at at dlrs at at dlrs at at sales at mln against the crop mln against the the to to the cocoa commission reuter", 0
"prices reserve the agriculture department reported the reserve price loan call price price wheat corn 1986 loan call price price reserves grain wheat per reuter", 0
"grain crop their products to to wheat export the export wheat oil oil reuter", 0
"inc the stock corp its dlrs oil to dlrs production its the company to its to profit to reuter", 0
"products stock split products inc its stock split its common shares shareholders the company its to to shareholders at the the stock mln to mln reuter", 0

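Does anyone have an idea why this happens? I suspect a conflict caused by the data containing 0 and 1 as part of words that occur naturally in the text. I also wonder whether I need an extra space before the opening quote of each string that follows a previous string.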

【Comments】:

Tags: machine-learning weka nlp

【Solution 1】:

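The simplest way to avoid these attribute name conflicts is to use a prefix for the generated attributes.

The prefix can be set with the -P command-line option, with the attributeNamePrefix property, or with the setAttributeNamePrefix method from Java code.

See the Javadoc for the StringToWordVector filter.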
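
As a rough illustration of the -P option, the Python sketch below assembles a command line that would run the filter with a prefix. The jar path, the file names, and the prefix `w_` are assumptions made for the example; `-i` and `-o` are Weka's general filter options for the input and output files. The command is only built and printed here, not executed.

```python
import shlex

def build_filter_command(input_arff, output_arff, prefix="w_", weka_jar="weka.jar"):
    """Assemble a StringToWordVector command line with an attribute-name prefix.

    All argument values are hypothetical placeholders; adjust them to your setup.
    """
    return [
        "java", "-cp", weka_jar,
        "weka.filters.unsupervised.attribute.StringToWordVector",
        "-i", input_arff,      # input ARFF file
        "-o", output_arff,     # output ARFF file
        "-R", "first-last",    # attribute range, as in the question
        "-W", "1000",          # number of words to keep, as in the question
        "-P", prefix,          # prefix for every generated word attribute
    ]

cmd = build_filter_command("reviews.arff", "reviews_vectorized.arff")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

With a prefix in place, a word such as "sentiment" in the text would produce an attribute named `w_sentiment`, which no longer clashes with the existing `sentiment` attribute.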

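【Comments】:

【Solution 2】:

I ran into the same problem because the word "domain" appeared in my data, which confused the filter. My solution was to remove every occurrence of "domain" from the data and keep "domain" only in the @attribute declaration.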
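
One possible way to do this kind of cleanup is sketched below in Python. The helper name `drop_word` and the sample sentence are made up for illustration; the sketch removes whole-word matches only, so "domains" is left untouched.

```python
import re

def drop_word(text, word):
    """Remove whole-word occurrences of `word` and collapse leftover spaces."""
    cleaned = re.sub(r"\b" + re.escape(word) + r"\b", " ", text)
    return " ".join(cleaned.split())

# Invented sample text, not taken from any real dataset.
sample = "the domain of wheat prices and the domain of corn"
print(drop_word(sample, "domain"))
```

Note that this approach throws the word away entirely, so it can no longer serve as a feature.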

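【Comments】:

【Solution 3】:

Hi, the problem is that the filter converts every term in the strings into an attribute. Your data section must contain the term "review" or "sentiment" somewhere, so that attribute name ends up duplicated.

So rename these two attributes to "myreview" and "mysentiment", or to any names that are unlikely to appear in your data. It should then work.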
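
The check and the rename can be sketched in Python as follows. This is a simplified illustration, not a full ARFF parser: it assumes unquoted attribute names, a well-formed file with `@data` on its own line, and a simple word pattern that only approximates Weka's tokenizer. The function names and the two data rows in the example are invented.

```python
import re

def split_arff(arff_text):
    """Split ARFF text into header lines (up to and including @data) and data lines."""
    lines = arff_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower().startswith("@data"):
            return lines[:i + 1], lines[i + 1:]
    return lines, []

def attribute_names(header_lines):
    """Collect the names declared on @attribute lines (unquoted names only)."""
    names = []
    for line in header_lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0].lower() == "@attribute":
            names.append(parts[1])
    return names

def colliding_names(arff_text):
    """Return attribute names that also occur as words in the data section."""
    header, data = split_arff(arff_text)
    words = set(re.findall(r"[A-Za-z0-9_]+", " ".join(data)))
    return [name for name in attribute_names(header) if name in words]

def rename_attributes(arff_text, prefix="my"):
    """Prepend `prefix` to every attribute name in the header."""
    header, data = split_arff(arff_text)
    renamed = []
    for line in header:
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].lower() == "@attribute":
            line = f"{parts[0]} {prefix}{parts[1]} {parts[2]}"
        renamed.append(line)
    return "\n".join(renamed + data)

# Invented two-row example in which the word "sentiment" occurs in the text.
example = """@relation text_files
@attribute review string
@attribute sentiment {0, 1}
@data
"market sentiment on cocoa prices reuter", 0
"wheat export prices reuter", 1
"""

print(colliding_names(example))  # ['sentiment']
print(rename_attributes(example))
```

Running `colliding_names` on the full dataset before filtering shows which attribute names need to change.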

【Comments】: