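[Posted]: 2015-04-29 10:28:20
[Question]:
I am practicing with Weka using the Reuters data. I am using the StringToWordVector filter to convert my string data (shown below) so that I can analyze the articles and see which words predict the article type. The original dataset marks the article type as TRUE/FALSE, but I converted that to 0/1. However, Weka refuses to apply the StringToWordVector filter to the "review" string attribute in this ARFF file.
With only the review attribute selected, I used the following StringToWordVector filter: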
weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""
When I apply only this filter, I get this error: "Problem filtering instances: attribute names are not unique. Cause: sentiment".
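One possible reading of this error (an assumption, not something confirmed in the post): the filter creates one new attribute per word, so if the word "sentiment" occurs anywhere in the review text, the generated attribute would share its name with the existing class attribute. The Python sketch below only imitates that naming step outside Weka. The function names, the sample documents, and the `w_` prefix are hypothetical; the delimiter set is copied from the `-delimiters` option above.

```python
# Delimiters copied from the -delimiters option of the filter command.
DELIMITERS = " \r\n\t.,;:'\"()?!"


def tokenize(text):
    """Split text on the delimiter characters, dropping empty tokens."""
    table = str.maketrans({d: " " for d in DELIMITERS})
    return text.translate(table).split()


def find_name_collisions(existing_attributes, documents, prefix=""):
    """Return generated word-attribute names that clash with existing names."""
    generated = {prefix + tok for doc in documents for tok in tokenize(doc)}
    return sorted(generated & set(existing_attributes))


# Hypothetical review texts: the word "sentiment" is assumed here,
# it does not appear in the sample rows shown in the post.
docs = ["cocoa prices rose", "market sentiment improved on cocoa sales"]

collisions = find_name_collisions(["sentiment"], docs)
prefixed = find_name_collisions(["sentiment"], docs, prefix="w_")

print(collisions)  # names that would clash without a prefix
print(prefixed)    # names that would clash with the hypothetical "w_" prefix
```

If this is what is happening, renaming the class attribute to something unlikely to occur as a word, or giving the generated attributes a prefix (StringToWordVector has a `-P` option for an attribute-name prefix), would be things to try.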
Here is the header of my dataset, followed by a few instances to show the format:
@relation text_files
@attribute review string
@attribute sentiment {0, 1}
@data
"cocoa the the cocoa the early the levels its the the this the ended the mln against at the that cocoa the to crop cocoa to crop around mln sales at mln the to this cocoa export the their cocoa prices to to per to offer sales at to dlrs per to to crop sales to at dlrs at dlrs at dlrs per sales at at at at to dlrs at at dlrs the currency sales at to dlrs dlrs dlrs the currency sales at at dlrs at at dlrs at at sales at mln against the crop mln against the the to to the cocoa commission reuter", 0
"prices reserve the agriculture department reported the reserve price loan call price price wheat corn 1986 loan call price price reserves grain wheat per reuter", 0
"grain crop their products to to wheat export the export wheat oil oil reuter", 0
"inc the stock corp its dlrs oil to dlrs production its the company to its to profit to reuter", 0
"products stock split products inc its stock split its common shares shareholders the company its to to shareholders at the the stock mln to mln reuter", 0
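Does anyone have an idea why this is happening? I suspect a conflict caused by the data possibly containing 0 and 1 as part of naturally occurring words in the text. I also wonder whether I need an extra space before the opening quote of a string that follows a previous instance.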
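A quick way to test the spacing idea outside Weka is to check that every data row splits into exactly one quoted text field and one label. The sketch below is a rough check only: it assumes one instance per line and uses Python's `csv` module as a stand-in for ARFF parsing, which is not identical to Weka's own reader. `parse_data_rows` is a hypothetical helper name.

```python
import csv


def parse_data_rows(lines):
    """Parse rows of the form "text", label into (text, label) pairs.

    Raises ValueError when a row does not split into exactly two fields,
    for example when two instances end up on the same line.
    """
    reader = csv.reader(lines, quotechar='"', skipinitialspace=True)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                f"expected 2 fields, got {len(fields)}: {fields!r}"
            )
        text, label = fields
        rows.append((text, label.strip()))
    return rows


# Two rows taken from the post, one instance per line.
sample = [
    '"grain crop their products to to wheat export the export wheat oil oil reuter", 0',
    '"inc the stock corp its dlrs oil to dlrs production its the company to its to profit to reuter", 0',
]
rows = parse_data_rows(sample)
for text, label in rows:
    print(label, text[:30])
```

Rows that pass this check have the expected two fields, which would suggest the quoting and spacing are not the source of the error.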
[Discussion]:
Tags: machine-learning weka nlp