A simple stemming algorithm with String for input: answers

【Question title】: A simple stemming algorithm with String for input
【Posted】: 2014-03-25 14:23:05
【Question】:

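I have been looking into stemming algorithms such as the Porter algorithm, but everything I have found so far takes a file as input.

Is there an existing implementation that lets me simply pass a String to the stemmer and get the stemmed String back?

Something like: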

String toBeStemmed = "The man worked tirelessly";
Stemmer s = new Stemmer();

String stemmed = s.stem(toBeStemmed);

【Comments】:

A good site about the Porter stemmer is tartarus.org/martin/PorterStemmer

Tags: java algorithm stemming porter-stemmer

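【Solution 1】:

The algorithm itself does not take a file. The surrounding code probably opens the file and reads it in as a series of strings, which are then fed to the algorithm. You only need to find the part of the code that reads strings from the file and pass your own strings to the algorithm in the same way.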
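
To illustrate, here is a minimal sketch of the kind of read loop such code tends to have, written against `java.io.Reader` so that a `StringReader` can stand in for a file reader. The names `ReaderStemDemo`, `stemAll`, and `toyStem` are hypothetical, and `toyStem` is only a placeholder rule, not the Porter algorithm; a real stemmer would be plugged in at that point.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.UnaryOperator;

class ReaderStemDemo {

    // Reads characters from any Reader, stems each run of letters with the
    // supplied word stemmer, and copies every other character through unchanged.
    static String stemAll(Reader in, UnaryOperator<String> wordStemmer) throws IOException {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isLetter(c)) {
                word.append(Character.toLowerCase((char) c));
            } else {
                flush(word, out, wordStemmer);
                out.append((char) c);
            }
        }
        flush(word, out, wordStemmer); // last word, if the input ends in a letter
        return out.toString();
    }

    // Stems the buffered word (if any), appends it to the output, and clears the buffer.
    private static void flush(StringBuilder word, StringBuilder out,
                              UnaryOperator<String> wordStemmer) {
        if (word.length() > 0) {
            out.append(wordStemmer.apply(word.toString()));
            word.setLength(0);
        }
    }

    // Placeholder only: strips a trailing "ed". This is NOT the Porter algorithm.
    static String toyStem(String w) {
        return (w.length() > 4 && w.endsWith("ed")) ? w.substring(0, w.length() - 2) : w;
    }

    // Same loop, fed from a String instead of a file.
    static String stemAll(String text) {
        try {
            return stemAll(new StringReader(text), ReaderStemDemo::toyStem);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `ReaderStemDemo.stemAll("The man worked tirelessly")` runs the same loop a file-based driver would, only the source of characters differs.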

【Comments】:

【Solution 2】:

In your example, toBeStemmed is a sentence, so you first need to tokenize it. Then you stem the individual tokens/words, such as "worked" or "tirelessly".
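
The tokenize-then-stem flow could be sketched roughly as follows. `SentenceStemmer` and `toyStem` are hypothetical names used for illustration: `toyStem` is a crude suffix stripper, not the Porter algorithm or jmorph, and splitting on whitespace is the simplest possible tokenizer.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

class SentenceStemmer {
    // Any word-level stemmer can be supplied here (String in, String out).
    private final Function<String, String> wordStemmer;

    SentenceStemmer(Function<String, String> wordStemmer) {
        this.wordStemmer = wordStemmer;
    }

    // Splits the sentence on whitespace, lower-cases and stems each token,
    // then joins the stems back together with single spaces.
    String stem(String sentence) {
        return Arrays.stream(sentence.trim().split("\\s+"))
                .map(token -> token.toLowerCase(Locale.ROOT))
                .map(wordStemmer)
                .collect(Collectors.joining(" "));
    }

    // Placeholder rule: removes at most one common suffix while keeping
    // at least three characters of the word. NOT a real stemming algorithm.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "ly", "ed"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }
}
```

With this wrapper, `new SentenceStemmer(SentenceStemmer::toyStem).stem(toBeStemmed)` has the String-in, String-out shape asked for in the question, and the constructor argument is where a real per-word stemmer would go.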

Here is a fine morphological analyzer that I have used as a stemmer in several projects.

Stemmer JAR: https://code.google.com/p/hunglish-webapp/source/browse/trunk/#trunk%2Flib%2Fnet%2Fsf%2Fjhunlang%2Fjmorph%2F1.0
Stemmer source: https://code.google.com/p/j-morph/source/checkout
Language resource files: https://code.google.com/p/hunglish-webapp/source/browse/trunk/#trunk%2Fsrc%2Fmain%2Fresources%2Fresources-lang%2Fjmorph
How I use it with Lucene: https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/java/hu/mokk/hunglish/jmorph/
Properties file: https://code.google.com/p/hunglish-webapp/source/browse/trunk/src/main/resources/META-INF/spring/stemmer.properties

Example usage:

import java.util.List;

import net.sf.jhunlang.jmorph.lemma.Lemma;
import net.sf.jhunlang.jmorph.lemma.Lemmatizer;
import net.sf.jhunlang.jmorph.analysis.Analyser;
import net.sf.jhunlang.jmorph.analysis.AnalyserContext;
import net.sf.jhunlang.jmorph.analysis.AnalyserControl;
import net.sf.jhunlang.jmorph.factory.Definition;
import net.sf.jhunlang.jmorph.factory.JMorphFactory;
import net.sf.jhunlang.jmorph.parser.ParseException;
import net.sf.jhunlang.jmorph.sample.AnalyserConfig;
import net.sf.jhunlang.jmorph.sword.parser.EnglishAffixReader;
import net.sf.jhunlang.jmorph.sword.parser.EnglishReader;
import net.sf.jhunlang.jmorph.sword.parser.SwordAffixReader;
import net.sf.jhunlang.jmorph.sword.parser.SwordReader;

AnalyserConfig acEn = new AnalyserConfig();
//TODO: set path to the English affix file
String enAff = "src/main/resources/resources-lang/jmorph/en.aff"; 
Definition affixDef = acEn.createDefinition(enAff, "utf-8", EnglishAffixReader.class);
//TODO set path to the English dict file
String enDic = "src/main/resources/resources-lang/jmorph/en.dic"; 
Definition dicDef = acEn.createDefinition(enDic, "utf-8", EnglishReader.class);
int enRecursionDepth = 3;
acEn.setRecursionDepth(affixDef, enRecursionDepth);
JMorphFactory jf = new JMorphFactory();
Analyser enAnalyser = jf.build(new Definition[] { affixDef, dicDef });
// Control settings and context for the analyser
AnalyserControl enControl = new AnalyserControl(AnalyserControl.ALL_COMPOUNDS);
AnalyserContext analyserContextEn = new AnalyserContext(enControl);
boolean enStripDerivates = true;
Lemmatizer enLemmatizer = new net.sf.jhunlang.jmorph.lemma.LemmatizerImpl(enAnalyser, enStripDerivates, analyserContextEn);


// After the somewhat complex initialization, lemmatize a single word
List<Lemma> lemmas = enLemmatizer.lemmatize("worked");
for (Lemma lemma : lemmas) {
    System.out.println(lemma.getWord());
}

【Comments】: