
Why jieba?

  • High segmentation speed
  • The corpus was built with jieba (Python)

The Java port of jieba

  • Download
git clone https://github.com/huaban/jieba-analysis
  • Build
cd jieba-analysis
mvn install
  • Note
If your Maven version is recent, you need to modify pom.xml, adding an entry before the plugins section.
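Once the library is installed locally, it can be exercised directly. A minimal sketch of jieba-analysis's segmentation API (the sample sentence is arbitrary):

```java
import com.huaban.analysis.jieba.JiebaSegmenter;
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
import com.huaban.analysis.jieba.SegToken;
import java.util.List;

public class JiebaDemo {
    public static void main(String[] args) {
        JiebaSegmenter segmenter = new JiebaSegmenter();
        // SEARCH mode is intended for query strings; INDEX mode splits
        // long words further, which suits document indexing.
        List<SegToken> tokens = segmenter.process("我来到北京清华大学", SegMode.SEARCH);
        for (SegToken t : tokens) {
            System.out.println(t.word + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```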

Solr tokenizer version

Supports Solr 6, 7, or newer.

If your Solr, like mine, is fairly new, the code needs some modifications, but the changes are small: just follow the errors the compiler reports.

The build.gradle diff

diff --git a/build.gradle b/build.gradle
index 2a87525..06c5cc3 100644
--- a/build.gradle
+++ b/build.gradle
@@ -1,4 +1,4 @@
-group = 'analyzer.solr5'
+group = 'analyzer.solr7'
version = '1.0'
apply plugin: 'java'
apply plugin: "eclipse"
@@ -14,15 +14,14 @@ repositories {
dependencies {
testCompile group: 'junit', name: 'junit', version: '4.11'

- compile("org.apache.lucene:lucene-core:5.0.0")
- compile("org.apache.lucene:lucene-queryparser:5.0.0")
- compile("org.apache.lucene:lucene-analyzers-common:5.0.0")
- compile('com.huaban:jieba-analysis:1.0.0')
-// compile("org.fnlp:fnlp-core:2.0-SNAPSHOT")
+ compile("org.apache.lucene:lucene-core:7.1.0")
+ compile("org.apache.lucene:lucene-queryparser:7.1.0")
+ compile("org.apache.lucene:lucene-analyzers-common:7.1.0")
+ compile files('libs/jieba-analysis-1.0.3.jar')
compile("edu.stanford.nlp:stanford-corenlp:3.5.1")
}

task "create-dirs" << {
sourceSets*.java.srcDirs*.each { it.mkdirs() }
sourceSets*.resources.srcDirs*.each { it.mkdirs() }
-}
\ No newline at end of file
+}
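Beyond the dependency bump, the factory has to match the Lucene 7 TokenizerFactory contract (`create(AttributeFactory)`). The repo's class is analyzer.solr7.jieba.JiebaTokenizerFactory; what follows is a hypothetical sketch of that shape, not the repo's actual code:

```java
package analyzer.solr7.jieba;

import com.huaban.analysis.jieba.JiebaSegmenter;
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
import com.huaban.analysis.jieba.SegToken;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

public class JiebaTokenizerFactory extends TokenizerFactory {
    private final SegMode segMode;

    public JiebaTokenizerFactory(Map<String, String> args) {
        super(args);
        // segMode comes from the schema attribute, defaulting to SEARCH
        segMode = SegMode.valueOf(get(args, "segMode", "SEARCH"));
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory) {
        return new JiebaTokenizer(factory, segMode);
    }
}

final class JiebaTokenizer extends Tokenizer {
    private final SegMode segMode;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private Iterator<SegToken> tokens;
    private int finalOffset;

    JiebaTokenizer(AttributeFactory factory, SegMode segMode) {
        super(factory);
        this.segMode = segMode;
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        if (tokens == null) {
            // Read the whole field value and hand it to jieba in one call
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int n; (n = input.read(buf)) != -1; ) {
                sb.append(buf, 0, n);
            }
            finalOffset = correctOffset(sb.length());
            tokens = new JiebaSegmenter().process(sb.toString(), segMode).iterator();
        }
        if (!tokens.hasNext()) {
            return false;
        }
        SegToken t = tokens.next();
        termAtt.append(t.word);
        offsetAtt.setOffset(correctOffset(t.startOffset), correctOffset(t.endOffset));
        return true;
    }

    @Override
    public void end() throws IOException {
        super.end();
        offsetAtt.setOffset(finalOffset, finalOffset);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        tokens = null;
    }
}
```

Note that this buffers the entire field in memory before segmenting, which is how most jieba-based tokenizers work since jieba operates on whole sentences.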

Build

./gradlew build

Integrating into Solr

Copy the jar into Solr's webapp lib directory: server/solr-webapp/webapp/WEB-INF/lib

Schema changes

    <fieldType name="text_jieba" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="analyzer.solr7.jieba.JiebaTokenizerFactory"  segMode="SEARCH"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ch.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="analyzer.solr7.jieba.JiebaTokenizerFactory"  segMode="SEARCH"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ch.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>
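For the analyzer to take effect, a field must reference the new type. A minimal illustrative field definition (the field name content is an assumption; use your own):

```xml
    <field name="content" type="text_jieba" indexed="true" stored="true"/>
```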
