Learning Lucene: Coding

Environment for this article: Lucene 5.2, JDK 1.7, IKAnalyzer

Add the Lucene dependencies

    <!-- Lucene core -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>5.2.0</version>
    </dependency>
    <!-- Query parser -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>5.2.0</version>
    </dependency>
    <!-- Analyzers -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>5.2.0</version>
    </dependency>

Other dependencies used during development

<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
    
    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.10</version>
    </dependency>

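I. Creating the index

1. Decide where the index is stored

a. Store the index on the local disk

    Path path = Paths.get("E:\\test\\luceneWI");
    FSDirectory dir = FSDirectory.open(path);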

b. Store the index in memory

    Directory directory = new RAMDirectory();

2. Create the analyzer

    // Create the analyzer
    Analyzer al = new StandardAnalyzer();

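Four of Lucene's built-in analyzers are WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer.

WhitespaceAnalyzer: splits text on whitespace only.

SimpleAnalyzer: splits text at non-letter characters and lowercases every token. Digits are discarded, because they are not letters.

StopAnalyzer: behaves like SimpleAnalyzer and additionally removes common stop words (the, a, an, ...).

StandardAnalyzer: the most sophisticated of the core analyzers. It lowercases tokens and removes stop words and punctuation. Older releases also recognized certain kinds of tokens, such as company names, email addresses, and host names. In the 5.x line its tokenizer follows the Unicode text segmentation rules instead, and the older behavior lives on in ClassicAnalyzer.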
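
To make the differences between the first three analyzers concrete, here is a rough plain-JDK approximation of their behavior. It is only a sketch and does not use Lucene: the class name `AnalyzerSketch`, its method names, and the short stop-word list are assumptions made for illustration, and Lucene's real default stop-word list is longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-JDK approximation of three analyzers; illustrative only, not Lucene code. */
class AnalyzerSketch {

    // A tiny, assumed stop-word list; Lucene's real default list is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "a", "an", "and", "of"));

    /** WhitespaceAnalyzer-like: split on whitespace only, keep case and punctuation. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** SimpleAnalyzer-like: split at non-letter characters, then lowercase. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** StopAnalyzer-like: the SimpleAnalyzer-like output minus stop words. */
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : simple(text)) {
            if (!STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

For the input "The Romance of 3 Kingdoms, a Novel", `whitespace` keeps "Kingdoms," with its comma and keeps "3", `simple` drops the digit and the comma and lowercases everything, and `stop` further removes "the", "of", and "a".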

3. Create the IndexWriter, which writes the index files

    // Create the writer configuration
    IndexWriterConfig iwc = new IndexWriterConfig(al);
    // Create the IndexWriter
    IndexWriter iw = new IndexWriter(dir, iwc);

4. Create a document and its fields, extract the content, and store it in the index

    // Create the document
    Document doc = new Document();
    // Create the fields (a field is a key-value pair); Store.YES stores the value in the index
    Field fieldName = new TextField("fieldName", "xs.txt", Store.YES);
    Field fieldContent = new TextField("fieldContent", "san guo yan yi", Store.YES);
    Field fieldsize = new LongField("fieldSize", 10324L, Store.YES);
    Field fieldPath = new TextField("fieldPath", "F:/xs/sg/xs.txt", Store.YES);
    // Add the fields to the document
    doc.add(fieldName);
    doc.add(fieldContent);
    doc.add(fieldsize);
    doc.add(fieldPath);
    // Write the document to the index
    iw.addDocument(doc);

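Three important properties of a Field:

a. Analyzed

    The field value is run through the configured analyzer to produce tokens, and those tokens are what gets indexed. For example, a blog post's title, author, description, and body should all be analyzed and indexed.

b. Indexed

    The tokens produced by analysis, or the whole field value if it is not analyzed, are added to the index. Only indexed fields can be searched.

c. Stored (Store.YES stores the value, Store.NO does not)

    The field value is stored with the document. Only stored fields can be read back from a Document. Large fields are generally not stored.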
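
The interplay of the three properties can be sketched with a toy inverted index written against the plain JDK. This is not how Lucene is implemented; `ToyIndex` and all of its methods are hypothetical names chosen for illustration, and the "analysis" here is just lowercasing and splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy model of analyzed, indexed and stored fields; illustrative only, not Lucene. */
class ToyIndex {
    // term ("field:token") -> ids of the documents containing it (the inverted index)
    private final Map<String, TreeSet<Integer>> postings =
            new HashMap<String, TreeSet<Integer>>();
    // per-document stored values: field name -> original value
    private final List<Map<String, String>> stored =
            new ArrayList<Map<String, String>>();

    /** Starts a new, empty document and returns its id. */
    int newDocument() {
        stored.add(new HashMap<String, String>());
        return stored.size() - 1;
    }

    /** Adds a field; the three flags mirror the three properties described above. */
    void addField(int docId, String name, String value,
                  boolean analyze, boolean index, boolean store) {
        if (index) {
            // Analyzed: one term per token. Not analyzed: the whole value is one term.
            String[] tokens = analyze
                    ? value.toLowerCase(Locale.ROOT).split("\\s+")
                    : new String[] { value };
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue;
                }
                String term = name + ":" + token;
                TreeSet<Integer> ids = postings.get(term);
                if (ids == null) {
                    ids = new TreeSet<Integer>();
                    postings.put(term, ids);
                }
                ids.add(docId);
            }
        }
        if (store) {
            stored.get(docId).put(name, value);
        }
    }

    /** Returns the ids of documents whose indexed field contains exactly this term. */
    List<Integer> search(String name, String token) {
        TreeSet<Integer> ids = postings.get(name + ":" + token);
        return ids == null ? new ArrayList<Integer>() : new ArrayList<Integer>(ids);
    }

    /** Returns the stored value, or null if the field was not stored. */
    String get(int docId, String name) {
        return stored.get(docId).get(name);
    }
}
```

With this model, a field added as analyzed and indexed but not stored can be found by any of its tokens, yet `get` returns null for it. A field that is stored but not indexed can be read back, yet no search will ever match it.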

Common Field types: (the original post showed these in a table image, which is not available here)

5. Commit and release the resources

    // Commit the changes, then close the writer
    iw.commit();
    iw.close();
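
Because IndexWriter implements java.io.Closeable, one option on JDK 1.7 is to let try-with-resources call close() automatically, so the writer is released even when indexing throws. The sketch below shows only that pattern. `ToyWriter` is a hypothetical stand-in written for this example, not Lucene's IndexWriter, and it does not reproduce Lucene's actual commit or close semantics.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an index writer; it only models commit and close. */
class ToyWriter implements Closeable {
    private final List<String> pending = new ArrayList<String>(); // buffered documents
    private final List<String> committed;                          // what a reader would see
    private boolean closed;

    ToyWriter(List<String> committed) {
        this.committed = committed;
    }

    void addDocument(String doc) {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
        pending.add(doc);
    }

    /** Makes all buffered documents visible. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    /** Releases the writer; try-with-resources calls this automatically. */
    @Override
    public void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }

    /** Writes all documents and returns the writer so callers can inspect its state. */
    static ToyWriter writeAll(List<String> target, List<String> docs) {
        ToyWriter writer = new ToyWriter(target);
        try (ToyWriter w = writer) {
            for (String d : docs) {
                w.addDocument(d);
            }
            w.commit();
        } // w.close() runs here, even if addDocument or commit throws
        return writer;
    }
}
```

Applied to the snippet above, the same shape would be `try (IndexWriter iw = new IndexWriter(dir, iwc)) { ... iw.commit(); }`, which removes the need for an explicit iw.close() call.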

Complete code:

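import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class LuceneIndexTest { // class name is illustrative

    @Test
    public void importIndex() throws IOException {
        // Location of the index on disk
        Path path = Paths.get("E:\\test\\luceneWI");
        // Open the index directory
        FSDirectory dir = FSDirectory.open(path);
        // Create the analyzer
        Analyzer al = new StandardAnalyzer();
        // Create the writer configuration
        IndexWriterConfig iwc = new IndexWriterConfig(al);
        // Create the IndexWriter
        IndexWriter iw = new IndexWriter(dir, iwc);
        // Directory that holds the source documents
        File sourceFile = new File("E:\\test\\lucene");
        // List its entries; listFiles() returns null if the directory cannot be read
        File[] files = sourceFile.listFiles();
        if (files == null) {
            files = new File[0];
        }
        for (File file : files) {
            // Skip subdirectories; only regular files are indexed
            if (!file.isFile()) {
                continue;
            }
            // Read the file's attributes and content
            String fileName = file.getName();
            String content = FileUtils.readFileToString(file);
            long size = FileUtils.sizeOf(file);
            String sourcePath = file.getPath();
            // Create the document
            Document doc = new Document();
            // Create the fields (a field is a key-value pair); Store.YES stores the value in the index
            Field fieldName = new TextField("fieldName", fileName, Store.YES);
            Field fieldContent = new TextField("fieldContent", content, Store.YES);
            Field fieldsize = new LongField("fieldSize", size, Store.YES);
            Field fieldPath = new TextField("fieldPath", sourcePath, Store.NO);
            // Add the fields to the document
            doc.add(fieldName);
            doc.add(fieldContent);
            doc.add(fieldsize);
            doc.add(fieldPath);
            // Write the document to the index
            iw.addDocument(doc);
        }
        // Commit the changes, then close the writer
        iw.commit();
        iw.close();
    }
}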
