Lucene indexing with 20 M records takes more time

【Question title】: Lucene Indexing with 20 M Records taking more time
【Posted】: 2015-04-03 20:43:42
【Question】:

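I have the following Lucene code for indexing. When I run it with 1 million records it is fast (indexing finishes within 15 seconds, both locally and on a high-spec server).

When I try to index 20 million records, indexing takes about 10 minutes to complete.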
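For reference, the timings quoted above can be turned into throughput figures. The sketch below is only arithmetic on the numbers in the post (15 seconds for 1 M records, roughly 10 minutes for 20 M); the class and method names are made up for illustration and are not part of the indexing code.

```java
class ThroughputCheck {

    // Documents indexed per second for a run of the given size and duration.
    static double docsPerSecond(long docs, long seconds) {
        return (double) docs / seconds;
    }

    public static void main(String[] args) {
        double small = docsPerSecond(1_000_000L, 15L);         // 1 M records in 15 s
        double large = docsPerSecond(20_000_000L, 10L * 60L);  // 20 M records in about 10 min

        System.out.printf("1 M run:  %.0f docs/s%n", small);
        System.out.printf("20 M run: %.0f docs/s%n", large);
        System.out.printf("ratio:    %.2f%n", large / small);
    }
}
```

On these figures the 20 M run handles about half as many documents per second as the 1 M run, so the slowdown is more than proportional to the data size.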

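I am running these 20 million records on a Linux server with more than 100 GB of RAM. Would setting a larger RAM buffer size help in this case? If so, how large a RAM buffer can I set in my case (I have more than 100 GB of RAM)?

I tried the same 20 million records on my local machine (8 GB RAM) and it took the same 10 minutes. Setting a 1 GB RAM buffer locally also took 10 minutes, and not setting any RAM buffer took 10 minutes as well.

On Linux, without setting the RAM buffer size, the 20 million records took about 8 minutes.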

final File docDir = new File(docsPath.getFile().getAbsolutePath());
LOG.info("Indexing to directory '" + indexPath + "'...");
Directory dir = FSDirectory.open(new File(indexPath.getFile().getAbsolutePath()));
Analyzer analyzer = null;
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, analyzer);
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
iwc.setRAMBufferSizeMB(512.0);
IndexWriter indexWriter = new IndexWriter(dir, iwc);

if (docDir.canRead()) {
    if (docDir.isDirectory()) {
        String[] files = docDir.list();
        if (files != null) {

            for (int i = 0; i < files.length; i++) {
                File file = new File(docDir, files[i]);
                String filePath = file.getPath();
                String delimiter = BatchUtil.getProperty("file.delimiter");
                if (filePath.indexOf("ecid") != -1) {
                    indexEcidFile(indexWriter, file, delimiter);
                } else if (filePath.indexOf("entity") != -1) {
                    indexEntityFile(indexWriter, file, delimiter);
                }
            }
        }
    }
}
indexWriter.forceMerge(2);
indexWriter.close();

And one of the methods used for indexing:

private void indexEntityFile(IndexWriter writer, File file, String delimiter) {

    // try-with-resources closes the reader (and the underlying stream) even when
    // an exception is thrown, and avoids a NullPointerException in a finally block
    // if the file cannot be opened.
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), Charset.forName("UTF-8")))) {

        // One Document and one set of Field instances are reused for every line.
        Document doc = new Document();
        Field four_pk_Field = new StringField("four_pk", "", Field.Store.NO);
        doc.add(four_pk_Field);
        Field cust_grp_cd_Field = new StoredField("cust_grp_cd", "");
        Field cust_grp_mbrp_id_Field = new StoredField("cust_grp_mbrp_id", "");
        doc.add(cust_grp_cd_Field);
        doc.add(cust_grp_mbrp_id_Field);
        String line = null;

        while ((line = br.readLine()) != null) {

            String[] lineTokens = line.split("\\" + delimiter);
            // NOTE: four_pk is not defined in the posted snippet; how it is derived is not shown.
            four_pk_Field.setStringValue(four_pk);
            String cust_grp_cd = lineTokens[4];
            cust_grp_cd_Field.setStringValue(cust_grp_cd);
            String cust_grp_mbrp_id = lineTokens[5];
            cust_grp_mbrp_id_Field.setStringValue(cust_grp_mbrp_id);
            writer.addDocument(doc);
        }
    } catch (FileNotFoundException fnfe) {
        LOG.error("", fnfe);
    } catch (IOException ioe) {
        LOG.error("", ioe);
    }
}

Any ideas?

【Question comments】:

Tags: java linux performance indexing lucene

【Solution 1】:

This happens because you are trying to index all 20 million documents in a single commit (and Lucene has to hold all 20 million documents in memory). To fix it, add

writer.commit()

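in the indexEntityFile method, once every X added documents. X could be 1 million or something similar.

The code could look like this (it only shows the approach; you will need to adapt it to your needs):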

int numberOfDocsInBatch = 0;
...
writer.addDocument(doc);
numberOfDocsInBatch ++;
if (numberOfDocsInBatch == 1_000_000) {
   writer.commit();
   numberOfDocsInBatch = 0;
}
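The fragment above leaves out the surrounding loop, so here is a minimal self-contained sketch of the same counter pattern. It does not use Lucene: `DocSink` is a hypothetical stand-in for the two `IndexWriter` calls involved (`addDocument` and `commit`), and `indexInBatches` is a name invented for this example. The trailing commit for a partially filled last batch is an addition that the fragment above does not show.

```java
import java.util.ArrayList;
import java.util.List;

class BatchCommitSketch {

    /** Hypothetical stand-in for the IndexWriter calls used in the answer. */
    interface DocSink {
        void addDocument(String doc);
        void commit();
    }

    /** Adds every document, committing once per batchSize documents; returns the number of commits. */
    static int indexInBatches(Iterable<String> docs, DocSink sink, int batchSize) {
        int numberOfDocsInBatch = 0;
        int commits = 0;
        for (String doc : docs) {
            sink.addDocument(doc);
            numberOfDocsInBatch++;
            if (numberOfDocsInBatch == batchSize) {
                sink.commit();
                commits++;
                numberOfDocsInBatch = 0;
            }
        }
        // Commit whatever is left in the last, partially filled batch.
        if (numberOfDocsInBatch > 0) {
            sink.commit();
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            docs.add("doc-" + i);
        }
        final int[] added = {0};
        DocSink counting = new DocSink() {
            public void addDocument(String doc) { added[0]++; }
            public void commit() { /* a real sink would flush to the index here */ }
        };
        int commits = indexInBatches(docs, counting, 10);
        System.out.println(added[0] + " documents, " + commits + " commits");
    }
}
```

With 25 documents and a batch size of 10 this prints `25 documents, 3 commits`: two full batches plus one commit for the remaining five documents. Making the batch size a parameter also makes it easy to time different commit intervals against each other.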

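【Comments】:

Thanks @Mysterion. Actually I am working with around 360 million records (two different files), which currently takes about 40 minutes. When I commit once per 10 M records, I save 5 minutes, thank you. But when I try committing once per 5 M records, it takes 40 minutes again.
Does committing too often also slow the process down? Anyway, I will try committing once per 20 M records and let you know the result tomorrow.
You could try every 1 M.
I tried with 1 M and the indexing time was the same (40 minutes), but searching was slow. I don't think frequent commits will help.
After indexing you can optimize the index by merging it into a single segment, which speeds up searching.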