How to configure Solr for improved indexing speed

【Question title】: How to configure Solr for improved indexing speed
【Posted】: 2013-03-18 01:25:23
【Question description】:

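I have a client program that generates between 1 and 50 million Solr documents and adds them to Solr.
I am using ConcurrentUpdateSolrServer to push the documents from the client, 1000 documents per request.
The documents are relatively small (a few small text fields).
I want to improve the indexing speed.
I tried increasing "ramBufferSizeMB" to 1G and "mergeFactor" to 25, but saw no change.
I would like to know whether there are other recommended settings for improving Solr indexing speed.
Any links to relevant material would be appreciated.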

【Question discussion】:

Tags: solr solrj solr4

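【Solution 1】:

It looks like you are doing a bulk import into Solr, so you do not need any of the data to be searchable right away.

First, you can increase the number of documents per request. Since your documents are small, I would even raise it to 100K documents per request or more and see how that performs.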
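The effect of a larger batch size on the number of requests can be sketched in plain Java. This is only an illustration, not code from the answer: `BatchSketch`, `sendInBatches`, and the `sink` callback are hypothetical names, and `sink` stands in for whatever call actually sends a batch to Solr (for example, an `add` call on the SolrJ client).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchSketch {
    /**
     * Sends items to the sink in batches of at most batchSize and
     * returns the number of batches (requests) that were sent.
     * The sink is a hypothetical placeholder for the real "send to Solr" call.
     */
    public static <T> int sendInBatches(Iterable<T> items, int batchSize,
                                        Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int requests = 0;
        List<T> batch = new ArrayList<>(batchSize);
        for (T item : items) {
            batch.add(item);
            if (batch.size() == batchSize) {
                sink.accept(batch);          // one request per full batch
                requests++;
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch);              // final, partially filled batch
            requests++;
        }
        return requests;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 250_000; i++) {
            docs.add(i);
        }
        int small = sendInBatches(docs, 1_000, b -> {});
        int large = sendInBatches(docs, 100_000, b -> {});
        System.out.println(small + " requests vs " + large + " requests");
    }
}
```

Whether fewer, larger requests actually help depends on document size and available memory on both sides, so the batch size is worth measuring rather than assuming.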

Second, you want to reduce the number of commits that happen during bulk indexing. In your solrconfig.xml, look for:

<!-- AutoCommit

     Perform a hard commit automatically under certain conditions.
     Instead of enabling autoCommit, consider using "commitWithin"
     when adding documents.

     http://wiki.apache.org/solr/UpdateXmlMessages

     maxDocs - Maximum number of documents to add since the last
               commit before automatically triggering a new commit.

     maxTime - Maximum amount of time in ms that is allowed to pass
               since a document was added before automatically
               triggering a new commit.

     openSearcher - if false, the commit causes recent index changes
     to be flushed to stable storage, but does not cause a new
     searcher to be opened to make those changes visible.
  -->
 <autoCommit>
   <maxTime>15000</maxTime>
   <openSearcher>false</openSearcher>
 </autoCommit>

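You can disable autoCommit entirely and then call commit once after all documents have been posted. Otherwise, you can adjust the numbers as follows:

The default maxTime is 15 seconds, so an auto-commit happens every 15 seconds whenever there are uncommitted documents. You can set it to a much larger value, for example 3 hours (that is, 3*60*60*1000). You can also add <maxDocs>50000000</maxDocs>, which means an auto-commit only happens after 50 million documents have been added. After you have posted all your documents, call commit once, either manually or from SolrJ. That commit will take a while, but the import as a whole will be much faster.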
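The arithmetic and the "add everything, commit once" flow can be sketched as follows. This is a minimal sketch under stated assumptions: `CommitOnceSketch` and its `Indexer` interface are hypothetical stand-ins for illustration, not SolrJ types; in a real client the `add` and `commit` calls would go to the Solr client object.

```java
import java.time.Duration;
import java.util.List;

public class CommitOnceSketch {
    /** Hypothetical stand-in for a Solr client; not a SolrJ type. */
    public interface Indexer {
        void add(List<String> batch);
        void commit();
    }

    /** Returns the value to put in <maxTime>, in milliseconds, for a number of hours. */
    public static long maxTimeMillis(long hours) {
        return Duration.ofHours(hours).toMillis();
    }

    /** Adds every batch first, then issues exactly one commit at the end. */
    public static void bulkLoad(List<List<String>> batches, Indexer indexer) {
        for (List<String> batch : batches) {
            indexer.add(batch);
        }
        indexer.commit();
    }

    public static void main(String[] args) {
        // Same value as 3 * 60 * 60 * 1000.
        System.out.println(maxTimeMillis(3));
    }
}
```

For 3 hours this gives 10800000, the figure that would go into <maxTime> for the duration of the bulk import.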

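Also, once the bulk import is finished, lower maxTime and maxDocs again so that any incremental posts you make to Solr are committed sooner. Alternatively, use commitWithin, as mentioned in the solrconfig comment.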

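【Discussion】:

If you disable commits entirely, you may run out of memory. Not reopening the searcher is a good idea, though.
Hi, could you tell me how to configure it so that it does not reopen the searcher?
With <openSearcher>false</openSearcher>, no new searcher is opened after an auto-commit happens.
However, if you use replication with <str name="replicateAfter">commit</str>, it is worth mentioning that the slave does not care whether a new searcher was opened on the master. If an auto-commit happens on the master, the index versions on the master and the slave will differ, so the slave will replicate (part of) the index from the master and open a new searcher. This is especially troublesome if you run a full import on the master using the data import handler with clean=true, because that first issues a "delete all" query.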

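【Solution 2】:

In addition to what is written above: if you are running SolrCloud, consider using CloudSolrClient as your SolrJ client. The CloudSolrClient class is ZooKeeper-aware and can connect directly to the shard leader, which speeds up indexing in some cases.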
