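I have been using Lucene for quite a while now, have gotten to know a fair amount of its internals, and have done some optimization work of my own. Looking back, it all still comes down to the optimizations the official documentation recommends, so I am posting the official advice here to share.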

Original URLs:

http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

 

 

Here are some things to try to speed up the searching speed of your Lucene application. Please see ImproveIndexingSpeed for how to speed up indexing.

  • Be sure you really need to speed things up. 

  • Make sure you are using the latest version of Lucene.

  • Use a local filesystem. 

  • Get faster hardware, especially a faster IO system. 

  • Tune the OS

    One tunable that stands out on Linux is swappiness (http://kerneltrap.org/node/3000), which controls how aggressively the OS will swap out RAM used by processes in favor of the IO Cache. Most Linux distros default this to a highish number (meaning, aggressive) but this can easily cause horrible search latency, especially if you are searching a large index with a low query rate. Experiment by turning swappiness down or off entirely (by setting it to 0). Windows also has a checkbox, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache; it is likely doing something similar.

  • Open the IndexReader with readOnly=true. 

  • On non-Windows platforms, use NIOFSDirectory instead of FSDirectory.

    This also removes sources of contention when accessing the underlying files. Unfortunately, due to a longstanding bug on Windows in Sun's JRE (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 -- feel particularly free to go vote for it), NIOFSDirectory gets poor performance on Windows.

  • Add RAM to your hardware and/or increase the heap size for the JVM. 

  • Use one instance of IndexSearcher.

    Share a single IndexSearcher across queries and across threads in your application.

  • When measuring performance, disregard the first query.

    The first query to a searcher pays the price of initializing caches (especially when sorting by fields) and thus will skew your results (assuming you re-use the searcher for many queries). On the other hand, if you re-run the same query again and again, results won't be realistic either, because the operating system will use its cache to speed up IO operations. On Linux (kernel 2.6.16 and later) you can clean the disk cache using sync ; echo 3 > /proc/sys/vm/drop_caches. See http://linux-mm.org/Drop_Caches for details.
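
    The advice above can be sketched as a small timing helper that runs a query a few times without recording anything and only then starts measuring. This is a minimal sketch, not Lucene code: QueryTimer, time and median are hypothetical names, and the Supplier stands in for whatever call runs your real search.

```java
import java.util.Arrays;
import java.util.function.Supplier;

/** Hypothetical timing helper; not part of Lucene. */
final class QueryTimer {
    /**
     * Runs the query warmupRuns times without recording anything, then
     * records the elapsed nanoseconds of each of the measuredRuns runs.
     */
    static long[] time(Supplier<?> query, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            query.get(); // pays for cache initialization; timing is discarded
        }
        long[] nanos = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            query.get();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Upper median of a non-empty sample array; less sensitive to outliers than the mean. */
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload (an assumption); replace with a call into your search code.
        Supplier<Integer> fakeQuery = () -> {
            int h = 0;
            for (int i = 0; i < 100_000; i++) {
                h = 31 * h + i;
            }
            return h;
        };
        long[] samples = time(fakeQuery, 1, 20);
        System.out.println("median ns: " + median(samples));
    }
}
```

    Since repeating one identical query is also unrealistic, a real benchmark would probably have the Supplier cycle through a list of representative queries.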

  • Re-open the IndexSearcher only when necessary.

    You must re-open the IndexSearcher in order to make newly committed changes visible to searching. However, re-opening the searcher has a certain overhead (noticeable mostly with large indexes and with sorting turned on) and should thus be minimized. Consider using a so-called warming technique, which allows the new searcher to warm up its caches before the first query hits it.
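
    One way to picture the warming technique is a holder that opens the new searcher, runs warm-up work against it, and only then publishes it to query threads. The sketch below is an assumption about how an application might arrange this, not a Lucene class: WarmingHolder is a hypothetical name and the type parameter S stands in for the searcher type.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical holder (not a Lucene class). S stands in for whatever
 * searcher type the application uses.
 */
final class WarmingHolder<S> {
    private final AtomicReference<S> current = new AtomicReference<>();
    private final Supplier<S> opener; // opens a fresh searcher
    private final Consumer<S> warmer; // runs representative queries on it

    WarmingHolder(Supplier<S> opener, Consumer<S> warmer) {
        this.opener = opener;
        this.warmer = warmer;
        refresh(); // open and warm the first searcher
    }

    /** Searcher that query threads should use right now. */
    S get() {
        return current.get();
    }

    /**
     * Opens a new searcher, warms it, and only then publishes it.
     * Returns the previous searcher (null on the first call) so the caller
     * can close it once in-flight queries against it have finished.
     */
    S refresh() {
        S fresh = opener.get();
        warmer.accept(fresh); // queries keep hitting the old searcher meanwhile
        return current.getAndSet(fresh);
    }
}
```

    Calling refresh() from a background thread after each commit keeps the warming cost off the query path; deciding when the old searcher is safe to close is left to the application.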

  • Decrease mergeFactor. 

  • Limit usage of stored fields and term vectors. 

  • Use FieldSelector to carefully pick which fields are loaded, and how they are loaded, when you retrieve a document.

  • Don't iterate over more hits than needed.

    Iterating over all hits is slow for two reasons. Firstly, the search() method that returns a Hits object re-executes the search internally when you need more than 100 hits. Solution: use the search method that takes a HitCollector instead. Secondly, the hits will probably be spread over the disk so accessing them all requires much I/O activity. This cannot easily be avoided unless the index is small enough to be loaded into RAM. If you don't need the complete documents but only one (small) field you could also use the FieldCache class to cache that one field and have fast access to it.
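
    To illustrate why a collector avoids the cost of walking every hit, here is a minimal sketch that keeps only the best N hits in a bounded heap while results stream past. It mimics the idea only: TopNCollector and Hit are hypothetical names and this is not Lucene's HitCollector API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical collector; it mimics the idea, not Lucene's HitCollector API. */
final class TopNCollector {
    /** A (document id, score) pair. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final int limit;
    // Min-heap on score: the weakest retained hit sits at the head.
    private final PriorityQueue<Hit> heap;

    TopNCollector(int limit) {
        this.limit = limit;
        this.heap = new PriorityQueue<>(
                Math.max(1, limit), Comparator.comparingDouble((Hit h) -> h.score));
    }

    /** Called once per matching document; keeps at most limit hits in memory. */
    void collect(int doc, float score) {
        if (heap.size() < limit) {
            heap.add(new Hit(doc, score));
        } else if (limit > 0 && score > heap.peek().score) {
            heap.poll(); // evict the weakest hit
            heap.add(new Hit(doc, score));
        }
    }

    /** Retained hits, best score first. */
    List<Hit> topHits() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }
}
```

    Only document ids and scores are kept here, so no stored document has to be read from disk until the caller decides which of the N hits to display.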

  • When using fuzzy queries use a minimum prefix length.

    Fuzzy queries perform CPU-intensive string comparisons - avoid comparing all unique terms with the user input by only examining terms starting with the first "N" characters. This prefix length is a property on both QueryParser and FuzzyQuery - default is zero so ALL terms are compared.
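
    The effect of a prefix length can be sketched with a sorted term set: with a prefix, only the terms in the matching range are compared, instead of every unique term. This is an illustration under simplifying assumptions, not how FuzzyQuery is implemented; PrefixFuzzyDemo and its methods are hypothetical names, and similarity is reduced to a plain edit-distance limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

/** Hypothetical illustration; Lucene's FuzzyQuery is implemented differently. */
final class PrefixFuzzyDemo {
    /** Classic Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Terms sharing the first prefixLength chars of the input (all terms if 0). */
    static NavigableSet<String> candidates(NavigableSet<String> terms, String input, int prefixLength) {
        if (prefixLength <= 0) {
            return terms;
        }
        String prefix = input.substring(0, Math.min(prefixLength, input.length()));
        // Sorted order lets us jump straight to the prefix range
        // (simplification: a term continuing with '\uffff' would be missed).
        return terms.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    /** Runs the expensive comparison only on the candidate terms. */
    static List<String> fuzzyMatches(NavigableSet<String> terms, String input, int prefixLength, int maxEdits) {
        List<String> out = new ArrayList<>();
        for (String term : candidates(terms, input, prefixLength)) {
            if (editDistance(term, input) <= maxEdits) {
                out.add(term);
            }
        }
        return out;
    }
}
```

    The trade-off is visible in the sketch: with a prefix length of 2, a term such as "tucene" can no longer match "lucene", because typos inside the prefix are never examined.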

  • Consider using filters. 

  • Find the bottleneck.

    Complex query analysis or heavy post-processing of results are examples of hidden bottlenecks for searches. Profiling with a tool such as VisualVM helps locate the problem.

 

 

 

How to make indexing faster

Here are some things to try to speed up the indexing speed of your Lucene application.

  • Be sure you really need to speed things up. 

  • Make sure you are using the latest version of Lucene.

  • Use a local filesystem. 

  • Get faster hardware, especially a faster IO system. 

  • Open a single writer and re-use it for the duration of your indexing session.

  • Flush by RAM usage instead of document count.

    For Lucene <= 2.2: call writer.ramSizeInBytes() after every added doc then call flush() when it's using too much RAM. This is especially good if you have small docs or highly variable doc sizes. You need to first set maxBufferedDocs large enough to prevent the writer from flushing based on document count. However, don't set it too large otherwise you may hit LUCENE-845. Somewhere around 2-3X your "typical" flush count should be OK.

    For Lucene >= 2.3: IndexWriter can flush according to RAM usage itself. Call writer.setRAMBufferSizeMB() to set the buffer size. Be sure you don't also have any leftover calls to setMaxBufferedDocs since the writer will flush "either or" (whichever comes first).

  • Use as much RAM as you can afford.

    More RAM before flushing means Lucene writes larger segments to begin with which means less merging later. Testing in LUCENE-843 found that around 48 MB is the sweet spot for that content set, but, your application could have a different sweet spot.

  • Turn off compound file format.

    Call setUseCompoundFile(false). Building the compound file format takes time during indexing (7-33% in testing for LUCENE-888). However, note that doing this will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.

  • Re-use Document and Field instances 

    Note that you cannot re-use a single Field instance within a Document, and, you should not change a Field's value until the Document containing that Field has been added to the index. See Field for details.

  • Always add fields in the same order to your Document, when using stored fields or term vectors

    Lucene's merging has an optimization whereby stored fields and term vectors can be bulk-byte-copied, but the optimization only applies if the field name -> number mapping is the same across segments. Future Lucene versions may attempt to assign the same mapping automatically (see LUCENE-1737), but until then the only way to get the same mapping is to always add the same fields in the same order to each document you index.
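
    A toy model may make the field name -> number mapping concrete. It assumes, for illustration only, that a segment numbers fields in the order it first sees them; FieldNumbering, assign and canBulkCopy are hypothetical names, not Lucene code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of per-segment field numbering; not Lucene code. */
final class FieldNumbering {
    /**
     * Numbers each field name by the order in which it is first seen.
     * Each inner list holds the field names of one document, in add order.
     */
    static Map<String, Integer> assign(List<List<String>> docs) {
        Map<String, Integer> numbers = new LinkedHashMap<>();
        for (List<String> fieldNames : docs) {
            for (String name : fieldNames) {
                numbers.putIfAbsent(name, numbers.size());
            }
        }
        return numbers;
    }

    /** In this model, bulk copy needs both segments to agree on every number. */
    static boolean canBulkCopy(Map<String, Integer> a, Map<String, Integer> b) {
        return a.equals(b);
    }
}
```

    In this model a segment whose first document adds "title" then "body" gets title=0, body=1, while a segment that starts with "body" then "title" gets the opposite numbering, so the two could not be bulk-copied into each other.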

  • Re-use a single Token instance in your analyzer 

  • Use the char[] API in Token instead of the String API to represent token Text

    As of Lucene 2.3, a Token can represent its text as a slice into a char array, which saves the GC cost of new'ing and then reclaiming String instances. By re-using a single Token instance and using the char[] API you can avoid new'ing any objects for each term. See Token for details.
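
    The allocation pattern described above can be sketched without Lucene at all: one token object whose char buffer is refilled for every term, so no String is created per term. ReusableToken and WhitespaceScanner are hypothetical classes written for this illustration; they are not the Token or Tokenizer API.

```java
/** Hypothetical reusable token; modeled on the idea, not on Lucene's Token class. */
final class ReusableToken {
    char[] buffer = new char[16];
    int length; // number of valid chars in buffer

    /** Copies src[start, end) into the buffer, growing it only when needed. */
    void set(char[] src, int start, int end) {
        int len = end - start;
        if (len > buffer.length) {
            buffer = new char[Math.max(len, buffer.length * 2)];
        }
        System.arraycopy(src, start, buffer, 0, len);
        length = len;
    }
}

/** Hypothetical whitespace tokenizer that fills a caller-supplied token. */
final class WhitespaceScanner {
    private final char[] text;
    private int pos;

    WhitespaceScanner(String text) {
        this.text = text.toCharArray();
    }

    /** Fills the caller's token with the next word; returns false at end of input. */
    boolean next(ReusableToken token) {
        while (pos < text.length && Character.isWhitespace(text[pos])) {
            pos++;
        }
        if (pos == text.length) {
            return false;
        }
        int start = pos;
        while (pos < text.length && !Character.isWhitespace(text[pos])) {
            pos++;
        }
        token.set(text, start, pos);
        return true;
    }
}
```

    A caller creates one ReusableToken, loops while next(token) returns true, and reads token.buffer[0..length) on each pass; the buffer is only reallocated when a term is longer than any seen before.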

  • Use autoCommit=false when you open your IndexWriter

    In Lucene 2.3 there are substantial optimizations for Documents that use stored fields and term vectors, to save merging of these very large index files. You should see the best gains by using autoCommit=false for a single long-running session of IndexWriter. Note however that searchers will not see any of the changes flushed by this IndexWriter until it is closed; if that is important you should stick with autoCommit=true instead or periodically close and re-open the writer.

  • Instead of indexing many small text fields, aggregate the text into a single "contents" field and index only that (you can still store the other fields).

  • Increase mergeFactor, but not too much.

    A larger mergeFactor defers merging of segments until later, thus speeding up indexing because merging is a large part of indexing. However, this will slow down searching, and you will run out of file descriptors if you make it too large. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking for the hard drives.
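
    The trade-off can be made tangible with a toy simulation of a logarithmic merge scheme: every flush adds one unit-sized segment, and whenever mergeFactor segments accumulate at a level they are merged into one segment at the next level. This is a simplified model under those assumptions, not Lucene's merge policy; MergeFactorSim and its fields are hypothetical names, and disk seeking is not modeled.

```java
/** Hypothetical simulation of a logarithmic merge scheme; not Lucene's merge policy. */
final class MergeFactorSim {
    final long unitsRewritten; // flushed units copied again by merges (indexing cost)
    final int peakSegments;    // most segments alive after any flush (search and file cost)

    private MergeFactorSim(long unitsRewritten, int peakSegments) {
        this.unitsRewritten = unitsRewritten;
        this.peakSegments = peakSegments;
    }

    static MergeFactorSim run(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        int[] perLevel = new int[64]; // number of segments at each level
        long rewritten = 0;
        int live = 0;
        int peak = 0;
        for (int f = 0; f < flushes; f++) {
            perLevel[0]++;
            live++;
            long segmentSize = 1; // size of one segment at the current level
            // Cascade: a full level is merged into a single segment one level up.
            for (int level = 0; perLevel[level] == mergeFactor; level++) {
                rewritten += segmentSize * mergeFactor;
                perLevel[level] = 0;
                perLevel[level + 1]++;
                live -= mergeFactor - 1;
                segmentSize *= mergeFactor;
            }
            peak = Math.max(peak, live);
        }
        return new MergeFactorSim(rewritten, peak);
    }

    public static void main(String[] args) {
        for (int mf : new int[] {2, 10, 50}) {
            MergeFactorSim r = run(100, mf);
            System.out.println("mergeFactor=" + mf
                    + " unitsRewritten=" + r.unitsRewritten
                    + " peakSegments=" + r.peakSegments);
        }
    }
}
```

    For 100 flushes the model rewrites 552, 200 and 100 units at mergeFactor 2, 10 and 50, while the peak segment count grows from 6 to 18 to 50: less merge work during indexing, but more segments (and open files) for searchers.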

  • Turn off any features you are not in fact using. 

  • Use a faster analyzer.

    Sometimes analysis of a document takes a lot of time. For example, StandardAnalyzer is quite time-consuming, especially in Lucene versions <= 2.2. If you can get by with a simpler analyzer, then try it.

  • Speed up document construction. 

  • Don't optimize... ever.

  • Use multiple threads with one IndexWriter. 

  • Index into separate indices then merge. 

  • Run a Java profiler.

    If all else fails, profile your application to figure out where the time is going. I've had success with a very simple profiler called JMP. There are many others. Often you will be pleasantly surprised to find some silly, unexpected method is taking far too much time.

 
