A brief guide to installing and using Sphinx with Coreseek
1. Introduction
1.1 What is Sphinx
Sphinx is a full-text search engine developed by the Russian developer Andrew Aksyonoff. It aims to provide other applications with fast, low-footprint, high-relevance full-text search. Sphinx integrates easily with SQL databases and scripting languages. It ships with built-in support for MySQL and PostgreSQL data sources, and can also read XML data in a specific format from standard input. By modifying the source code, users can add new data source types (for example, native support for other kinds of DBMS).
1.2 Sphinx features
(1) Fast indexing (peak performance up to 10 MB/sec on modern CPUs);
(2) Fast searching (average query time under 0.1 seconds on 2-4 GB of text data);
(3) Scales to very large data sets (known installations index over 100 GB of text; a single-CPU system can handle 100 million documents);
(4) Good relevance ranking through a combined method based on phrase proximity and statistical (BM25) ranking;
(5) Distributed search support;
(6) Phrase search support;
(7) Document excerpt (snippet) generation;
(8) Can serve as a MySQL storage engine to provide search;
(9) Multiple match modes, including boolean, phrase, and word-similarity queries (see the sketch after this list);
(10) Up to 32 full-text fields per document;
(11) Multiple additional per-document attributes (e.g. group IDs, timestamps);
(12) Word-breaking (tokenization) support;
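As a quick illustration of the match modes in (9), the bundled search CLI exposes them as flags. This is a sketch only: the index name test1 refers to the example index configured later in this guide, the install path assumes the coreseek prefix used below, and the flag names follow the Sphinx manual.

/usr/local/coreseek/bin/search -i test1 -p '全文 检索'    # phrase mode: words must appear adjacent and in order
/usr/local/coreseek/bin/search -i test1 -b '检索 & !英文'  # boolean mode: supports & (AND), | (OR), ! (NOT)
/usr/local/coreseek/bin/search -i test1 -a '中文 检索'    # any mode: match documents containing any query word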
1.3 Chinese word segmentation for Sphinx
Chinese full-text search differs from English and other Latin-script languages: the latter break words on spaces and other delimiters, while Chinese must be segmented by meaning. Most databases, MySQL included, do not yet support Chinese full-text search, so a number of Chinese full-text plugins for MySQL have appeared; hightman's word segmentation is among the better ones. Likewise, for Sphinx to do full-text search on Chinese it needs an add-on; the ones I know of are coreseek and sfc.
2. Installation and configuration example
2.1 Installing on GNU/Linux/Unix systems
2.1.1 Installing Sphinx
Download the packages into /usr/local/src.
tar zxvf sphinx-2.0.8-release.tar.gz
cd sphinx-2.0.8-release
./configure --prefix=/usr/local/sphinx # note: Sphinx enables MySQL support by default
make && make install # any warnings here can be ignored
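If configure fails to locate the MySQL headers and client library (typically because the MySQL development package is not installed), you can point it at them explicitly. The paths below are examples and vary by system:

./configure --prefix=/usr/local/sphinx \
    --with-mysql-includes=/usr/include/mysql \
    --with-mysql-libs=/usr/lib/mysql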
tar xzvf coreseek-3.2.14.tar.gz
cd coreseek-3.2.14
Install the mmseg Chinese word segmenter
cd mmseg-3.2.14
./bootstrap # warnings in the output can be ignored, but any error must be resolved
./configure --prefix=/usr/local/mmseg3
make && make install
cd ..
## Install coreseek
cd csft-3.2.14
sh buildconf.sh # warnings in the output can be ignored, but any error must be resolved
./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql ## if configure complains about MySQL, see the MySQL data source installation notes
make && make install
cd ..
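Before wiring everything together, it is worth confirming that both builds landed where expected. A quick sanity check (exact file lists can vary by version):

ls /usr/local/mmseg3/bin    # should contain the mmseg command-line tool
ls /usr/local/coreseek/bin  # should contain indexer, search and searchd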
Generate and install the segmentation dictionary
/usr/local/mmseg3/bin/mmseg -u /usr/local/mmseg3/etc/unigram.txt
This produces a unigram.txt.uni file alongside the source dictionary.
cp /usr/local/mmseg3/etc/unigram.txt.uni /usr/local/mmseg3/etc/uni.lib
cd /usr/local/coreseek
mkdir dict ### create a directory for the dictionary
cp /usr/local/mmseg3/etc/uni.lib dict/uni.lib ### copy the generated dictionary into dict
sudo vim dict/mmseg.ini #### create the mmseg configuration file
mmseg.ini:
[mmseg]
merge_number_and_ascii=1;
number_and_ascii_joint=-;
compress_space=0;
seperate_number_ascii=1;
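To sanity-check the segmenter and dictionary before involving Sphinx, you can feed mmseg a small UTF-8 file. A sketch: /tmp/sample.txt is a made-up test file, and -d points at the directory holding uni.lib:

echo '中文全文检索测试' > /tmp/sample.txt
/usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc /tmp/sample.txt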
That completes the mmseg setup. The next step is csft.conf, coreseek's configuration file.
2.1.2 Configuration file
cd /usr/local/sphinx/etc # enter the sphinx configuration directory
cp sphinx.conf.dist sphinx.conf # create the Sphinx configuration from the bundled template
sudo gedit sphinx.conf # edit sphinx.conf
2.1.3 Configuration changes
(1) Fill in the database connection settings.
(2) Exceptions: comment out the line, leaving #exceptions = /data/exceptions.txt
(3) Uncomment sql_query_pre = SET NAMES utf8
(4) Add the line: charset_dictpath = /usr/local/mmseg3/etc
(5) Change the charset to: charset_type = zh_cn.utf-8
Note that charset_dictpath and the zh_cn.utf-8 charset type are coreseek extensions, so edits (4) and (5) only take effect in coreseek's csft.conf, not in a vanilla Sphinx build. The five edits are shown together in the sketch below.
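Putting the edits in context, the relevant fragments of the configuration end up looking roughly like this (a sketch; the connection values mirror the data source excerpt in section 3):

source src1
{
    # (1) database connection
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db   = test
    # (3) force utf8 before fetching rows
    sql_query_pre = SET NAMES utf8
}

index test1
{
    # (2) exceptions stay commented out
    #exceptions = /data/exceptions.txt
    # (4) directory containing the uni.lib dictionary
    charset_dictpath = /usr/local/mmseg3/etc
    # (5) utf-8 with Chinese word segmentation
    charset_type = zh_cn.utf-8
}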
2.1.4 Commands
# build the indexes
/usr/local/coreseek/bin/indexer --all
/usr/local/coreseek/bin/search -c /usr/local/coreseek/etc/csft.conf '贾'
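When run without -c, indexer falls back to the csft.conf compiled into the install prefix; passing the configuration path explicitly, as below, avoids surprises when several configuration files exist:

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf --all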
3. Configuration example
Setting up the MySQL data source
vi /usr/local/coreseek/etc/csft.conf
Below is an excerpt from my MySQL data source configuration file:
source src1
{
# data source type. mandatory, no default value
# known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc
type = mysql
#####################################################################
## SQL settings (for 'mysql' and 'pgsql' types)
#####################################################################
# some straightforward parameters for SQL source types
sql_host = localhost
sql_user = root
sql_pass =
sql_db = test
sql_port = 3306 # optional, default is 3306
# UNIX socket name
# optional, default is empty (reuse client library defaults)
# usually '/var/lib/mysql/mysql.sock' on Linux
# usually '/tmp/mysql.sock' on FreeBSD
#
# sql_sock = /tmp/mysql.sock
# MySQL specific client connection flags
# optional, default is 0
#
# mysql_connect_flags = 32 # enable compression
# MySQL specific SSL certificate settings
# optional, defaults are empty
#
# mysql_ssl_cert = /etc/ssl/client-cert.pem
# mysql_ssl_key = /etc/ssl/client-key.pem
# mysql_ssl_ca = /etc/ssl/cacert.pem
# MS SQL specific Windows authentication mode flag
# MUST be in sync with charset_type index-level setting
# optional, default is 0
#
# mssql_winauth = 1 # use currently logged on user credentials
# MS SQL specific Unicode indexing flag
# optional, default is 0 (request SBCS data)
#
# mssql_unicode = 1 # request Unicode data from server
# ODBC specific DSN (data source name)
# mandatory for odbc source type, no default value
#
# odbc_dsn = DBQ=C:\data;DefaultDir=C:\data;Driver={Microsoft Text Driver (*.txt; *.csv)};
# sql_query = SELECT id, data FROM documents.csv
# pre-query, executed before the main fetch query
# multi-value, optional, default is empty list of queries
#
sql_query_pre = SET NAMES utf8
# sql_query_pre = SET SESSION query_cache_type=OFF
# main document fetch query
# mandatory, integer document ID field MUST be the first selected column
sql_query = \
SELECT goods_id, goods_name, goods_color \
FROM goods_test
# range query setup, query that must return min and max ID values
# optional, default is empty
#
# sql_query will need to reference $start and $end boundaries
# if using ranged query:
#
# sql_query = \
# SELECT doc.id, doc.id AS group, doc.title, doc.data \
# FROM documents doc \
# WHERE id>=$start AND id<=$end
#
# sql_query_range = SELECT MIN(id),MAX(id) FROM documents
# range query step
# optional, default is 1024
#
# sql_range_step = 1000
# unsigned integer attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# optional bit size can be specified, default is 32
#
# sql_attr_uint = author_id
# sql_attr_uint = forum_id:9 # 9 bits for forum_id
# sql_attr_uint = group_id # disabled: attribute columns must appear in sql_query's select list, and the goods_test query above does not return group_id
# boolean attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# equivalent to sql_attr_uint with 1-bit size
#
# sql_attr_bool = is_deleted
# bigint attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# declares a signed (unlike uint!) 64-bit attribute
#
# sql_attr_bigint = my_bigint_id
# UNIX timestamp attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# similar to integer, but can also be used in date functions
#
# sql_attr_timestamp = posted_ts
# sql_attr_timestamp = last_edited_ts
# sql_attr_timestamp = date_added # disabled: the goods_test query above does not return date_added
# string ordinal attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# sorts strings (bytewise), and stores their indexes in the sorted list
# sorting by this attr is equivalent to sorting by the original strings
#
# sql_attr_str2ordinal = author_name
# floating point attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# values are stored in single precision, 32-bit IEEE 754 format
#
# sql_attr_float = lat_radians
# sql_attr_float = long_radians
# multi-valued attribute (MVA) attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# MVA values are variable length lists of unsigned 32-bit integers
#
# syntax is ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE [;QUERY] [;RANGE-QUERY]
# ATTR-TYPE is 'uint' or 'timestamp'
# SOURCE-TYPE is 'field', 'query', or 'ranged-query'
# QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs
# RANGE-QUERY is SQL query used to fetch min and max ID values, similar to 'sql_query_range'
#
# sql_attr_multi = uint tag from query; SELECT id, tag FROM tags
# sql_attr_multi = uint tag from ranged-query; \
# SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
# SELECT MIN(id), MAX(id) FROM tags
# post-query, executed on sql_query completion
# optional, default is empty
#
# sql_query_post =
# post-index-query, executed on successful indexing completion
# optional, default is empty
# $maxid expands to max document ID actually fetched from DB
#
# sql_query_post_index = REPLACE INTO counters ( id, val ) \
# VALUES ( 'max_indexed_id', $maxid )
# ranged query throttling, in milliseconds
# optional, default is 0 which means no delay
# enforces given delay before each query step
sql_ranged_throttle = 0
# document info query, ONLY for CLI search (ie. testing and debugging)
# optional, default is empty
# must contain $id macro and must fetch the document by that id
sql_query_info = SELECT * FROM goods_test WHERE goods_id=$id
# kill-list query, fetches the document IDs for kill-list
# k-list will suppress matches from preceding indexes in the same query
# optional, default is empty
#
# sql_query_killlist = SELECT id FROM documents WHERE edited>=@last_reindex
# columns to unpack on indexer side when indexing
# multi-value, optional, default is empty list
#
# unpack_zlib = zlib_column
# unpack_mysqlcompress = compressed_column
# unpack_mysqlcompress = compressed_column_2
# maximum unpacked length allowed in MySQL COMPRESS() unpacker
# optional, default is 16M
#
# unpack_mysqlcompress_maxsize = 16M
#####################################################################
## xmlpipe settings
#####################################################################
# type = xmlpipe
# shell command to invoke xmlpipe stream producer
# mandatory
#
# xmlpipe_command = cat /usr/local/coreseek/var/test.xml
#####################################################################
## xmlpipe2 settings
#####################################################################
# type = xmlpipe2
# xmlpipe_command = cat /usr/local/coreseek/var/test2.xml
# xmlpipe2 field declaration
# multi-value, optional, default is empty
#
# xmlpipe_field = subject
# xmlpipe_field = content
# xmlpipe2 attribute declaration
# multi-value, optional, default is empty
# all xmlpipe_attr_XXX options are fully similar to sql_attr_XXX
#
# xmlpipe_attr_timestamp = published
# xmlpipe_attr_uint = author_id
# perform UTF-8 validation, and filter out incorrect codes
# avoids XML parser choking on non-UTF-8 documents
# optional, default is 0
#
# xmlpipe_fixup_utf8 = 1
}
# inherited source example
#
# all the parameters are copied from the parent source,
# and may then be overridden in this source definition
source src1throttled : src1
{
sql_ranged_throttle = 100
}
#############################################################################
## index definition
#############################################################################
# local index example
#
# this is an index which is stored locally in the filesystem
#
# all indexing-time options (such as morphology and charsets)
# are configured per local index
index test1
{
# document source(s) to index
# multi-value, mandatory
# document IDs must be globally unique across all sources
source = src1
# index files path and file name, without extension
# mandatory, path must be writable, extensions will be auto-appended
path = /usr/local/coreseek/var/data/test1
# document attribute values (docinfo) storage mode
# optional, default is 'extern'
# known values are 'none', 'extern' and 'inline'
docinfo = extern
# memory locking for cached data (.spa and .spi), to prevent swapping
# optional, default is 0 (do not mlock)
# requires searchd to be run from root
mlock = 0
# a list of morphology preprocessors to apply
# optional, default is empty
#
# builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
# 'soundex', and 'metaphone'; additional preprocessors available from
# libstemmer are 'libstemmer_XXX', where XXX is algorithm code
# (see libstemmer_c/libstemmer/modules.txt)
#
# morphology = stem_en, stem_ru, soundex
# morphology = libstemmer_german
# morphology = libstemmer_sv
morphology = none
# minimum word length at which to enable stemming
# optional, default is 1 (stem everything)
#
# min_stemming_len = 1
# stopword files list (space separated)
# optional, default is empty
# contents are plain text, charset_table and stemming are both applied
#
#stopwords = G:\data\stopwords.txt
# wordforms file, in "mapfrom > mapto" plain text format
# optional, default is empty
#
#wordforms = G:\data\wordforms.txt
# tokenizing exceptions file
# optional, default is empty
#
# plain text, case sensitive, space insensitive in map-from part
# one "Map Several Words => ToASingleOne" entry per line
#
#exceptions = /data/exceptions.txt
# minimum indexed word length
# default is 1 (index everything)
min_word_len = 1
charset_dictpath = /usr/local/coreseek/dict/
# charset encoding type
# optional, default is 'sbcs'
# known types are 'sbcs' (Single Byte CharSet) and 'utf-8'
charset_type = zh_cn.utf-8
# charset definition and case folding rules "table"
# optional, default value depends on charset_type
#
# defaults are configured to include English and Russian characters only
# you need to change the table to include additional ones
# this behavior MAY change in future versions
#
# 'sbcs' default value is
# charset_table = 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
#
# 'utf-8' default value is
# charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
# ignored characters list
# optional, default value is empty
#
# ignore_chars = U+00AD
# minimum word prefix length to index
# optional, default is 0 (do not index prefixes)
#
# min_prefix_len = 0
# minimum word infix length to index
# optional, default is 0 (do not index infixes)
#
# min_infix_len = 0
# list of fields to limit prefix/infix indexing to
# optional, default value is empty (index all fields in prefix/infix mode)
#
# prefix_fields = filename
# infix_fields = url, domain
# enable star-syntax (wildcards) when searching prefix/infix indexes
# known values are 0 and 1
# optional, default is 0 (do not use wildcard syntax)
#
# enable_star = 1
# n-gram length to index, for CJK indexing
# only supports 0 and 1 for now, other lengths to be implemented
# optional, default is 0 (disable n-grams)
#
# ngram_len = 1
# n-gram characters list, for CJK indexing
# optional, default is empty
#
# ngram_chars = U+3000..U+2FA1F
# phrase boundary characters list
# optional, default is empty
#
# phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis
# phrase boundary word position increment
# optional, default is 0
#
# phrase_boundary_step = 100
# whether to strip HTML tags from incoming documents
# known values are 0 (do not strip) and 1 (do strip)
# optional, default is 0
html_strip = 0
# what HTML attributes to index if stripping HTML
# optional, default is empty (do not index anything)
#
# html_index_attrs = img=alt,title; a=title;
# what HTML elements contents to strip
# optional, default is empty (do not strip element contents)
#
# html_remove_elements = style, script
# whether to preopen index data files on startup
# optional, default is 0 (do not preopen), searchd-only
#
# preopen = 1
# whether to keep dictionary (.spi) on disk, or cache it in RAM
# optional, default is 0 (cache in RAM), searchd-only
#
# ondisk_dict = 1
# whether to enable in-place inversion (2x less disk, 90-95% speed)
# optional, default is 0 (use separate temporary files), indexer-only
#
# inplace_enable = 1
# in-place fine-tuning options
# optional, defaults are listed below
#
# inplace_hit_gap = 0 # preallocated hitlist gap size
# inplace_docinfo_gap = 0 # preallocated docinfo gap size
# inplace_reloc_factor = 0.1 # relocation buffer size within arena
# inplace_write_factor = 0.1 # write buffer size within arena
# whether to index original keywords along with stemmed versions
# enables "=exactform" operator to work
# optional, default is 0
#
# index_exact_words = 1
# position increment on overshort (less than min_word_len) words
# optional, allowed values are 0 and 1, default is 1
#
# overshort_step = 1
# position increment on stopword
# optional, allowed values are 0 and 1, default is 1
#
# stopword_step = 1
}
# inherited index example
#
# all the parameters are copied from the parent index,
# and may then be overridden in this index definition
index test1stemmed : test1
{
path = /usr/local/coreseek/var/data/test1stemmed
morphology = stem_en
}
# distributed index example
#
# this is a virtual index which can NOT be directly indexed,
# and only contains references to other local and/or remote indexes
index dist1
{
# 'distributed' index type MUST be specified
type = distributed
# local index to be searched
# there can be many local indexes configured
local = test1
local = test1stemmed
# remote agent
# multiple remote agents may be specified
# syntax for TCP connections is 'hostname:port:index1,[index2[,...]]'
# syntax for local UNIX connections is '/path/to/socket:index1,[index2[,...]]'
agent = localhost:9313:remote1
agent = localhost:9314:remote2,remote3
# agent = /var/run/searchd.sock:remote4
# blackhole remote agent, for debugging/testing
# network errors and search results will be ignored
#
# agent_blackhole = testbox:9312:testindex1,testindex2
# remote agent connection timeout, milliseconds
# optional, default is 1000 ms, ie. 1 sec
agent_connect_timeout = 1000
# remote agent query timeout, milliseconds
# optional, default is 3000 ms, ie. 3 sec
agent_query_timeout = 3000
}
#############################################################################
## indexer settings
#############################################################################
indexer
{
# memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
# optional, default is 32M, max is 2047M, recommended is 256M to 1024M
mem_limit = 32M
# maximum IO calls per second (for I/O throttling)
# optional, default is 0 (unlimited)
#
# max_iops = 40
# maximum IO call size, bytes (for I/O throttling)
# optional, default is 0 (unlimited)
#
# max_iosize = 1048576
# maximum xmlpipe2 field length, bytes
# optional, default is 2M
#
# max_xmlpipe2_field = 4M
# write buffer size, bytes
# several (currently up to 4) buffers will be allocated
# write buffers are allocated in addition to mem_limit
# optional, default is 1M
#
# write_buffer = 1M
}
#############################################################################
## searchd settings
#############################################################################
searchd
{
# hostname, port, or hostname:port, or /unix/socket/path to listen on
# multi-value, multiple listen points are allowed
# optional, default is 0.0.0.0:9312 (listen on all interfaces, port 9312)
#
# listen = 127.0.0.1
# listen = 192.168.0.1:9312
# listen = 9312
# listen = /var/run/searchd.sock
# log file, searchd run info is logged here
# optional, default is 'searchd.log'
log = /usr/local/coreseek/var/log/searchd.log
# query log file, all search queries are logged here
# optional, default is empty (do not log queries)
query_log = /usr/local/coreseek/var/log/query.log
# client read timeout, seconds
# optional, default is 5
read_timeout = 5
# request timeout, seconds
# optional, default is 5 minutes
client_timeout = 300
# maximum amount of children to fork (concurrent searches to run)
# optional, default is 0 (unlimited)
max_children = 30
# PID file, searchd process ID file name
# mandatory
pid_file = /usr/local/coreseek/var/log/searchd.pid
# max amount of matches the daemon ever keeps in RAM, per-index
# WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
# default is 1000 (just like Google)
max_matches = 1000
# seamless rotate, prevents rotate stalls if precaching huge datasets
# optional, default is 1
seamless_rotate = 1
# whether to forcibly preopen all indexes on startup
# optional, default is 0 (do not preopen)
preopen_indexes = 0
# whether to unlink .old index copies on successful rotation.
# optional, default is 1 (do unlink)
unlink_old = 1
# attribute updates periodic flush timeout, seconds
# updates will be automatically dumped to disk this frequently
# optional, default is 0 (disable periodic flush)
#
# attr_flush_period = 900
# instance-wide ondisk_dict defaults (per-index value take precedence)
# optional, default is 0 (precache all dictionaries in RAM)
#
# ondisk_dict_default = 1
# MVA updates pool size
# shared between all instances of searchd, disables attr flushes!
# optional, default size is 1M
mva_updates_pool = 1M
# max allowed network packet size
# limits both query packets from clients, and responses from agents
# optional, default size is 8M
max_packet_size = 8M
# crash log path
# searchd will (try to) log crashed query to 'crash_log_path.PID' file
# optional, default is empty (do not create crash logs)
#
# crash_log_path = /usr/local/coreseek/var/log/crash
# max allowed per-query filter count
# optional, default is 256
max_filters = 256
# max allowed per-filter values count
# optional, default is 4096
max_filter_values = 4096
# socket listen queue length
# optional, default is 5
#
# listen_backlog = 5
# per-keyword read buffer size
# optional, default is 256K
#
# read_buffer = 256K
# unhinted read size (currently used when reading hits)
# optional, default is 32K
#
# read_unhinted = 32K
}
a. source defines the data source: fill in your MySQL host, user, password and database as indicated. My MySQL server runs on the same machine (installing MySQL itself is not covered here).
b. sql_query_pre is SQL executed before the main fetch query runs. (Note: coreseek only recognizes the utf8 character set, so we switch the connection to utf8 here.)
c. sql_query is the query whose rows get indexed; sql_attr_uint and sql_attr_timestamp declare attributes, which full-text search can use for filtering and sorting.
d. index and source should appear in pairs: index configures an index over a source (you can also configure multiple indexes, e.g. a main index plus a delta index).
e. searchd is the resident full-text search daemon; by default it listens on port 9312 of the local machine.
f. charset_type and charset_dictpath configure Chinese word segmentation.
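Note that the search CLI used elsewhere in this guide reads the index files directly; for applications to query over the network, the searchd daemon from note e must actually be running. A minimal start/stop sketch:

# start the daemon (listens on port 9312 by default)
/usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/csft.conf
# stop it again later
/usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/csft.conf --stop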
4. Usage
4.1 Testing on the CLI
Build the indexes: /usr/local/coreseek/bin/indexer --all
Search Chinese text: /usr/local/coreseek/bin/search -c /usr/local/coreseek/etc/csft.conf '贾'
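Once searchd is running, rebuild with --rotate so the daemon swaps in the fresh index files without downtime (this pairs with the seamless_rotate = 1 setting in the configuration above):

/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf --all --rotate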
5. Appendix