Sphinx Study Notes (Part 1)

Date: 2019-11-29

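I have recently been in charge of a project that needs full-text search. My requirements are roughly as follows:

1. The data is stored in MySQL

2. Chinese search must be supported

3. The setup should be as simple as possible

I chose Sphinx. As for Solr and Elasticsearch, judging from their home pages they have very good support for distribution, load balancing and so on, but their installation packages are too large and they are rather inconvenient to use, so I set them aside. I may still look into them when I get the chance.

The basic steps are as follows:

1. Installation: the Sphinx home page is http://sphinxsearch.com/ and the current version is 2.2.8. The download page is http://sphinxsearch.com/downloads/release/. There are 32-bit and 64-bit builds for Windows, Debian/Ubuntu and Fedora/CentOS, and you can also download the source code and compile it yourself. I mainly test on Windows and deploy on CentOS; the process is outlined below.

1) Windows 8.1 x64, sphinx 2.2.8 (Win64 binaries w/MySQL+PgSQL+libstemmer+id64 support)

Extract the archive to d:\blue. After extraction, the Sphinx root directory is D:\blue\sphinx-2.2.8-release-win64-full.

Edit the configuration file sphinx-min.conf.in, which is relatively simple: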

#

# Minimal Sphinx configuration sample (clean, simple, functional)

#

# Data source; src1 is its name, which is referenced later

source src1

{

type = mysql

sql_host = localhost

sql_user = test

sql_pass =

sql_db = test

sql_port = 3306 # optional, default is 3306

sql_query = \

SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \

FROM documents

sql_attr_uint = group_id

sql_attr_timestamp = date_added

}

# test1 is the index name, which Sphinx needs when searching; it is roughly equivalent to a table in a relational database

index test1

{

source = src1 # name of the data source to use

path = @CONFDIR@/data/test1

}

index testrt

{

type = rt

rt_mem_limit = 128M

path = @CONFDIR@/data/testrt

rt_field = title

rt_field = content

rt_attr_uint = gid

}

indexer

{

mem_limit = 128M

}

searchd

{

listen = 9312

listen = 9306:mysql41

log = @CONFDIR@/log/searchd.log

query_log = @CONFDIR@/log/query.log

read_timeout = 5

max_children = 30

pid_file = @CONFDIR@/log/searchd.pid

seamless_rotate = 1

preopen_indexes = 1

unlink_old = 1

workers = threads # for RT to work

binlog_path = @CONFDIR@/data

}

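The specific changes are as follows. Under source src1, update the MySQL connection settings: host, user name, password, database and port. sql_query is the SQL statement that extracts the data to index from MySQL. The sql_attr_* entries declare attributes used for grouping and sorting; any column you want to sort or group by has to be declared here. You also need to replace @CONFDIR@ with a directory of your choice. My modified version is as follows: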

# Minimal Sphinx configuration sample (clean, simple, functional)

source src1

{

type = mysql

sql_host = localhost

sql_user = root

sql_pass =

sql_db = sphinx

sql_port = 3306 # optional, default is 3306

sql_query = \

SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \

FROM documents

sql_attr_uint = group_id

sql_attr_timestamp = date_added

sql_query_pre = SET NAMES utf8

}

index test1

{

source = src1

path = D:/blue/sphinx_data/data/test1

ngram_len = 1

ngram_chars = U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6

}

index testrt

{

type = rt

rt_mem_limit = 128M

path = D:/blue/sphinx_data/data/testrt

rt_field = title

rt_field = content

rt_attr_uint = gid

}

indexer

{

mem_limit = 128M

}

searchd

{

listen = 9312

listen = 9306:mysql41

log = D:/blue/sphinx_data/log/searchd.log

query_log = D:/blue/sphinx_data/log/query.log

read_timeout = 5

max_children = 30

pid_file = D:/blue/sphinx_data/log/searchd.pid

seamless_rotate = 1

preopen_indexes = 1

unlink_old = 1

workers = threads # for RT to work

binlog_path = D:/blue/sphinx_data/data

}

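The changes were highlighted in bold in the original post. The ones that need attention are sql_query_pre, ngram_len and ngram_chars: all of them are required for Chinese search, and without them Chinese is not supported. In addition, @CONFDIR@ is replaced with d:\blue\sphinx_data, and two subdirectories, data and log, must be created under that directory by hand. For some reason Sphinx does not create them automatically and fails with an error if they are missing.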
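The two manual steps above (substituting @CONFDIR@ and creating the data and log directories) can be scripted. The following is a minimal Node.js sketch, not part of Sphinx itself: the helper names fillConfDir and ensureSphinxDirs are hypothetical, and the demo writes into a temporary directory rather than the real D:/blue/sphinx_data.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Replace every @CONFDIR@ placeholder with a concrete base directory.
// Backslashes are converted to forward slashes, matching the D:/blue/... style used in the config.
function fillConfDir(template, baseDir) {
  const normalized = baseDir.replace(/\\/g, '/');
  return template.split('@CONFDIR@').join(normalized);
}

// Create <baseDir>/data and <baseDir>/log before running indexer or searchd.
function ensureSphinxDirs(baseDir) {
  const dirs = ['data', 'log'].map((name) => path.join(baseDir, name));
  for (const dir of dirs) {
    fs.mkdirSync(dir, { recursive: true }); // no error if the directory already exists
  }
  return dirs;
}

// Demo against a throwaway directory (an assumption, so nothing real is touched).
const demoBase = fs.mkdtempSync(path.join(os.tmpdir(), 'sphinx_data-'));
const createdDirs = ensureSphinxDirs(demoBase);

const rendered = fillConfDir(
  'path = @CONFDIR@/data/test1\nlog = @CONFDIR@/log/searchd.log',
  'D:\\blue\\sphinx_data'
);
console.log(rendered);
```

In real use, the template would be the contents of sphinx-min.conf.in and the base directory the one chosen above.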

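Next, create a new database named sphinx on the local machine with the utf-8 character set, then run example.sql from D:\blue\sphinx-2.2.8-release-win64-full. Note that the database prefix test. in that file must be changed to sphinx. so that the tables are created in the sphinx database. Afterwards, check that the documents and tags tables exist in the sphinx database.

Then, in D:\blue\sphinx-2.2.8-release-win64-full\bin, run indexer -c ..\sphinx-min.conf.in --all, as follows: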

D:\blue\sphinx-2.2.8-release-win64-full\bin>indexer -c ..\sphinx-min.conf.in --all

Sphinx 2.2.8-id64-release (r4942)

Copyright (c) 2001-2015, Andrew Aksyonoff

Copyright (c) 2008-2015, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '..\sphinx-min.conf.in'...

indexing index 'test1'...

collected 4 docs, 0.0 MB

sorted 0.0 Mhits, 100.0% done

total 4 docs, 33882 bytes

total 0.121 sec, 278900 bytes/sec, 32.92 docs/sec

skipping non-plain index 'testrt'...

total 3 reads, 0.000 sec, 12.0 kb/call avg, 0.0 msec/call avg

total 12 writes, 0.001 sec, 5.7 kb/call avg, 0.1 msec/call avg

Note that if the index being built is already in use, that is, the searchd service is already running, you need to add the --rotate option, like this:

indexer -c ..\sphinx-min.conf.in --all --rotate

Then, in the same directory, run searchd -c ..\sphinx-min.conf.in, as follows:

D:\blue\sphinx-2.2.8-release-win64-full\bin>searchd -c ..\sphinx-min.conf.in

Sphinx 2.2.8-id64-release (r4942)

Copyright (c) 2001-2015, Andrew Aksyonoff

Copyright (c) 2008-2015, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '..\sphinx-min.conf.in'...

listening on all interfaces, port=9312

listening on all interfaces, port=9306

precaching index 'test1'

rotating index 'test1': success

precaching index 'testrt'

precached 2 indexes in 0.045 sec

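There are no errors. Note that the index must be created before the service is started, otherwise errors may occur. searchd can also be installed as a service, which is more convenient in the long run. It is run from the command line here to check whether the configuration really works, because when a system service fails we cannot see the cause of the error.

From the searchd output, or from the searchd section of sphinx-min.conf.in, you can see that Sphinx listens on two ports, 9312 and 9306. Port 9312 is for the Sphinx API and port 9306 is for SphinxQL. SphinxQL is a MySQL-protocol interface, so it can be accessed with a MySQL client.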
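To see at a glance which port speaks which protocol, the listen lines of the config can be parsed. This is a rough sketch that only handles the forms "port", "port:protocol" and "host:port[:protocol]"; the function names are made up for illustration, and treating an unspecified protocol as the native "sphinx" API protocol is based on my reading of the Sphinx documentation.

```javascript
// Parse one searchd "listen" value into { host, port, protocol }.
function parseListen(value) {
  const parts = value.trim().split(':');
  let host = null; // null means "all interfaces"
  if (!/^\d+$/.test(parts[0])) {
    host = parts.shift(); // first part is not a port number, so treat it as a host
  }
  const port = Number(parts.shift());
  const protocol = parts.shift() || 'sphinx'; // native API protocol when none is given
  return { host, port, protocol };
}

// Collect every listen directive found in a config text.
function listListeners(configText) {
  const listeners = [];
  for (const line of configText.split(/\r?\n/)) {
    const m = /^\s*listen\s*=\s*([^#]+)/.exec(line);
    if (m) listeners.push(parseListen(m[1]));
  }
  return listeners;
}

const listeners = listListeners('searchd\n{\nlisten = 9312\nlisten = 9306:mysql41\n}');
console.log(listeners);
```

For the config in this post, this reports port 9312 with the sphinx protocol and port 9306 with mysql41.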

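2. SphinxQL

SphinxQL is a MySQL-protocol interface that lets you run searches with SQL statements. You can use the mysql command-line tool or a graphical MySQL client such as HeidiSQL, which is the one I usually use. Setting up the connection is simple: leave the user name and password empty and just set the host and port. The port is usually 9306. From the command line:

>mysql -h localhost -P9306

Now SphinxQL can be used.

The data in MySQL is as follows: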

id | group_id | group_id2 | date_added | title | content
1 | 1 | 5 | 2015/3/27 16:53 | test one | this is my test document number one. also checking...
2 | 1 | 6 | 2015/3/27 16:53 | test two | this is my test document number two
3 | 2 | 7 | 2015/3/27 16:53 | another doc | this is another group
4 | 2 | 8 | 2015/3/27 16:53 | doc number four | this is to test groups

Run a SphinxQL query:

mysql> select * from test1 where match('my');

+------+----------+------------+

| id | group_id | date_added |

+------+----------+------------+

| 1 | 1 | 1427446411 |

| 2 | 1 | 1427446411 |

+------+----------+------------+

2 rows in set (0.00 sec)

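As you can see, the result does not contain the document text, only the id and the numeric attributes (group_id and date_added). To get the actual data, you have to query MySQL again using the returned ids.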
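The second lookup can be sketched as two small helpers: one builds the follow-up MySQL query for the ids Sphinx returned, the other puts the MySQL rows back into Sphinx's result order. The helper names are hypothetical, and the sample rows are shortened stand-ins. The ? placeholders are the ones the node-mysql package substitutes when the sql and values are passed to connection.query.

```javascript
// Build the follow-up MySQL query for the ids returned by Sphinx.
// Returns null for an empty id list, because "IN ()" is not valid SQL.
function buildRowLookup(ids) {
  if (ids.length === 0) return null;
  const placeholders = ids.map(() => '?').join(', ');
  return {
    sql: `SELECT id, title, content FROM documents WHERE id IN (${placeholders})`,
    values: ids,
  };
}

// MySQL does not promise any row order for IN (...), so restore Sphinx's ranking.
function orderLikeSphinx(ids, rows) {
  const byId = new Map(rows.map((row) => [row.id, row]));
  return ids.map((id) => byId.get(id)).filter(Boolean);
}

// Ids as returned by: select * from test1 where match('my')
const sphinxIds = [1, 2];
const lookup = buildRowLookup(sphinxIds);

// Rows as MySQL might hand them back (sample data, deliberately out of order).
const mysqlRows = [
  { id: 2, title: 'test two' },
  { id: 1, title: 'test one' },
];
const ordered = orderLikeSphinx(sphinxIds, mysqlRows);
console.log(lookup.sql, ordered);
```

With a real connection, lookup.sql and lookup.values would be passed to connection.query, and the resulting rows to orderLikeSphinx.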

Next, change some of the data to Chinese, as follows:

id | group_id | group_id2 | date_added | title | content
1 | 1 | 5 | 2015/3/27 16:53 | test one | this is my test document number one. also checking...
2 | 1 | 6 | 2015/3/27 16:53 | test two | this is my test document number two
3 | 2 | 7 | 2015/3/27 16:53 | another doc | 代码到了必定时间，必须重构，不然会出现问题
4 | 2 | 8 | 2015/3/27 16:53 | doc number four | 重庆制造到了最后阶段了，车体构造已经完成，就等待最后的出厂了

Rebuild the index:

D:\blue\sphinx-2.2.8-release-win64-full\bin>indexer -c ..\sphinx-min.conf.in --all --rotate

Sphinx 2.2.8-id64-release (r4942)

using config file '..\sphinx-min.conf.in'...

indexing index 'test1'...

collected 4 docs, 0.0 MB

sorted 0.0 Mhits, 100.0% done

total 4 docs, 303 bytes

total 0.086 sec, 3518 bytes/sec, 46.44 docs/sec

skipping non-plain index 'testrt'...

total 3 reads, 0.000 sec, 0.4 kb/call avg, 0.0 msec/call avg

total 12 writes, 0.001 sec, 0.2 kb/call avg, 0.0 msec/call avg

rotating indices: successfully sent SIGHUP to searchd (pid=4556).

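On Windows, Chinese queries cannot be run from the mysql command-line client: the console does not send Chinese characters as UTF-8, so the search returns no results. Use a client such as HeidiSQL instead and run the query: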

select * from test1 where match('重构');

"id" "group_id" "date_added"

"3" "2" "1427446411"

"4" "2" "1427446411"

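There is a problem here. Document 4 does not actually contain the word “重构” (refactoring); it only contains the two characters “重” and “构” separately, so this may not meet some requirements. Fortunately, Sphinx's default ranking takes phrase proximity into account, so in theory documents that contain “重构” as a word are ranked first. A simple test confirms this, though I do not know whether it always holds. See also this article: http://rainkid.blog.163.com/blog/static/165140840201010277223611/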
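Why document 4 matches can be reproduced with a toy tokenizer. The sketch below only imitates the effect of ngram_len = 1 (every CJK character becomes its own token) and is not Sphinx's actual tokenizer; for brevity it covers only the basic CJK block U+4E00..U+9FFF rather than the full ngram_chars list.

```javascript
// Split text the way single-character n-grams would: each CJK character is a token,
// while runs of ASCII letters and digits stay together as words.
function tokenize(text) {
  return text.toLowerCase().match(/[\u4e00-\u9fff]|[a-z0-9]+/g) || [];
}

// "Match all": every query token occurs somewhere in the document.
function matchesAllTokens(doc, query) {
  const docTokens = new Set(tokenize(doc));
  return tokenize(query).every((t) => docTokens.has(t));
}

// Stricter phrase check: the query tokens appear consecutively and in order.
function matchesPhrase(doc, query) {
  const d = tokenize(doc);
  const q = tokenize(query);
  if (q.length === 0) return false;
  for (let i = 0; i + q.length <= d.length; i++) {
    if (q.every((t, j) => d[i + j] === t)) return true;
  }
  return false;
}

const doc3 = '代码到了必定时间，必须重构，不然会出现问题';
const doc4 = '重庆制造到了最后阶段了，车体构造已经完成，就等待最后的出厂了';

console.log(matchesAllTokens(doc3, '重构'), matchesAllTokens(doc4, '重构')); // true true
console.log(matchesPhrase(doc3, '重构'), matchesPhrase(doc4, '重构')); // true false
```

Both documents pass the match-all check, but only document 3 passes the phrase check. Sphinx's extended query syntax has a phrase operator (double quotes inside MATCH), which may be worth trying when only the exact word should match; I have not verified how it behaves together with ngram_chars.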

3. Querying Sphinx from Node.js

1) sphinxapi

The project page is https://github.com/lindory-project/node-sphinxapi/tree/master. Install it with: npm install sphinxapi

The documentation is fairly detailed. A simple usage example:

// sphinx2.js

var SphinxClient = require ("sphinxapi"),

util = require('util'),

assert = require('assert');

var cl = new SphinxClient();

cl.SetServer('localhost', 9312);

cl.Query('重构','test1', function(err, result) {

assert.ifError(err);

console.log(util.inspect(result, false, null, true));

});

Run the program with node sphinx2.js. The output is as follows:

{ error: '',

warning: '',

status: [ 0 ],

fields: [ 'title', 'content' ],

attrs:

[ [ 'group_id', 1 ],

[ 'date_added', 2 ] ],

matches:

[ { id: 3,

weight: 2,

attrs: { group_id: 2, date_added: 1427446411 } },

{ id: 4,

weight: 1,

attrs: { group_id: 2, date_added: 1427446411 } } ],

total: 2,

total_found: 2,

time: 0.004,

words:

[ { word: '重', docs: 2, hits: 2 },

{ word: '构', docs: 2, hits: 2 } ] }

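The result is the same as with SphinxQL, except that more information is returned.

2) SphinxQL

SphinxQL uses the MySQL protocol, so in Node.js it only needs a MySQL client package such as node-mysql; the sphinxapi package is not required for this. Install it with npm install mysql

A simple example:

// sphinx.js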

var mysql = require('mysql');

var connection = mysql.createConnection(

{

host : 'localhost',

port : '9306'

}

);

connection.connect();

var queryString = "SELECT * FROM test1 WHERE MATCH('重构')";

connection.query(queryString, function(err, rows, fields) {

if (err) throw err;

for (var i in rows) {

console.log(JSON.stringify(rows[i]));

}

});

connection.end();

Run the program with node sphinx.js. The output is as follows:

{"id":3,"group_id":2,"date_added":1427446411}

{"id":4,"group_id":2,"date_added":1427446411}

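At first glance sphinxapi seems to provide more information; I have not compared the two in detail. SphinxQL does, however, include functions such as weight(), which returns the ranking weight. For example, running SELECT *, weight() FROM test1 WHERE MATCH('重构'); gives the following result: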

"id" "group_id" "date_added" "weight()"

"3" "2" "1427446411" "2557"

"4" "2" "1427446411" "1557"

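So the weight reported by sphinxapi appears to be the SphinxQL value divided by 1000, keeping only the integer part (2557 gives 2, 1557 gives 1). This would be consistent with SphinxQL's default proximity_bm25 ranker, which adds a BM25 score between 0 and 999 to 1000 times the phrase-proximity score.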
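Under the assumption that the SphinxQL weights above come from the proximity_bm25 ranker with default field weights, a weight can be split back into its two parts. The function name splitWeight is made up for this sketch.

```javascript
// Split a proximity_bm25-style weight into its assumed components:
// weight = proximity * 1000 + bm25, with bm25 in the range 0..999.
function splitWeight(weight) {
  return {
    proximity: Math.floor(weight / 1000),
    bm25: weight % 1000,
  };
}

const w3 = splitWeight(2557); // document 3
const w4 = splitWeight(1557); // document 4
console.log(w3, w4);
```

The proximity parts (2 and 1) equal the weights printed by the sphinxapi example, and the BM25 part is the same for both documents in this data set.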

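4. Installing and using Sphinx on CentOS

There is nothing special about using Sphinx on CentOS. The easiest way is to download the RPM package; the steps are as follows: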

$ yum install postgresql-libs unixODBC

$ rpm -Uhv sphinx-2.2.8.rhel6.x86_64.rpm

$ service searchd start

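Usage is the same as on Windows; there is no difference.

5. Miscellaneous

1) The best source of documentation is the official documentation, which is detailed and extensive.

2) If the data needs to be searchable in near real time, you can use a real-time (RT) index. I have not studied this in detail and will look into it when I get the chance.

3) Index merging: if the original data set is large and only a small amount of new data is added, the index can be updated incrementally. The command is: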

indexer --merge DSTINDEX SRCINDEX [--rotate]

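The contents of SRCINDEX are merged into DSTINDEX. If the destination index is in use, the --rotate option is required.

Note that when the same document exists in both indexes, the copy in the original (destination) index is not removed automatically. To achieve that, you can filter the destination index during the merge, for example by keeping only documents whose deleted attribute is 0 (this assumes the index defines such an attribute):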

indexer --merge main delta --merge-dst-range deleted 0 0

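This approach is quite useful in some situations, for example merging the indexes every hour and rebuilding the full index every night. If the data set grows too large, a distributed setup has to be considered; that is a more complex topic that needs separate study.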
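The hourly-merge and nightly-rebuild routine could be driven from a small Node.js script. The sketch below only builds the indexer argument lists from the commands shown above; it assumes the config defines indexes named main and delta (as in the merge example), that the delta index is re-indexed before each merge, and that the full rebuild runs at 02:00, all of which are my own illustrative choices.

```javascript
// Re-index the (assumed) delta index, then merge it into main.
function hourlyJobs(configPath) {
  return [
    ['-c', configPath, 'delta', '--rotate'],
    ['-c', configPath, '--merge', 'main', 'delta', '--merge-dst-range', 'deleted', '0', '0', '--rotate'],
  ];
}

// Rebuild every index from scratch.
function nightlyJobs(configPath) {
  return [['-c', configPath, '--all', '--rotate']];
}

// Pick the job list for a given hour of the day (0-23); 2 is an arbitrary "night" hour.
function jobsForHour(hour, configPath) {
  return hour === 2 ? nightlyJobs(configPath) : hourlyJobs(configPath);
}

const atNight = jobsForHour(2, '../sphinx-min.conf.in');
const atNoon = jobsForHour(12, '../sphinx-min.conf.in');
console.log(atNight, atNoon);
```

Each argument list could then be passed to child_process.execFile together with the path to the indexer binary, one job after another.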

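4) sql_query_pre = SET NAMES utf8

This setting is somewhat odd. I could not find it mentioned in the documentation, but without it the Chinese index cannot be built. In the end I asked in a Sphinx chat group, where “熊熊熊熊” looked at my configuration file and pointed out the problem, which allowed me to keep using Sphinx; otherwise I would have given up on it. Many thanks to “熊熊熊熊”. I wondered whether this was specific to Windows 8, but the setting is needed on Linux as well. I do not know exactly why; a likely explanation is that without it the MySQL connection uses a non-UTF-8 default character set, so the Chinese text does not reach the indexer as UTF-8.

5) Installing as a service (Windows 8.1)

The RPM and DEB packages install the service automatically. On Windows, you need to run the searchd command to install it as a service:

searchd --install -c D:\blue\sphinx-for-chinese-2.2.1-dev-r4311-win32\sphinx-min.conf.in --servicename <service name>

If no service name is given, a service named searchd is created in the Windows service list.

While testing, it is best to run searchd from the command line rather than as a service, because a service produces no visible output, which makes problems hard to diagnose.

To remove the service: sc delete <service name>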

6. Configuring sphinx-for-chinese (Windows 8.1)

The sphinx-for-chinese releases are somewhat dated. The latest ones are:

2013.11.09 sphinx-for-chinese-2.2.1-dev-r4311-win32.zip

2013.11.09 sphinx-for-chinese-2.2.1-dev-r4311.tar.gz

Using it is also fairly simple. The index section of the configuration file needs to be changed as follows:

index test1

{

source = src1

path = D:/blue/sphinx_data/data/test1

docinfo = extern

charset_type = utf-8

chinese_dictionary = D:\blue\sphinx-for-chinese-2.2.1-dev-r4311-win32\xdict

}

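charset_type = utf-8 is deprecated in recent versions because utf-8 is now the default. xdict is a dictionary file.

The dictionary source can be downloaded from https://sphinx-for-chinese.googlecode.com/files/xdict_1.1.tar.gz. Extract it, then generate the dictionary with the mkdict command: bin\mkdict.exe xdict_1.1.txt xdict

sphinx-for-chinese is also convenient to use, but it behaves differently in one respect. Using the same data as above, run the SphinxQL query: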

select *,weight() from test1 where match('重构');

"id" "group_id" "date_added" "weight()"

"3" "2" "1427446411" "1695"

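This time the search finds exactly the document with id 3 and does not return 4. However, searching for the single character “重” returns nothing. That is the difference.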
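Both effects follow naturally from dictionary-based word segmentation. The sketch below uses forward maximum matching with a tiny hand-made dictionary. It illustrates the general idea only: I have not checked which algorithm sphinx-for-chinese actually uses, and the dictionary entries and sample strings are invented for the example.

```javascript
// Forward maximum matching: at each position take the longest dictionary word,
// falling back to a single character when no word matches.
function segment(text, dictionary, maxWordLength = 4) {
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    let matched = text[i];
    for (let len = Math.min(maxWordLength, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len);
      if (dictionary.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    tokens.push(matched);
    i += matched.length;
  }
  return tokens;
}

// Toy dictionary (hypothetical entries, standing in for xdict).
const dict = new Set(['必须', '重构', '重庆', '构造']);

const tokens3 = segment('必须重构', dict); // fragment of document 3
const tokens4 = segment('重庆车体构造', dict); // fragments of document 4
console.log(tokens3, tokens4);
```

With word-level tokens, the fragment of document 4 no longer contains a token “重构”, and neither fragment contains the bare token “重”, which matches the behaviour observed above.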