Source: http://sinykk.iteye.com/
Official site: http://wiki.apache.org/solr
http://wiki.apache.org/solr/DataImportHandler
This document uses Solr 3.4, Tomcat 6.0.33, and IKAnalyzer 3.2.5Stable as its example environment.
Solr is a high-performance full-text search server, written in Java 5 and built on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and ships with a complete administration UI, making it an excellent full-text search engine.
Documents are added to a search collection as XML over HTTP; the collection is queried over HTTP as well, returning an XML/JSON response. Its key features include efficient, flexible caching and vertical search.
Solr is a good fit when:
1. you are searching database data whose primary key is not an integer, e.g. a UUID;
2. you are searching any kind of text document, including RSS, email, and so on.
The overall approach: install the Solr server on Windows or Linux, configure the appropriate index rules, and then call it and run queries from Java, PHP, or another language.
Software to download: Tomcat, Solr, and JDK 1.6, all freely available from their official sites.
In Tomcat's server.xml, add the encoding setting URIEncoding="UTF-8" to the HTTP Connector (without it, Chinese search terms arrive garbled and match nothing).
After the change the Connector looks like this:
<Connector port="8983" protocol="HTTP/1.1" connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8" />
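The effect of URIEncoding can be seen from the client side: a Chinese query term travels as percent-encoded UTF-8 bytes, which Tomcat must decode back into characters. A minimal Python sketch (the host, port, and handler path are illustrative assumptions, not part of this setup):

```python
from urllib.parse import quote, urlencode

# The Chinese term percent-encoded as UTF-8 -- these are the bytes Tomcat
# receives, which is why the Connector needs URIEncoding="UTF-8" to decode them.
term = "中文"
encoded = quote(term)
print(encoded)  # %E4%B8%AD%E6%96%87

# Building a full select URL (host, port, and handler path are assumptions).
url = "http://localhost:8983/solr/select?" + urlencode({"q": term, "wt": "json"})
print(url)
```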
5. Create the Solr home directory d:/solr/home (adjust to your own setup) and copy D:\solr\apache-solr-3.3.0\example\solr into it.
6. Create a solr.home environment variable set to d:/solr/home.
7. Copy solr.war into Tomcat's webapps directory; it is unpacked automatically on startup.
8. Edit D:\resouce\java\tomcat\webapps\solr\WEB-INF\web.xml and add:
<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>d:\solr\home</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>
9. Start Tomcat and open http://localhost:8080/solr/ in a browser.
10. If the page loads, the deployment succeeded.
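As noted above, documents reach Solr as XML over HTTP. A small Python sketch of building the `<add>` message that the standard /update handler accepts (the field names and values here are illustrative):

```python
import xml.etree.ElementTree as ET

def build_add_message(docs):
    """Build the <add> XML body that Solr's /update handler accepts."""
    add = ET.Element("add")
    for doc in docs:
        d = ET.SubElement(add, "doc")
        for name, value in doc.items():
            f = ET.SubElement(d, "field", name=name)
            f.text = str(value)
    return ET.tostring(add, encoding="unicode")

body = build_add_message([{"id": "1", "title": "hello solr"}])
print(body)
# <add><doc><field name="id">1</field><field name="title">hello solr</field></doc></add>
```

This body would then be POSTed to http://localhost:8080/solr/update followed by a `<commit/>`.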
This Linux installation combines the base install with Chinese word-segmentation support.
1. Unpack Tomcat to /usr/local/apache-tomcat-6.0.33/.
2. Copy the /solr/apache-solr-3.3.0/example/solr directory into /usr/local/apache-tomcat-6.0.33/.
3. Then edit Tomcat's /usr/local/apache-tomcat-6.0.33/conf/server.xml to add Chinese (UTF-8) support:
<Connector port="8983" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8"/>
4. Create the file /usr/local/apache-tomcat-6.0.33/conf/Catalina/localhost/solr.xml with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<Context docBase="/usr/local/apache-tomcat-6.0.33/webapps/solr" debug="0" crossContext="true" >
<Environment name="solr/home" type="java.lang.String" value="/usr/local/apache-tomcat-6.0.33/solr" override="true" />
</Context>
5. Copy the /sinykk/solr/apache-solr-3.3.0/example/webapps/solr.war file into /usr/local/apache-tomcat-6.0.33/webapps and start Tomcat.
6. Copy /sinykk/solr/IKAnalyzer3.2.8.jar into /usr/local/apache-tomcat-6.0.33/webapps/solr/WEB-INF/lib.
7. Change /usr/local/apache-tomcat-6.0.33/solr/conf/schema.xml to:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.4">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<!--
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
-->
<fieldType name="textik" class="solr.TextField" >
<analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
<analyzer type="index">
<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" />
</fields>
<uniqueKey>id</uniqueKey>
</schema>
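To make the index/query analyzer chains above concrete, here is a toy Python sketch of the same pipeline shape: tokenizer, then stop filter, then lowercase filter. A whitespace split stands in for IKTokenizerFactory (which really does dictionary-based Chinese segmentation), and the stopword set is a made-up stand-in for stopwords.txt:

```python
# Toy analyzer chain mirroring: tokenizer -> StopFilter -> LowerCaseFilter.
STOPWORDS = {"a", "an", "the"}  # stand-in for stopwords.txt

def tokenize(text):
    # Whitespace split stands in for real (e.g. IK) tokenization.
    return text.split()

def stop_filter(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def analyze(text):
    return lowercase_filter(stop_filter(tokenize(text)))

print(analyze("The Solr Search Server"))  # ['solr', 'search', 'server']
```

The same chain runs at index time and at query time here, which is why indexed terms and query terms end up in the same normalized form and can match.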
Finally, open http://192.168.171.129:8983/solr/admin/analysis.jsp to verify the analyzer.
Indexing a MySQL database with Solr [watch the format]
Reference: http://digitalpbk.com/apachesolr/apache-solr-mysql-sample-data-config
Register the DataImportHandler in solrconfig.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
Then describe the database connection and queries in data-config.xml, placed in the core's conf directory:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/test"
user="root"
password=""/>
<document name="content">
<entity name="node" query="select id,name,title from solrdb">
<field column="id" name="id" />
<field column="name" name="name" />
<field column="title" name="title" />
</entity>
</document>
</dataConfig>
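A full import is triggered with dataimport?command=full-import, and progress can be polled with command=status. A Python sketch of parsing such a status response; the sample XML here is a trimmed, illustrative shape of the response, not captured output from a real run:

```python
import xml.etree.ElementTree as ET

# A trimmed, illustrative example of what
# /solr/dataimport?command=status returns (values are made up).
SAMPLE_STATUS = """<response>
  <str name="status">idle</str>
  <lst name="statusMessages">
    <str name="Total Documents Processed">42</str>
  </lst>
</response>"""

def parse_import_status(xml_text):
    """Extract the handler status and processed-document count, if present."""
    root = ET.fromstring(xml_text)
    status = root.find("./str[@name='status']").text
    processed = root.find(".//str[@name='Total Documents Processed']")
    return status, None if processed is None else int(processed.text)

print(parse_import_status(SAMPLE_STATUS))  # ('idle', 42)
```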
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.4">
<types>
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
</types>
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="string" indexed="true" stored="true"/>
<field name="contents" type="text" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>contents</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<copyField source="title" dest="contents"/>
</schema>
Important fields in schema.xml
Solr needs copyField entries to search across several fields; with the settings below, a search will cover the values of title, name, and contents at once:
<defaultSearchField>contents</defaultSearchField>
copyField copies the value of one field into another: for example, you can copy the content of name into the default search field, so that a Solr search also finds what is in name.
<copyField source="name" dest="contents"/>
<copyField source="title" dest="contents"/>
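The copyField behavior can be sketched in a few lines of Python: at index time, each source field's value is appended to the destination field, which is why a search on contents also hits title and name:

```python
def apply_copy_fields(doc, copy_rules):
    """Simulate Solr copyField: append each source value to the dest field."""
    indexed = dict(doc)
    for source, dest in copy_rules:
        if source in doc:
            indexed[dest] = (indexed.get(dest, "") + " " + str(doc[source])).strip()
    return indexed

# Mirrors: <copyField source="name" .../> and <copyField source="title" .../>
rules = [("name", "contents"), ("title", "contents")]
doc = {"id": "1", "name": "solr", "title": "full-text search"}
print(apply_copy_fields(doc, rules)["contents"])  # 'solr full-text search'
```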
4. Build the index:
http://192.168.171.129:8983/solr/dataimport?command=full-import
Note: make sure the database connection is correct.
Reference: http://wiki.apache.org/solr/CoreAdmin
1. Edit solr.xml in the Solr home directory to define the cores:
<solr persistent="true" sharedLib="lib">
<cores adminPath="/admin/cores">
<core name="core0" instanceDir="core0" dataDir="D:\solr\home\core0\data"/>
<core name="core1" instanceDir="core1" dataDir="D:\solr\home\core1\data" />
</cores>
</solr>
2. Copy the core0 and core1 directories from D:\solr\apache-solr-3.3.0\example\multicore into D:\solr\home, leaving the existing directories and files in D:\solr\home unchanged.
Note: D:\solr\home was copied from D:\solr\apache-solr-3.3.0\example\solr.
3. Create two directories to hold the index data:
D:\solr\home\core0\data
D:\solr\home\core1\data
4. Modify one of the cores, e.g. core1.
Change its solrconfig.xml to the code below.
[Note: the <lib> tags must be added, otherwise DataImportHandler fails with errors; this may be a bug in the official release.]
<?xml version="1.0" encoding="UTF-8" ?>
<config>
<luceneMatchVersion>LUCENE_33</luceneMatchVersion>
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
<lib dir="D:/solr/apache-solr-3.3.0/contrib/extraction/lib" />
<lib dir="D:/solr/apache-solr-3.3.0/dist/" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="D:/solr/apache-solr-3.3.0/dist/" regex="apache-solr-clustering-\d.*\.jar" />
<lib dir="D:/solr/apache-solr-3.3.0/dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />
<lib dir="D:/solr/apache-solr-3.3.0/contrib/clustering/lib/" />
<lib dir="/total/crap/dir/ignored" />
<updateHandler class="solr.DirectUpdateHandler2" />
<requestDispatcher handleSelect="true" >
<requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
</requestDispatcher>
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<admin>
<defaultQuery>solr</defaultQuery>
</admin>
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
</config>
Finally, open http://localhost:8080/solr/core1/admin/.
Searches always run against the index as of its last build, so newly added rows cannot be found until they are indexed; the configuration below sets up incremental (delta) indexing to get close to real-time search.
The approach: use two data sources and two indexes; build a main index over data that rarely or never changes, and a delta index over newly added documents.
The main change is to the data source in data-config.xml:
<dataConfig>
<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/demo" user="root" password=""/>
<document name="products">
<entity name="item" pk="id"
query="SELECT id,title,contents,last_index_time FROM solr_articles"
deltaImportQuery="SELECT id,title,contents,last_index_time FROM solr_articles
WHERE id = '${dataimporter.delta.id}'"
deltaQuery="SELECT id FROM solr_articles
WHERE last_index_time > '${dataimporter.last_index_time}'">
</entity>
</document>
</dataConfig>
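The deltaQuery/deltaImportQuery pair above can be mirrored in plain Python over an in-memory stand-in for the solr_articles table (rows and timestamps are illustrative): deltaQuery finds the ids changed since the last run, and deltaImportQuery re-fetches each changed row by id:

```python
from datetime import datetime

# In-memory stand-in for the solr_articles table (values are illustrative).
rows = [
    {"id": 1, "title": "old doc", "last_index_time": datetime(2011, 1, 1)},
    {"id": 2, "title": "new doc", "last_index_time": datetime(2011, 6, 1)},
]

def delta_query(last_run):
    """Mirrors: SELECT id FROM solr_articles WHERE last_index_time > last_run."""
    return [r["id"] for r in rows if r["last_index_time"] > last_run]

def delta_import_query(row_id):
    """Mirrors: SELECT ... WHERE id = '${dataimporter.delta.id}'."""
    return next(r for r in rows if r["id"] == row_id)

changed = delta_query(datetime(2011, 3, 1))
print([delta_import_query(i)["title"] for i in changed])  # ['new doc']
```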
Pay attention to how the tables are created:
in this example the solr_articles table has a last_index_time (timestamp) column; every insert or update must also refresh last_index_time so that the delta import can pick the change up.
If something goes wrong, check Tomcat's log files immediately.
Run: http://192.168.171.129:8983/solr/dataimport?command=delta-import
If the result is not what you expect, check the date recorded in dataimport.properties and run the combined SQL yourself to track the problem down.
Once the main and delta indexes work, set up two scheduled jobs (Linux crontab):
an incremental job every five minutes, which refreshes the delta index and merges it with the main index, so that all data older than five minutes is searchable;
a main-index rebuild every day at 2 a.m., which also clears the delta index, keeping the main index efficient and reducing duplicate data.
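The two jobs above might look like this in a crontab (the curl command, host, port, and clean=true flag are assumptions for this sketch; adjust paths and URLs to your setup):

```shell
# Delta import every 5 minutes (host and port are assumptions)
*/5 * * * * curl -s "http://localhost:8983/solr/dataimport?command=delta-import" > /dev/null
# Full rebuild of the main index at 02:00 every day, cleaning the index first
0 2 * * * curl -s "http://localhost:8983/solr/dataimport?command=full-import&clean=true" > /dev/null
```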
Solr's distribution is really replication, a concept much like MySQL replication: all index changes happen on the master, all queries go to the slaves, and the slaves continually (on a schedule) pull from the master to stay in sync.
Reference: http://chenlb.blogjava.net/archive/2008/07/04/212398.html
To get more accurate search results you can:
1. build your own segmentation dictionary;
2. update the index through document add/update/delete calls whenever the data changes;
3. use a delta index refreshed on a schedule.
See the Linux Solr installation section of this document.
Querying the Solr index from PHP
Reference: http://code.google.com/p/solr-php-client/
A simple example: http://code.google.com/p/solr-php-client/wiki/ExampleUsage
Note: this is similar to the Sphinx search engine, which is written in C++.
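The PHP client ultimately just issues an HTTP select request and decodes the response. The same round trip sketched in Python, with a trimmed, illustrative wt=json response body standing in for a live server:

```python
import json
from urllib.parse import urlencode

# The select URL a client library builds under the hood
# (host, port, and query values are illustrative).
url = "http://localhost:8983/solr/select?" + urlencode(
    {"q": "contents:solr", "wt": "json", "rows": 10})

# A trimmed, made-up example of the JSON body Solr returns for wt=json.
SAMPLE_RESPONSE = """{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {"numFound": 1, "start": 0,
               "docs": [{"id": "1", "title": "hello solr"}]}
}"""

data = json.loads(SAMPLE_RESPONSE)
docs = data["response"]["docs"]
print(data["response"]["numFound"], docs[0]["title"])  # 1 hello solr
```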