Analysis of Solr schema.xml

Date: 2019-12-05


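1. Field configuration (schema)

schema.xml is located in the solr/conf/ directory and plays a role similar to a database table definition file.

It defines the data types of the data added to the index, and mainly consists of types, fields, and some other default settings.

(1) First, the types node. It defines fieldType child nodes, with parameters such as name, class, and positionIncrementGap.

name: the name of this fieldType.

class: the class that implements this type and defines its behavior. Class names starting with solr are shorthand for classes in Solr's own packages (the schema.xml comments mention org.apache.solr.analysis; field type classes such as solr.StrField are in org.apache.solr.schema).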

<schema name="example" version="1.2">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="binary" class="solr.BinaryField"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true"
               positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true"
               positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true"
               positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true"
               positionIncrementGap="0"/>
    ...
  </types>
  ...
</schema>

When necessary, a fieldType also defines the analyzer used for this type of data at index time and at query time, covering both tokenization and filtering, as follows:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- This tokenizer splits on whitespace. When a value of type text is added
         to the index, Solr first tokenizes it on whitespace, then passes the
         tokens through the listed filters in order; only what remains is added
         to the index for querying.
         Note: whitespace tokenization does not segment Chinese text; a Chinese
         tokenizer has to be added separately. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
            expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
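
For Chinese text, one possible starting point is a bigram-based field type. The sketch below assumes Solr 3.6 or later, where solr.CJKWidthFilterFactory and solr.CJKBigramFilterFactory are available; the type name text_cjk is only an illustrative choice. Bigram indexing is a simple fallback, not dictionary-based word segmentation.

```xml
<!-- Sketch only: assumes a Solr release that ships the CJK filter factories. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize full-width and half-width character forms -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index overlapping two-character tokens for CJK runs -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer is used at index time and query time here, a query string is split into the same bigrams that were indexed.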

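(2) Next, the fields node defines the concrete fields (similar to database columns). Each field has the following attributes:

name: the field name
type: one of the fieldTypes defined earlier
indexed: whether the field is indexed
stored: whether the field value is stored (if the value never needs to be returned, set this to false where possible)
multiValued: whether the field can hold multiple values (set this to true for any field that may be multivalued, to avoid errors at index time)

The _version_ and _root_ fields must be kept and cannot be deleted.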

<fields>
    <field name="id" type="integer" indexed="true" stored="true" required="true" />
    <field name="name" type="text" indexed="true" stored="true" />
    <field name="summary" type="text" indexed="true" stored="true" />
    <field name="author" type="string" indexed="true" stored="true" />
    <field name="date" type="date" indexed="false" stored="true" />
    <field name="content" type="text" indexed="true" stored="false" />
    <field name="keywords" type="keyword_text" indexed="true" stored="false" multiValued="true" />
    <!-- copy field -->
    <field name="all" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>
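
The fields block above does not show the two reserved fields. A minimal sketch of how they are commonly declared, assuming the long and string field types defined earlier; the exact attribute values differ between Solr releases, so treat them as illustrative:

```xml
<!-- Sketch only: reserved fields, declared inside <fields>. Attribute values are assumptions. -->
<!-- _version_ is used internally for update versioning -->
<field name="_version_" type="long" indexed="true" stored="true"/>
<!-- _root_ links nested (child) documents to their root document -->
<field name="_root_" type="string" indexed="true" stored="false"/>
```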

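(3) It is advisable to create a copy field and copy all full-text fields into it, so that they can be searched together.

The copy settings are as follows:

<copyField source="name" dest="all"/>
<copyField source="summary" dest="all"/>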

(4) Dynamic fields: fields without a fixed name are declared with dynamicField.

For example, if name is *_i and type is int, then any field whose name ends in _i is treated as matching this definition, such as name_i and school_i.

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
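
To illustrate, here is a sketch of an XML update message that relies on the two dynamic field patterns above. The field names school_i and nickname_s and all values are hypothetical; neither field is declared explicitly in the schema.

```xml
<!-- Sketch only: field names and values are made up for illustration. -->
<add>
  <doc>
    <field name="id">1</field>
    <!-- matches *_i, so it is handled as an int -->
    <field name="school_i">42</field>
    <!-- matches *_s, so it is handled as a string -->
    <field name="nickname_s">solr</field>
  </doc>
</add>
```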

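Information from the comments in schema.xml:

(1) To improve performance, the following measures can be taken:

Set stored to false on every field that is only searched and never returned in results (especially large fields).
Set indexed to false on every field that is only returned in results and never searched.
Remove all unnecessary copyField declarations.
For the smallest index size and the best search performance, set indexed to false on all general text fields, use copyField to copy them into one catch-all text field, and search on that field.
For maximum indexing performance, talk to Solr from a Java client that streams updates.
Run the JVM in server mode, and use a higher log level so that less is written to the log.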
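
The first two tips can be sketched as field declarations. The field names body and thumbnail_url are hypothetical, and the text and string types are assumed to be the ones defined earlier.

```xml
<!-- Sketch only: field names are hypothetical. -->
<!-- searched but never returned: index it, do not store it -->
<field name="body" type="text" indexed="true" stored="false"/>
<!-- returned but never searched: store it, do not index it -->
<field name="thumbnail_url" type="string" indexed="false" stored="true"/>
```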

(2) <schema name="example" version="1.2">

name: the name that identifies this schema
version: 1.2 at the time of writing

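(3) fieldType

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />

name: just an identifier.
class and the other attributes determine the actual behavior of the fieldType. (Per the schema.xml comments, class names starting with solr refer to classes in the org.apache.solr.analysis package; field type classes such as solr.StrField are in org.apache.solr.schema.)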

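Optional attributes:

sortMissingLast and sortMissingFirst apply to types that sort internally as strings (including string, boolean, sint, slong, sfloat, sdouble, pdate).
sortMissingLast="true": documents without this field sort after documents that have it, regardless of the requested sort order.
sortMissingFirst="true": the reverse; documents without this field sort first.
Both default to false.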
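
A small sketch of how this might look in practice. The type name string_missing_last and the field name author_sort are hypothetical.

```xml
<!-- Sketch only: type and field names are hypothetical. -->
<fieldType name="string_missing_last" class="solr.StrField"
           sortMissingLast="true" omitNorms="true"/>
<field name="author_sort" type="string_missing_last" indexed="true" stored="false"/>
```

With this definition, a request sorted on author_sort in either ascending or descending order places documents that have no author_sort value at the end of the result list.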

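The StrField type is not analyzed; its values are indexed/stored verbatim.

StrField and TextField both accept an optional compressThreshold attribute, which limits compression (when enabled on a field) to values that exceed a given size (in characters).

solr.TextField lets users customize indexing and querying through analyzers. An analyzer consists of one tokenizer and any number of filters.

positionIncrementGap: optional; the position gap inserted between the multiple values of a field of this type within one document, to prevent false phrase matches across values.
positionIncrementGap=100 is only meaningful for fields with multiValued="true".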
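
The effect is easiest to see with a multivalued field. The sketch below assumes the text_ws type defined earlier (positionIncrementGap="100"); the field name authors and the values are hypothetical.

```xml
<!-- Sketch only. Schema side: a multivalued field using text_ws. -->
<field name="authors" type="text_ws" indexed="true" stored="true" multiValued="true"/>

<!-- Update side: one document with two values for the same field. -->
<add>
  <doc>
    <field name="id">1</field>
    <field name="authors">John Smith</field>
    <field name="authors">Jane Doe</field>
  </doc>
</add>
```

Without a gap, Smith and Jane would occupy adjacent positions, so the phrase query "Smith Jane" could match even though the words come from different values. With a gap of 100, the second value starts roughly 100 positions after the first, so that phrase no longer matches unless the query allows a comparably large slop.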

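<tokenizer class="solr.WhitespaceTokenizerFactory" />

Splits on whitespace only; tokens are matched exactly.

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />

When tokenizing and matching, this filter takes hyphens, letter/digit boundaries, and non-alphanumeric characters into account, so that "wifi" or "wi fi" can match "Wi-Fi".

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />

Synonyms.

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

After stop words are removed, a position gap is left where they were, which makes phrase queries more accurate.

stopword: a word that is ignored during analysis (both when indexing and when searching), such as the common words is and this. Stop words are maintained in conf/stopwords.txt.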

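(4) fields

<field name="id" type="string" indexed="true" stored="true" required="true" />

name: just an identifier.
type: a previously defined type.
indexed: whether the field is indexed (this affects searching and sorting)
stored: whether the field is stored
compressed: [false]; whether to use gzip compression (only TextField and StrField can be compressed)
multiValued: whether the field contains multiple values
omitNorms: whether to omit norms, which saves memory; only full-text fields and fields that need an index-time boost need norms. (I did not fully understand this one; the comments seem contradictory.)
termVectors: [false]; when set to true, term vectors are stored. When using MoreLikeThis, the fields used for similarity should be stored.
termPositions: stores position information with the term vector; this increases storage cost.
termOffsets: stores offset information with the term vector; this increases storage cost.
default: the value to use if none is specified when a document is added.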
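
A sketch combining several of these attributes. The field names features and popularity and the default value are chosen for illustration, and the text and int types are assumed to be the ones defined earlier.

```xml
<!-- Sketch only: field names and the default value are illustrative. -->
<!-- a full-text field intended for MoreLikeThis: store term vectors with positions and offsets -->
<field name="features" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
<!-- a numeric field that falls back to 0 when a document omits it -->
<field name="popularity" type="int" indexed="true" stored="true" default="0"/>
```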

<field name="text" type="text" indexed="true" stored="false" multiValued="true" />

A catch-all field (a slight exaggeration) that contains all searchable text fields, implemented via copyField.

<copyField source="cat" dest="text" />
<copyField source="name" dest="text" />
<copyField source="manu" dest="text" />
<copyField source="features" dest="text" />
<copyField source="includes" dest="text" />

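When documents are indexed, the data in each source field (such as cat) is copied into the text field.

Purpose:

Putting the data of several fields together lets them be searched at once, which improves speed.
Copying one field's data into another lets the same data be indexed in two different ways.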

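<dynamicField name="*_i" type="int" indexed="true" stored="true" />

If a field name does not match any explicitly declared field, Solr tries to match it against the dynamic field patterns.

"*" may appear only at the beginning or the end of a pattern.
Longer patterns are tried first.
If two patterns of equal length both match, the one defined first wins.

<dynamicField name="*" type="ignored" multiValued="true" />

If none of the patterns above matches, this catch-all pattern can be defined together with a type for it, which is handled as a string. (This normally does not happen.)

If it is not defined, a field name that matches nothing raises an error.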
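
The ignored type referenced by the catch-all pattern has to be declared as well. A minimal sketch, assuming it is modeled on a string type that is neither indexed nor stored, so unknown fields are accepted and then discarded:

```xml
<!-- Sketch only: a type whose values are neither indexed nor stored. -->
<fieldType name="ignored" class="solr.StrField"
           indexed="false" stored="false" multiValued="true"/>
<!-- any field name that matched nothing else falls through to this pattern -->
<dynamicField name="*" type="ignored" multiValued="true"/>
```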

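(5) Other tags

uniqueKey: the unique identifier of a document. This field must be supplied (unless the field is marked required="false"); otherwise Solr reports an error when indexing.

defaultSearchField: the default field, used when the search parameters do not name a specific field.

<solrQueryParser defaultOperator="OR" />

Sets the logical operator applied between the terms of a query; it can be either AND or OR.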
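
Put together, these tags might look like the sketch below in an older schema.xml. The values id and text are assumptions based on the fields shown earlier. Later Solr releases deprecate defaultSearchField and solrQueryParser in favor of the df and q.op request parameters, so check the release in use.

```xml
<!-- Sketch only: element values are assumptions based on the fields shown earlier. -->
<!-- the field that uniquely identifies a document -->
<uniqueKey>id</uniqueKey>
<!-- the field searched when the query names no field -->
<defaultSearchField>text</defaultSearchField>
<!-- require all query terms to match unless the query says otherwise -->
<solrQueryParser defaultOperator="AND"/>
```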