lucene_01: Getting Started

Date: 2019-12-11

Tags: lucene, getting started

Indexing and search flow diagram:

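1. Green marks the indexing process: the original content to be searched is indexed to build an index library. The indexing process consists of:
determine the original content (the content to be searched) -> collect documents -> create documents -> analyze documents -> index documents
2. Red marks the search process, which searches the index library.
The search process consists of:
the user enters a query in the search interface -> create the query -> execute the search against the index library -> render the search results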

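Both indexing and searching operate on the same object: the index library.

The index library contains two parts: the index and the original documents.

Original documents: the content to be indexed and searched. Original content includes web pages on the Internet, data in databases, files on disk, and so on.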

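Creating document objects

The original content is collected so that it can be indexed. Before indexing, the original content must be turned into documents (Document).
A document contains a number of fields (Field), and the fields hold the content.
Here we can treat one file on disk as one Document. The Document contains several Fields:
file_name (file name), file_path (file path), file_size (file size), and file_content (file content).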

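Note:

Each document can have multiple Fields.

Different documents can have different Fields. A database table cannot do this: if each row is viewed as a document and each column as a Field, every row has the same fixed set of columns.

A single document can contain identical Fields (same field name and same field value). A database table cannot have duplicate columns.

Each document has a unique number, the document id. Unlike a database id, this id is not a field (the counterpart of a database column). It cannot be manipulated and is maintained by the system.

Fields, by contrast, are what we can operate on.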
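
The notes above can be illustrated with a small model written in plain Java. This is only a sketch: SimpleDocument and SimpleField are hypothetical names made up for this example, not Lucene classes. It shows that a document is an ordered list of (name, value) fields, so the same field may appear twice and two documents need not share the same set of fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a document. These are NOT Lucene's classes.
class SimpleDocument {
    // A field is just a (name, value) pair.
    static final class SimpleField {
        final String name;
        final String value;

        SimpleField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // Fields live in a list, so the same name (and value) can be added more than once.
    private final List<SimpleField> fields = new ArrayList<>();

    SimpleDocument add(String name, String value) {
        fields.add(new SimpleField(name, value));
        return this;
    }

    // Returns every value stored under the given field name, in insertion order.
    List<String> getValues(String name) {
        List<String> values = new ArrayList<>();
        for (SimpleField field : fields) {
            if (field.name.equals(name)) {
                values.add(field.value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        SimpleDocument a = new SimpleDocument()
                .add("file_name", "a.txt")
                .add("file_size", "120")
                .add("tag", "lucene")
                .add("tag", "lucene");      // identical field added twice: allowed
        SimpleDocument b = new SimpleDocument()
                .add("file_name", "b.txt"); // a different set of fields than document a

        System.out.println(a.getValues("tag"));       // prints [lucene, lucene]
        System.out.println(b.getValues("file_size")); // prints []
    }
}
```

A database row could not express either case: every row has exactly one value per column, and every row has the same columns.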

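Analyzing documents

After the original content has been turned into documents (Document) containing fields (Field), the content of the fields must be analyzed.
Analysis extracts words from the original document, converts letters to lowercase, removes punctuation, removes stop words, and so on,
producing the final tokens. A token can be thought of as a single word.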

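For example, take the following document through analysis.
Original document content:
Lucene is a Java full-text search engine. Lucene is not a complete
application, but rather a code library and API that can easily be used
to add search capabilities to applications.

Tokens obtained after analysis:
lucene, java, full, search, engine ...
Each word is called a Term. The same word extracted from different fields is a different term.
A term has two parts: the name of the field and the text of the word.
For example, apache in the file name and apache in the file content are different terms.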
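
The analysis steps described above can be sketched in plain Java. This is not Lucene's StandardAnalyzer, only a rough imitation of the same idea. The class name AnalysisSketch, the stop-word list, and the "field:word" string form used for a term are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class AnalysisSketch {
    // Assumed stop-word list for this example only.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "be", "but", "can", "is", "not", "that", "the", "to"));

    // Splits text into words, lowercases them, and drops punctuation and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Splitting on anything that is not a letter or digit also removes punctuation.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A term is identified by its field name together with the word text.
    static String term(String field, String word) {
        return field + ":" + word;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a Java full-text search engine."));
        // prints [lucene, java, full, text, search, engine]

        // The same word in two different fields yields two different terms.
        System.out.println(term("fileName", "apache"));    // prints fileName:apache
        System.out.println(term("fileContent", "apache")); // prints fileContent:apache
    }
}
```

Because this sketch splits at the hyphen, "full-text" becomes the two tokens full and text.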

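Building the index

The tokens obtained by analyzing all documents are indexed. The purpose of the index is search: in the end, searching only the indexed tokens is enough to find the documents (Document) that contain them.

Note: the index is built over tokens, and documents are found through words. This index structure is called an inverted index.
The traditional approach locates a file first and then matches the search keyword against the file content. That is a sequential scan,
which is slow when the data volume is large.

An inverted index works the other way round: it finds documents through their content.

An inverted index is also called a reverse index. It consists of two parts, the index and the documents. The index is the vocabulary (term dictionary), which is relatively small, while the document collection is relatively large.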
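
As a rough illustration of the structure just described, the following plain-Java sketch maps each word to the sorted set of ids of the documents that contain it. The class name InvertedIndexSketch and the sample documents are made up for this example, and real Lucene indexes are far more elaborate than a HashMap.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical, minimal inverted index: word -> ids of the documents containing it.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Indexing: split the content into lowercase words and record the document id under each word.
    void addDocument(int docId, String content) {
        for (String token : content.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, key -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Searching: a single dictionary lookup, with no scan over the document contents.
    SortedSet<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "Lucene is a Java search library");
        index.addDocument(1, "Java is a programming language");
        index.addDocument(2, "Search engines use an inverted index");

        System.out.println(index.search("java"));   // prints [0, 1]
        System.out.println(index.search("search")); // prints [0, 2]
        System.out.println(index.search("python")); // prints []
    }
}
```

A sequential scan would have to read all three documents for every query, whereas here the cost of a lookup does not depend on how large the documents are.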

Getting-started code

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.chen</groupId>
  <artifactId>lucene</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>lucene</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <!-- JUnit 4 provides the @Test annotation used by the test methods; declared once -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>7.2.1</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>7.2.1</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>7.2.1</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>2.6</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.0</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

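Creating the index

Field classes

Note: LongField is no longer available in this version (Lucene 7.2.1). Numeric values are indexed with LongPoint instead, and a value that should be returned with search results is stored with StoredField.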

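import java.io.File;
import java.nio.file.Paths;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

// The class name is illustrative; the original post shows only the test method.
public class CreateIndexTest {

    @Test
    public void createIndex() throws Exception {
        // Step 1: create a Java project and add the jar dependencies.
        // Step 2: create an IndexWriter.
        //   1) The Directory specifies where the index library is stored.
        //   2) The analyzer specifies how document content is analyzed.
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (Directory directory = FSDirectory.open(Paths.get("F:\\lucene\\indexDatabase"));
             IndexWriter indexWriter = new IndexWriter(directory, config)) {

            File fileDir = new File("F:\\lucene\\document");
            // listFiles() returns null if the path does not exist or is not a directory.
            File[] files = fileDir.listFiles();
            if (files != null) {
                for (File file : files) {
                    if (!file.isFile()) {
                        continue; // skip subdirectories, which cannot be read as text
                    }
                    // Step 3: create a Document.
                    Document document = new Document();

                    // Step 4: create the Fields and add them to the Document.
                    // File name: TextField(name, value, whether to store the value).
                    String fileName = file.getName();
                    TextField fileNameField = new TextField("fileName", fileName, Field.Store.YES);
                    // File size: the doc-values field supports sorting but is not stored,
                    // so a StoredField is added as well to make the value retrievable.
                    long fileSize = FileUtils.sizeOf(file);
                    SortedNumericDocValuesField fileSizeField = new SortedNumericDocValuesField("fileSize", fileSize);
                    StoredField fileSizeStored = new StoredField("fileSize", fileSize);
                    // File path: stored only, not analyzed or indexed.
                    String filePath = file.getPath();
                    StoredField filePathField = new StoredField("filePath", filePath);
                    // File content: the source files are read as GBK-encoded text.
                    String fileContent = FileUtils.readFileToString(file, "gbk");
                    TextField fileContentField = new TextField("fileContent", fileContent, Field.Store.YES);

                    document.add(fileNameField);
                    document.add(fileSizeField);
                    document.add(fileSizeStored);
                    document.add(filePathField);
                    document.add(fileContentField);

                    // Step 5: write the Document to the index library through the IndexWriter.
                    // This builds the index and writes both the index and the document.
                    indexWriter.addDocument(document);
                }
            }
            // Step 6: the IndexWriter is closed automatically at the end of the try block.
        }
    }
}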

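Querying the index

The search process:
Following the query syntax, look up the index entry for each search term in the inverted index dictionary, and from it reach the list of documents linked to that entry.
For example, the query "fileName:lucene" searches for documents whose fileName field contains lucene.
The search looks in the index for the term whose field is fileName and whose keyword is lucene, and uses that term to obtain the list of document ids.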
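
The lookup just described can be sketched in plain Java before turning to the real Lucene API. Everything here is an assumption made for illustration: the class TermLookupSketch, the "field:word" dictionary key, and a query parser that accepts only the simple field:keyword form. Lucene's own query syntax supports much more than this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a term dictionary keyed by field name and word.
class TermLookupSketch {
    // "field:word" -> ids of the documents whose field contains that word.
    private final Map<String, List<Integer>> dictionary = new HashMap<>();

    void index(int docId, String field, String value) {
        for (String word : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            List<Integer> ids = dictionary.computeIfAbsent(field + ":" + word, key -> new ArrayList<>());
            if (!ids.contains(docId)) {
                ids.add(docId);
            }
        }
    }

    // Accepts only "field:keyword"; the keyword is lowercased to match the indexed form.
    List<Integer> search(String query) {
        int colon = query.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected a query of the form field:keyword");
        }
        String field = query.substring(0, colon);
        String keyword = query.substring(colon + 1).toLowerCase(Locale.ROOT);
        return dictionary.getOrDefault(field + ":" + keyword, Collections.emptyList());
    }

    public static void main(String[] args) {
        TermLookupSketch index = new TermLookupSketch();
        index.index(0, "fileName", "lucene_intro.txt");
        index.index(1, "fileName", "spring.txt");
        index.index(1, "fileContent", "Spring and Lucene");

        System.out.println(index.search("fileName:lucene"));    // prints [0]
        System.out.println(index.search("fileContent:lucene")); // prints [1]
    }
}
```

The two queries return different documents because lucene in fileName and lucene in fileContent are different terms.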

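import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

// The class name is illustrative; the original post shows only the test method.
public class SearchIndexTest {

    @Test
    public void testSearcher() throws IOException {
        // Step 1: create a Directory, the location where the index library is stored.
        // Step 2: create an IndexReader on that Directory.
        try (Directory directory = FSDirectory.open(Paths.get("F:\\lucene\\indexDatabase"));
             IndexReader indexReader = DirectoryReader.open(directory)) {
            // Step 3: create an IndexSearcher on the IndexReader.
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            // Step 4: create a TermQuery with the field and keyword to search for.
            // The keyword is lowercase because StandardAnalyzer lowercased the indexed tokens.
            // Term term = new Term("fileName", "java");
            Term term = new Term("fileContent", "store");
            Query query = new TermQuery(term);
            // Step 5: execute the query, returning at most 13 hits.
            TopDocs topDocs = indexSearcher.search(query, 13);
            // Step 6: iterate over the results and print them.
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (ScoreDoc doc : scoreDocs) {
                int docIndex = doc.doc; // the document id maintained by Lucene
                Document document = indexSearcher.doc(docIndex);
                System.out.println(document.get("fileName"));
                // get() returns only stored values; this prints null if fileSize
                // was indexed as a doc-values field without a StoredField.
                System.out.println(document.get("fileSize"));
                System.out.println(document.get("filePath"));
                System.out.println(document.get("fileContent"));
                System.out.println("==========================");
            }
            // Step 7: the IndexReader is closed automatically at the end of the try block.
        }
    }
}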
