Elasticsearch: How to Search PDF Files

Date: 2020-07-06

How it works

We use the following flow to import a .pdf file into an Elasticsearch data node:

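In this flow, we first Base64-encode the .pdf file and then upload it to an ingest node in Elasticsearch for processing. The Ingest attachment plugin lets Elasticsearch extract file attachments in common formats such as PPT, XLS, and PDF. Finally, the data lands on an Elasticsearch data node, where we can search it.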
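To make the Base64 step concrete, here is a minimal shell sketch that encodes a short piece of text and decodes it again. The sample text and variable names are chosen for illustration only; the same round trip applies to the bytes of a PDF file.

```shell
#!/bin/bash
# Illustrative input; in the real flow this would be the bytes of the PDF.
text='I like this useful tool'

# Encode to Base64; tr removes any line wrapping that base64 may add.
encoded=$(printf '%s' "$text" | base64 | tr -d '\n')
echo "encoded: $encoded"

# Decode again to confirm that the round trip is lossless.
decoded=$(printf '%s' "$encoded" | base64 --decode)
echo "decoded: $decoded"
```

The encoded string contains only letters, digits, `+`, `/`, and `=`, which is why it can be embedded safely in a JSON string.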

Importing a PDF file into Elasticsearch

# Prepare the PDF file

We can use Word or any other editor to produce a PDF file. Let's call it sample.pdf. Its content is very simple: it holds a single sentence, "I like this useful tool".

# Install the Ingest attachment plugin

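The Ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using Tika, the Apache text extraction library. The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types. All of these file types can be parsed through a single interface, which makes Tika useful for search engine indexing, content analysis, translation, and more.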

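Note that the source field must be Base64-encoded binary. If you do not want to incur the overhead of converting back and forth between Base64, you can use the CBOR format instead of JSON and specify the field as a byte array rather than a string representation; the processor will then skip the Base64 decoding.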

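The plugin can be installed with the plugin manager. It must be installed on every node in the cluster, and each node must be restarted after installation.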

# Run from the Elasticsearch home directory on every node
bin/elasticsearch-plugin install ingest-attachment
# Check that the installation succeeded
bin/elasticsearch-plugin list

# Create the attachment pipeline

We can create a pipeline called pdfattachment on our ingest node:

PUT _ingest/pipeline/pdfattachment
{
  "description": "Extract attachment information encoded in Base64 with UTF-8 charset",
  "processors": [
    {
      "attachment": {
        "field": "file"
      }
    }
  ]
}
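Before indexing real documents, the pipeline can be tried out with the ingest simulate API, which runs the processors and returns the result without storing anything. The sketch below assumes that Elasticsearch is reachable at http://localhost:9200 and that a sample.pdf exists in the current directory; the function names `build_simulate_body` and `simulate_pdf` are hypothetical helpers, not part of any tool.

```shell
#!/bin/bash
# Assumption: Elasticsearch listens on localhost:9200.
ES_URL='http://localhost:9200'

# Build the request body for the simulate API from a Base64 string.
build_simulate_body() {
  printf '{"docs":[{"_source":{"file":"%s"}}]}' "$1"
}

# Send a file through the pdfattachment pipeline without indexing anything.
simulate_pdf() {
  local encoded
  encoded=$(base64 < "$1" | tr -d '\n')
  build_simulate_body "$encoded" |
    curl -s -XPOST "$ES_URL/_ingest/pipeline/pdfattachment/_simulate?pretty" \
      -H 'Content-Type: application/json' --data-binary @-
}
```

Calling `simulate_pdf sample.pdf` should return the document as the pipeline would produce it, which is a quick way to confirm that the plugin is installed and the pipeline definition is accepted.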

# Convert and upload the PDF file's content to Elasticsearch

The Ingest attachment plugin requires its input data to be Base64-encoded. We could convert the file on the Base64 encoder website, or do it directly with the script below:

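#!/bin/bash

# Base64-encode the PDF. GNU base64 wraps its output into lines, and raw
# line breaks are not valid inside a JSON string, so strip them.
encodedPdf=$(base64 < sample.pdf | tr -d '\n')

# Wrap the encoded content in a JSON document under the "file" field
json="{\"file\":\"${encodedPdf}\"}"

echo "$json" > json.file

# Index the document through the pdfattachment pipeline
curl -XPOST 'http://localhost:9200/pdf-test1/_doc?pipeline=pdfattachment&pretty' -H 'Content-Type: application/json' -d @json.file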

In the script above, we Base64-encode sample.pdf and generate a file called json.file. At the end, we upload the content of json.file to Elasticsearch with curl. We can then see an index called pdf-test1 in Elasticsearch.

Viewing the index and searching

We can query the pdf-test1 index with the following command:

GET pdf-test1/_search
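The request above returns every document in the index. As a sketch of an actual full-text search, the script below runs a match query against the extracted text. It assumes that Elasticsearch is reachable at http://localhost:9200 and that the pipeline uses the attachment processor's default target field, so the text is stored in `attachment.content`; the function names are hypothetical helpers.

```shell
#!/bin/bash
# Assumption: Elasticsearch listens on localhost:9200.
ES_URL='http://localhost:9200'

# Build a match query against the extracted text. The search term is
# inserted as-is, so it must not contain quotes or backslashes.
build_match_query() {
  printf '{"query":{"match":{"attachment.content":"%s"}}}' "$1"
}

# Search the pdf-test1 index for documents whose extracted text matches.
search_pdf() {
  build_match_query "$1" |
    curl -s -XGET "$ES_URL/pdf-test1/_search?pretty" \
      -H 'Content-Type: application/json' --data-binary @-
}
```

For example, `search_pdf useful` should find the document created from sample.pdf, because its only sentence contains the word "useful".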

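In the response to the GET pdf-test1/_search request, we can see that the document has a field called content (nested under attachment) that holds the text of our PDF file; this is the field we can search on. We can also see a very large field, file, which holds the Base64-encoded content. If we don't want to keep this field, we can add a remove processor to the pipeline to drop it: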

PUT _ingest/pipeline/pdfattachment
{
  "description": "Extract attachment information encoded in Base64 with UTF-8 charset",
  "processors": [
    {
      "attachment": {
        "field": "file"
      }
    },
    {
      "remove": {
        "field": "file"
      }
    }
  ]
}

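With this change, the file field is removed from documents indexed through the pipeline. Documents indexed before the change keep the field, so sample.pdf has to be uploaded again. The corrected index content is then: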