Elasticsearch: The Join Data Type

Date: 2020-07-25


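In Elasticsearch, the join data type lets us create parent/child relationships. Elasticsearch is not an RDBMS, and as a rule the join data type should be avoided unless there is no alternative. So why does Elasticsearch need a join data type at all?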

In Elasticsearch, updating an object requires a complete reindex of its root object:

Even when only a single character of a single field changes
Even a nested object needs a complete reindex before the change becomes searchable

In most cases this is perfectly fine, but in scenarios with frequent updates it can have a large impact on performance.

If your data needs frequent updates and that is hurting performance, the join data type may be a solution for you.

The join data type separates two objects completely while still preserving the relationship between them.

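The parent and child are two completely separate documents
The parent can be updated on its own without reindexing the children
Children can be added, changed, or deleted freely without affecting the parent or the other children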

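Like the nested type, a parent/child relationship lets you associate different entities, but the two differ in implementation and behavior. Unlike nested documents, parent and child documents do not live inside the same document; they are completely independent documents. They follow a one-to-many principle: one type is defined as the parent, and one or more types are defined as children.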

Although the join data type is convenient, it also brings extra memory and computational overhead at search time.

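Note: Kibana currently has only limited support for the nested and join data types. If you want to use Kibana to display this data in a dashboard, you need to take this into account. This may change in the future.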

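**The join data type is a special field that creates parent/child relationships between documents in the same index. The relations section defines a set of possible relations within the documents, each relation being a parent name and a child name.**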

An example:

PUT my_index
    {
      "mappings": {
        "properties": {
          "my_join_field": { 
            "type": "join",
            "relations": {
              "question": "answer" 
            }
          }
        }
      }
    }

Here we define an index named my_index. In it we define a field called my_join_field whose type is join, along with a single relation: question is the parent of answer.

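To index a document with a join, the name of the relation and, optionally, the parent of the document must be provided in the source. For instance, the following example creates two parent documents in the question context: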

PUT my_index/_doc/1?refresh
    {
      "text": "This is a question",
      "my_join_field": {
        "name": "question" 
      }
    }
     
    PUT my_index/_doc/2?refresh
    {
      "text": "This is another question",
      "my_join_field": {
        "name": "question"
      }
    }

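Here refresh forces a refresh so that the documents are immediately visible to the searches that follow. The name value is question, which marks the document as a question document.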

When indexing parent documents, you can specify just the name of the relation as a shortcut instead of wrapping it in the normal object notation:

PUT my_index/_doc/1?refresh
    {
      "text": "This is a question",
      "my_join_field": "question" 
    }
     
    PUT my_index/_doc/2?refresh
    {
      "text": "This is another question",
      "my_join_field": "question"
    }

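This is equivalent to the previous approach. The only difference is that we give just question, rather than an object like the following, as in the first approach: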

"my_join_field": {
        "name": "question"
      }

In practice, you can use whichever form you prefer.
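Client code that reads these documents back has to cope with both notations. The following is a minimal Python sketch of one way to normalize them; the helper name normalize_join_value is hypothetical and not part of any Elasticsearch client.

```python
from typing import Optional, Tuple, Union

JoinValue = Union[str, dict]


def normalize_join_value(value: JoinValue) -> Tuple[str, Optional[str]]:
    """Return (relation_name, parent_id) for either notation of a join field."""
    if isinstance(value, str):
        # Shortcut notation: only the relation name, so there is no parent id.
        return value, None
    # Object notation: "name" is required, "parent" is present only on children.
    return value["name"], value.get("parent")


shortcut = normalize_join_value("question")
full = normalize_join_value({"name": "question"})
print(shortcut == full)  # True: both notations describe the same parent document
print(normalize_join_value({"name": "answer", "parent": "1"}))  # ('answer', '1')
```

The last call shows the object notation carrying a parent id, which is the form a child document needs.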

When indexing a child, the name of the relation and the id of the document's parent must both be added in the _source.

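Note: the whole lineage of a parent must be indexed on the same shard, so a child must be routed using its parent's id to make sure that child and parent end up on the same shard. By default, the shard a document is assigned to is determined by hashing the document's id, although this can also be controlled with routing. For a child we route by its parent's id, which guarantees that they share a shard. The join is resolved per shard, because joining data across shards would be very expensive.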
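The routing rule can be illustrated with a toy model in Python. This is a simplified sketch: the shard count is made up, and crc32 is only a deterministic stand-in for the hash function Elasticsearch really uses.

```python
import zlib
from typing import Optional

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(routing: str, num_shards: int = NUM_SHARDS) -> int:
    """Toy shard selection: hash(routing) % num_shards.

    crc32 is only a deterministic stand-in; it is not the hash
    function Elasticsearch actually uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def effective_routing(doc_id: str, routing: Optional[str] = None) -> str:
    """Without an explicit routing value, the document's own id is used."""
    return routing if routing is not None else doc_id


parent_shard = shard_for(effective_routing("1"))
routed_child_shard = shard_for(effective_routing("3", routing="1"))
unrouted_child_shard = shard_for(effective_routing("3"))

print(parent_shard == routed_child_shard)  # True: both hash the value "1"
print(unrouted_child_shard)  # depends on the hash of "3"; may differ from the parent's shard
```

Because the child is hashed on the same value as its parent, the two always agree on a shard, whatever the hash function is.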

For example, the following shows how to index two child documents:

PUT my_index/_doc/3?routing=1&refresh  (1)
    {
      "text": "This is an answer",
      "my_join_field": {
        "name": "answer",   (2)
        "parent": "1"       (3)
      }
    }
     
    PUT my_index/_doc/4?routing=1&refresh
    {
      "text": "This is another answer",
      "my_join_field": {
        "name": "answer",
        "parent": "1"
      }
    }

At (1) we must use routing so that the parent and child are on the same shard. The routing value is 1 because the parent's id is 1, as set at (3). (2) gives the name of the join relation for this document.
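When many children are indexed from application code, it is easy to forget the routing parameter. The sketch below builds the request line and body for a child document in Python without sending anything; the function child_index_request and its parameters are hypothetical helpers, not part of an Elasticsearch client library.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode


def child_index_request(
    index: str,
    doc_id: str,
    join_field: str,
    relation: str,
    parent_id: str,
    source: dict,
    routing: Optional[str] = None,
) -> Tuple[str, str, dict]:
    """Build (method, path, body) for indexing a child document.

    routing defaults to the parent's id, which is right for a direct child
    of a top-level parent; deeper levels must pass the top-level id instead.
    """
    body = dict(source)
    body[join_field] = {"name": relation, "parent": parent_id}
    query = urlencode({
        "routing": routing if routing is not None else parent_id,
        "refresh": "true",
    })
    return "PUT", f"{index}/_doc/{doc_id}?{query}", body


method, path, body = child_index_request(
    "my_index", "3", "my_join_field", "answer", "1",
    {"text": "This is an answer"},
)
print(method, path)  # PUT my_index/_doc/3?routing=1&refresh=true
print(body["my_join_field"])  # {'name': 'answer', 'parent': '1'}
```

Deriving the routing value from the parent id in one place keeps the two from drifting apart.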

Parent-join and performance

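The join field should not be used like joins in a relational database. In Elasticsearch, the key to good performance is to denormalize your data into documents. Each join field, and each has_child or has_parent query, adds a significant cost to query performance.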

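The only case where the join field makes sense is when your data contains a one-to-many relationship in which one entity significantly outnumbers the other. An example is a use case with products and offers for those products. If offers significantly outnumber products, it makes sense to model the product as the parent document and the offer as the child document.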

Parent-join restrictions

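Only one join field is allowed per index
Parent and child documents must be indexed on the same shard. This also means that the same routing value must be provided when getting, deleting, or updating a child document.
An element can have multiple children, but only one parent.
A new relation can be added to an existing join field
A child can also be added to an existing element, but only if that element is already a parent.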
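The "only one parent" rule can be checked on a relations definition before it is sent to the cluster. The following Python sketch inverts a relations mapping into a child-to-parent lookup and rejects a child that is listed under two parents. It mirrors the rule as stated above and is not Elasticsearch's own validation code; the function name parent_lookup is hypothetical.

```python
from typing import Dict, List, Union

Relations = Dict[str, Union[str, List[str]]]


def parent_lookup(relations: Relations) -> Dict[str, str]:
    """Invert a join `relations` mapping into child -> parent.

    Raises ValueError if a child name is listed under more than one parent.
    """
    parents: Dict[str, str] = {}
    for parent, children in relations.items():
        # A single child may be given as a plain string instead of a list.
        if isinstance(children, str):
            children = [children]
        for child in children:
            if child in parents:
                raise ValueError(
                    f"[{child}] has more than one parent: "
                    f"[{parents[child]}] and [{parent}]"
                )
            parents[child] = parent
    return parents


lookup = parent_lookup({"question": ["answer", "comment"], "answer": "vote"})
print(lookup)  # {'answer': 'question', 'comment': 'question', 'vote': 'answer'}
```

Note that answer appears both as a child of question and as a parent of vote, which is allowed; only a second parent for the same child is rejected.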

Searching with parent-join

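The parent-join creates one field to index the name of the relation within the document (my_parent, my_child, ...).

It also creates one field per parent/child relation. The name of this field is the name of the join field, followed by # and the name of the parent in the relation. So, for example, for the relation my_parent ⇒ [my_child, another_child], the join field creates an additional field named my_join_field#my_parent.

If the document is a child (my_child or another_child), this field contains the _id of the parent that the document links to; if the document is a parent (my_parent), it contains the document's own _id.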
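The naming rule for these extra fields can be written down as a small Python model. This is only a sketch of the behavior described above, not Elasticsearch's implementation, and the function name parent_id_fields is hypothetical.

```python
from typing import Dict, List, Optional, Union

Relations = Dict[str, Union[str, List[str]]]


def parent_id_fields(
    join_field: str,
    relations: Relations,
    doc_id: str,
    name: str,
    parent_id: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Model the extra `<join_field>#<parent>` fields for one document."""
    fields: Dict[str, Optional[str]] = {}
    for parent, children in relations.items():
        if isinstance(children, str):
            children = [children]
        key = f"{join_field}#{parent}"
        if name == parent:
            # A parent document stores its own _id.
            fields[key] = doc_id
        elif name in children:
            # A child document stores the _id of the parent it links to.
            fields[key] = parent_id
    return fields


relations = {"question": "answer"}
print(parent_id_fields("my_join_field", relations, "1", "question"))
# {'my_join_field#question': '1'}
print(parent_id_fields("my_join_field", relations, "3", "answer", parent_id="1"))
# {'my_join_field#question': '1'}
```

A question and its answers end up with the same value in my_join_field#question, which is what makes it possible to group them by parent.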

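When searching an index that contains a join field, these two fields are always returned in the search response.

That description is a bit of a mouthful, so let's walk through an example: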

GET my_index/_search
    {
      "query": {
        "match_all": {}
      },
      "sort": ["_id"]
    }

Here we search for all documents and sort them by _id:

{
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 4,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : null,
            "_source" : {
              "text" : "This is a question",
              "my_join_field" : "question" (1)
            },
            "sort" : [
              "1"
            ]
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : null,
            "_source" : {
              "text" : "This is another question",
              "my_join_field" : "question" (2)
            },
            "sort" : [
              "2"
            ]
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : null,
            "_routing" : "1",
            "_source" : {
              "text" : "This is an answer",
              "my_join_field" : {
                "name" : "answer", (3)
                "parent" : "1"     (4)
              }
            },
            "sort" : [
              "3"
            ]
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : null,
            "_routing" : "1",
            "_source" : {
              "text" : "This is another answer",
              "my_join_field" : {
                "name" : "answer",
                "parent" : "1"
              }
            },
            "sort" : [
              "4"
            ]
          }
        ]
      }
    }

Here we can see four documents:

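(1) indicates that this document is a question join
(2) indicates that this document is a question join
(3) indicates that this document is an answer join
(4) indicates that this document's parent is the document with id 1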

Parent-join queries and aggregations

The value of the join field is accessible in aggregations and scripts, and it can be queried with the parent_id query:

GET my_index/_search
    {
      "query": {
        "parent_id": { 
          "type": "answer",
          "id": "1"
        }
      }
    }

By querying on parent_id, we get back all answer documents whose parent id is 1:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.35667494,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.35667494,
        "_routing" : "1",
        "_source" : {
          "text" : "This is another answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.35667494,
        "_routing" : "1",
        "_source" : {
          "text" : "This is an answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        }
      }
    ]
  }
}

Here we can see that the documents with ids 3 and 4 are returned. We can also run an aggregation over these documents:

GET my_index/_search
    {
      "query": {
        "parent_id": {
          "type": "answer",
          "id": "1"
        }
      },
      "aggs": {
        "parents": {
          "terms": {
            "field": "my_join_field#question",
            "size": 10
          }
        }
      },
      "script_fields": {
        "parent": {
          "script": {
            "source": "doc['my_join_field#question']"
          }
        }
      }
    }

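As described in the previous section, in our example an extra field is also created at index time, even though it does not appear in the source. This field is my_join_field#question, and it holds the parent _id. In the query above, we first find all answer documents whose parent id is 1, and then aggregate those documents by parent id: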

{
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 0.35667494,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 0.35667494,
            "_routing" : "1",
            "fields" : {
              "parent" : [
                "1"
              ]
            }
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 0.35667494,
            "_routing" : "1",
            "fields" : {
              "parent" : [
                "1"
              ]
            }
          }
        ]
      },
      "aggregations" : {
        "parents" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "1",
              "doc_count" : 2
            }
          ]
        }
      }
    }
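To make the logic of this query and aggregation concrete, here is a small in-memory imitation in Python over the four documents from this article. It only imitates the filtering and counting; scoring and hit ordering are not modeled, and the function names are hypothetical.

```python
from collections import Counter

JOIN_FIELD = "my_join_field"

# The four documents indexed earlier, reduced to _id and the join field.
docs = [
    {"_id": "1", JOIN_FIELD: "question"},
    {"_id": "2", JOIN_FIELD: "question"},
    {"_id": "3", JOIN_FIELD: {"name": "answer", "parent": "1"}},
    {"_id": "4", JOIN_FIELD: {"name": "answer", "parent": "1"}},
]


def name_and_parent(value):
    """Return (relation_name, parent_id) for either join notation."""
    if isinstance(value, str):
        return value, None
    return value["name"], value.get("parent")


def parent_id_query(documents, child_type, parent_id):
    """Keep documents of the given child type that link to parent_id."""
    hits = []
    for doc in documents:
        name, parent = name_and_parent(doc[JOIN_FIELD])
        if name == child_type and parent == parent_id:
            hits.append(doc)
    return hits


def terms_on_parent(hits):
    """Count hits per parent id, like a terms aggregation on the parent field."""
    counts = Counter(name_and_parent(doc[JOIN_FIELD])[1] for doc in hits)
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


hits = parent_id_query(docs, "answer", "1")
print([doc["_id"] for doc in hits])  # ['3', '4']
print(terms_on_parent(hits))  # [{'key': '1', 'doc_count': 2}]
```

The single bucket with key 1 and doc_count 2 matches the aggregations section of the response above.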

One parent with multiple children

A single parent can have multiple children defined, for example:

PUT my_index
    {
      "mappings": {
        "properties": {
          "my_join_field": {
            "type": "join",
            "relations": {
              "question": ["answer", "comment"]  
            }
          }
        }
      }
    }

Here, question is the parent of both answer and comment.

Multiple levels of parent join

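Multiple levels of relations are possible, although this is not recommended, because each level can add more memory and computational overhead at query time: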

PUT my_index
    {
      "mappings": {
        "properties": {
          "my_join_field": {
            "type": "join",
            "relations": {
              "question": ["answer", "comment"],  
              "answer": "vote" 
            }
          }
        }
      }
    }

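Here question is the parent of answer and comment, and answer is in turn the parent of vote. This gives a three-level hierarchy: question at the top, answer and comment as its children, and vote as a child of answer.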

Indexing a grandchild document requires a routing value equal to the id of the grand-parent (the top-most parent in the lineage):

PUT my_index/_doc/3?routing=1&refresh 
    {
      "text": "This is a vote",
      "my_join_field": {
        "name": "vote",
        "parent": "2" 
      }
    }

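This child document must be on the same shard as its grand-parent, so the routing value is 1, the id of the question. The parent value of a vote, on the other hand, must be the id of its direct parent, which has to be an answer document (here 2, assumed to be an answer in this index).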
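With several levels, the routing value is no longer the direct parent's id, so it helps to compute it. The Python sketch below walks child-to-parent links up to the top-most ancestor. The ids and the function name root_ancestor are hypothetical, and it assumes the application keeps its own record of which document is the parent of which.

```python
from typing import Dict


def root_ancestor(doc_id: str, parent_of: Dict[str, str]) -> str:
    """Follow child -> parent links until a document without a parent is reached."""
    current = doc_id
    seen = set()
    while current in parent_of:
        if current in seen:
            raise ValueError(f"cycle detected at document [{current}]")
        seen.add(current)
        current = parent_of[current]
    return current


# Hypothetical lineage: answer "3" belongs to question "1".
parent_of = {"3": "1"}

# A new vote whose direct parent is answer "3":
vote_parent = "3"
routing = root_ancestor(vote_parent, parent_of)
print(routing)  # 1 -> route the vote by the question's id, not the answer's
```

The vote would then be indexed with parent set to 3 and routing set to 1, so that all three levels share one shard.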