中午12点半,接到了线上MongoDB 数据库异常的告警通报:node
“CPU不间断飙升到百分百,业务也相应出现了抖动现象。”mongodb
经过排查数据库主节点的日志,发现了这样的一个慢语句:数据库
2019-03-07T10:56:43.470+0000 I COMMAND [conn2544772] command nlp.ApplicationDevice appName: "nlp" command: find { find: "ApplicationDevice", filter: { appId: "Gf93VvCfOdttrxSOemt_03ff", tags.tagName: "pipeline", tags.tagValue: "multi", _id: { $gt: ObjectId('000000000000000000000000') } }, projection: { $sortKey: { $meta: "sortKey" } }, sort: { _id: 1 }, limit: 10, shardVersion: [ Timestamp 1000|1000, ObjectId('5c64f941c81e2b922e48e347') ] } planSummary: IXSCAN { appId: 1, tags.tagName: 1, tags.tagValue: 1, _id: 1 } keysExamined:1000002 docsExamined:1000001 hasSortStage:1 cursorExhausted:1 numYields:7829 nreturned:10 reslen:7102 locks:{ Global: { acquireCount: { r: 15660 } }, Database: { acquireCount: { r: 7830 } }, Collection: { acquireCount: { r: 7830 } } } protocol:op_command 4008ms
从语句中初步判断,"keysExamined"和docsExamined 显示扫描了100W 条记录,其中也用到了下面的索引:数组
{ appId: 1, tags.tagName: 1, tags.tagValue: 1, _id: 1 }
跟研发兄弟确认事后,该查询的目的是 找到某些应用下带指定标签的设备信息,按ID分段去获取,每次只查询10条。
关于索引的设计也已经确认过是最优的了,并且此前在开发环境中一直没有出现过问题,不知道为何此次就出问题了。app
接下来,看了下该集合的模型,大体是长下面的样子:测试
/* 1 */ { "appId" : "Gf93VvCfOdttrxSOemt_03ff", "deviceId" : "bbc-lmc-03991933", "nodeType" : "FACTORY", "creationTime" : ISODate("2019-03-01T10:11:39.852Z"), "lastModifiedTime" : ISODate("2019-03-03T10:45:40.249Z"), "tags" : [ { "tagName" : "pipeline", "tagValue" : "multi", "tagType" : 1 } ], ... }
说明
除了其余的属性以外,tags字段采用了嵌套文档数组的结构;
每个元素都对应了一个tag对象,包含 tagName/tagValue/tagType几个字段。ui
而后是查询的模式:设计
//过滤条件 { appId: "Gf93VvCfOdttrxSOemt_03ff", tags.tagName: "pipeline", tags.tagValue: "multi", _id: { $gt: ObjectId('000000000000000000000000') } //排序 sort: { _id: 1 } //限制条数 limit: 10
这从索引的前缀匹配来看,是应该没有问题的,索引的定义以下所示:日志
{ appId: 1, tags.tagName: 1, tags.tagValue: 1, _id: 1 }
为了不对线上环境形成影响,咱们找了一个测试环境来作了尝试复现,执行:code
db.ApplicationDevice.find({ "appId":"Gf93VvCfOdttrxSOemt_03ff", "tags.tagName":"pipeline", "tags.tagValue":"multi", _id:{$gt:ObjectId("000000000000000000000000")}}) .sort({"_id" : 1}) .explain()
结果却跟线上的状况不大同样,此次选中的是**_id**索引!
"winningPlan" : { "stage" : "LIMIT", "limitAmount" : 10, "inputStage" : { "stage" : "SHARDING_FILTER", "inputStage" : { "stage" : "FETCH", "filter" : { "$and" : [ { "appId" : { "$eq" : "Gf93VvCfOdttrxSOemt_03ff" } }, { "tags.tagName" : { "$eq" : "pipeline" } }, { "tags.tagValue" : { "$eq" : "multi" } } ] }, "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "_id" : 1 }, "indexName" : "_id_", "isMultiKey" : false, "multiKeyPaths" : { "_id" : [ ] }, "isUnique" : true, "isSparse" : false, "isPartial" : false, "indexVersion" : 2, "direction" : "forward", "indexBounds" : { "_id" : [ "(ObjectId('000000000000000000000000'), ObjectId('ffffffffffffffffffffffff')]" ] } } } }
而一样的是也扫描了100W+的记录数,因而你们认为可能索引的选择器出了问题,但就算是选择器的问题也仍然没办法解释线上出现的现象(线上的索引但是命中的)
为了一探究竟,咱们使用 hint 强制让查询命中 **appId_1_tags.tagName_1_tags.tagValue_1__id_1**这个索引:
db.ApplicationDevice.find({ "appId":"Gf93VvCfOdttrxSOemt_03ff", "tags.tagName":"pipeline","tags.tagValue":"multi", _id:{$gt:ObjectId("000000000000000000000000")}}) .sort({"_id" : 1}).limit(10) .hint("appId_1_tags.tagName_1_tags.tagValue_1__id_1") .explain("executionStats")
这一次的结果显示确实命中了对应的索引:
"winningPlan" : { "stage" : "SORT", "sortPattern" : { "_id" : 1.0 }, "limitAmount" : 10, "inputStage" : { "stage" : "SORT_KEY_GENERATOR", "inputStage" : { "stage" : "FETCH", "filter" : { "tags.tagValue" : { "$eq" : "multi" } }, "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "appId" : 1.0, "tags.tagName" : 1.0, "tags.tagValue" : 1.0, "_id" : 1.0 }, "indexName" : "appId_1_tags.tagName_1_tags.tagValue_1__id_1" ... "executionStats" : { "executionSuccess" : true, "nReturned" : 10, "executionTimeMillis" : 3665, "totalKeysExamined" : 1000002, "totalDocsExamined" : 1000001, "executionStages" : { "stage" : "SORT", "nReturned" : 10, "executionTimeMillisEstimate" : 3513, "works" : 1000014, "sortPattern" : { "_id" : 1.0 }, "memUsage" : 6660, "memLimit" : 33554432, "limitAmount" : 10, "inputStage" : { "stage" : "SORT_KEY_GENERATOR", "nReturned" : 500001, "executionTimeMillisEstimate" : 3333, "works" : 1000003, "advanced" : 500001, "needTime" : 500001, "inputStage" : { "stage" : "FETCH", "filter" : { "tags.tagValue" : { "$eq" : "multi" } }, "nReturned" : 500001, "executionTimeMillisEstimate" : 3087, "works" : 1000002, "advanced" : 500001, "needTime" : 500000, "docsExamined" : 1000001, "alreadyHasObj" : 0, "inputStage" : { "stage" : "IXSCAN", "nReturned" : 1000001, "executionTimeMillisEstimate" : 1117, "works" : 1000002, "advanced" : 1000001, "keyPattern" : { "appId" : 1.0, "tags.tagName" : 1.0, "tags.tagValue" : 1.0, "_id" : 1.0 }, "indexName" : "appId_1_tags.tagName_1_tags.tagValue_1__id_1", "isMultiKey" : true, "multiKeyPaths" : { "appId" : [], "tags.tagName" : [ "tags" ], "tags.tagValue" : [ "tags" ], "_id" : [] }, "indexBounds" : { "appId" : [ "[\"Gf93VvCfOdttrxSOemt_03ff\", \"Gf93VvCfOdttrxSOemt_03ff\"]" ], "tags.tagName" : [ "[\"pipeline\", \"pipeline\"]" ], "tags.tagValue" : [ "[MinKey, MaxKey]" ], "_id" : [ "(ObjectId('000000000000000000000000'), ObjectId('ffffffffffffffffffffffff')]" ] }, "keysExamined" : 1000002, "dupsTested" : 1000001, } } }
然而,在整个执行过程当中(executionStats),出现了内存排序(SORT)。
并且,从一开始命中** appId_1_tags.tagName_1_tags.tagValue_1__id_1 **这个索引的阶段中,就已经扫描了100W条记录,简直难以想象!
但同时,咱们也从indexBounds的指示中找到了端倪:
appId、tags.tagName 都命中了单值,在 tags.tagValue 的路径节点上却覆盖了所有范围!
**因为中间索引节点出现了大范围覆盖,致使最终须要在内存中对大量的数据作 _id字段的排序**,这个就是致使慢操做的缘由!
既然从前面的分析中找到了问题的来源,咱们的推论以下:
再次拿出前面的查询条件:
{ appId: "Gf93VvCfOdttrxSOemt_03ff", tags.tagName: "pipeline", tags.tagValue: "multi", _id: { $gt: ObjectId('000000000000000000000000') }
在索引的匹配中,只能单键命中tags.tagName: "pipeline" 这一个条件,那么因为 tags是一个嵌套文档的数组,
对于上面的查询,语义上是指那些 包含某个元素 可命中tagName,且包含某个元素 可命中 tagValue的文档,这里面并不要求 同一个元素同时命中tagName和tagValue。
但 MongoDB 在嵌套数组索引的构建上是按照同一个元素的字段组合去构建的。 关于这点,能够参考下面的地址:
https://docs.mongodb.com/manual/core/index-multikey/#multikey-embedded-documents
对于数组元素的条件匹配,应该使用 $elemMatch,其用法可参考这里
为此,咱们构建了下面的查询:
db.ApplicationDevice.find({ "appId":"Gf93VvCfOdttrxSOemt_03ff", "tags": {$elemMatch: { "tagName":"pipeline","tagValue":"multi" }}, _id:{$gt:ObjectId("000000000000000000000000")}}) .sort({"_id" : 1}).limit(10) .explain("executionStats")
执行后输出以下:
"winningPlan" : { "stage" : "LIMIT", "limitAmount" : 10, "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "appId" : 1.0, "tags.tagName" : 1.0, "tags.tagValue" : 1.0, "_id" : 1.0 }, "indexName" : "appId_1_tags.tagName_1_tags.tagValue_1__id_1", } }, "executionStats" : { "executionSuccess" : true, "nReturned" : 10, "executionTimeMillis" : 3, "totalKeysExamined" : 10, "totalDocsExamined" : 10, "executionStages" : { "stage" : "LIMIT", "nReturned" : 10, "inputStage" : { "stage" : "FETCH", "filter" : {...}, "nReturned" : 10,, "inputStage" : { "stage" : "IXSCAN", "nReturned" : 10, "executionTimeMillisEstimate" : 0, "works" : 10, "advanced" : 10, "isEOF" : 0, "indexName" : "appId_1_tags.tagName_1_tags.tagValue_1__id_1", "isMultiKey" : true, "indexBounds" : { "appId" : [ "[\"Gf93VvCfOdttrxSOemt_03ff\", \"Gf93VvCfOdttrxSOemt_03ff\"]" ], "tags.tagName" : [ "[\"pipeline\", \"pipeline\"]" ], "tags.tagValue" : [ "[\"multi\", \"multi\"]" ], "_id" : [ "(ObjectId('000000000000000000000000'), ObjectId('ffffffffffffffffffffffff')]" ] } ...
这个结果是使人满意的,除了自动命中合适的索引以外,这个查询过程也达到了最优的路径匹配,扫描记录数才10条!
最后,根据该方案调整了查询模式,线上的问题获得恢复。
看似很简单的一个查询语句,没想到会出现这么大的坑,其实不管是做为开发人员仍是DBA,都应当谨慎对待你的SQL。 重要的事情说三遍!!! SQl查询上线前务必 explain、务必分析到位,这难道没有道理?