Meeting Advanced Requirements When Applying SARIF

Date: 2021-04-06

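Summary: To reduce the cost and complexity of aggregating the results of different analysis tools into a common workflow, the industry has begun to adopt the Static Analysis Results Interchange Format (SARIF).

This article is shared from the Huawei Cloud community post "The Bridge Between DevSecOps Tools and Platforms: Advanced SARIF". Original author: Uncle_Tom.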

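1. Introduction

DevSecOps has become an important model for building enterprise-grade secure development. Static scanning tools are integrated into the DevSecOps development process and play an important role in raising the overall security level of a product. To maximize the coverage of security checks, development teams usually introduce several security scanning tools. This, however, creates more problems for developers and platforms. To reduce the cost and complexity of aggregating the results of different analysis tools into a common workflow, the industry has begun to adopt the Static Analysis Results Interchange Format (SARIF). This article is the advanced part of a two-part series on applying SARIF (basics and advanced) and describes how SARIF meets deeper requirements in practice. For an introduction to the basics of SARIF, see "The Bridge Between DevSecOps Tools and Platforms: Getting Started with SARIF".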

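2. Advanced SARIF

The previous article covered some basic uses of SARIF. Here we look at how SARIF is applied in more complex scenarios, so that it can serve as a complete reporting solution for static scanning tools.

The latest release (2021.03) of Coverity, a well-known static analysis tool, added support for displaying Coverity scan results in SARIF format in GitHub repositories. Coverity, too, has thus adapted to the SARIF format.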

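2.1. Using metadata

To keep scan reports from growing too large, information that is used repeatedly should be extracted as metadata, for example rules, rule messages, and the scanned content.

In the example below, the rule and its message are defined in tool.driver.rules. Each scan result (results) refers to the rule by its identifier ruleId and obtains the alert text through message.id. This avoids repeating the same message for every alert a rule produces and effectively reduces the size of the report.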

The example log below can be opened in VS Code:

 
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "This is the message text. It might be very long."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default"
          }
        }
      ]
    }
  ]
} 
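
A consumer of such a log has to follow the reference back into the metadata to recover the text. The sketch below is one possible way to do that, not a reference implementation: the function name resolve_message_text is hypothetical, and it assumes the log has the shape of the example above (rules under tool.driver.rules, message strings keyed by message.id).

```python
import json

# The example log from this section, embedded so the sketch is self-contained.
SARIF_LOG = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "CodeScanner", "rules": [
      {"id": "CS0001", "messageStrings": {
        "default": {"text": "This is the message text. It might be very long."}}}
    ]}},
    "results": [
      {"ruleId": "CS0001", "ruleIndex": 0, "message": {"id": "default"}}
    ]
  }]
}
""")


def resolve_message_text(run, result):
    """Return a result's message text, following message.id into the rule metadata."""
    message = result["message"]
    if "text" in message:
        # Inline text needs no lookup.
        return message["text"]
    rules = run["tool"]["driver"].get("rules", [])
    index = result.get("ruleIndex", -1)
    if 0 <= index < len(rules):
        rule = rules[index]
    else:
        # Fall back to a search by rule id when ruleIndex is absent.
        rule = next(r for r in rules if r["id"] == result["ruleId"])
    return rule["messageStrings"][message["id"]]["text"]


run = SARIF_LOG["runs"][0]
texts = [resolve_message_text(run, r) for r in run["results"]]
print(texts)
```

However many results refer to CS0001, the message text is stored once in the rule, which is where the size saving comes from.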
 

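2.2. Using message arguments

Alerts often need to name the specific variable or function involved in a code problem so that users can understand the issue. Message arguments make this possible by giving defect messages variable parts.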

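In the example below, the rule's message uses a placeholder ("{0}") as a template, and the scan result (results) supplies the corresponding values through the arguments array. The example log below can be opened in VS Code: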

 
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "rules": [
            {
              "id": "CS0001",
              "messageStrings": {
                "default": {
                  "text": "Variable '{0}' was used without being initialized."
                }
              }
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CS0001",
          "ruleIndex": 0,
          "message": {
            "id": "default",
            "arguments": [
              "x"
            ]
          }
        }
      ]
    }
  ]
} 
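
The substitution a viewer performs can be sketched in a few lines. This is a simplified illustration: fill_placeholders is a hypothetical helper, and it only handles plain {n} placeholders, not any escaping rules for literal braces.

```python
import re

# Matches placeholders such as {0}, {1}, ...
PLACEHOLDER = re.compile(r"\{(\d+)\}")


def fill_placeholders(template, arguments):
    """Replace each {n} in the template with arguments[n]."""
    def substitute(match):
        return arguments[int(match.group(1))]
    return PLACEHOLDER.sub(substitute, template)


template = "Variable '{0}' was used without being initialized."
rendered = fill_placeholders(template, ["x"])
print(rendered)  # Variable 'x' was used without being initialized.
```

A regular expression is used here instead of str.format so that other braces in a message do not raise an error.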
 

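2.3. Using related information in messages

Sometimes, to better explain why an alert was raised, users need additional reference information to help them understand the problem, for example where a variable is defined, where tainted data was introduced, or other auxiliary information.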

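In the example below, the related locations (relatedLocations) that accompany the location of the problem (locations) give the position where the tainted data was introduced. When the log is viewed in VS Code and the user clicks "here", the tool jumps to the position where the variable expr was introduced.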

 
{
  "ruleId": "PY2335",
  "message": {
    "text": "Use of tainted variable 'expr' (which entered the system [here](1)) in the insecure function 'eval'."
  },
  "locations": [
    {
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 4
        }
      }
    }
  ],
  "relatedLocations": [
    {
      "id": 1,
      "message": {
        "text": "The tainted data entered the system here."
      },
      "physicalLocation": {
        "artifactLocation": {
          "uri": "3-Beyond-basics/bad-eval.py"
        },
        "region": {
          "startLine": 3
        }
      }
    }
  ]
} 
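
How a viewer might turn the embedded link [here](1) into a jump target can be sketched as follows. The helper resolve_links and the regular expression are illustrative assumptions; the sketch only handles links whose target is an integer id of an entry in relatedLocations, as in the example above.

```python
import re

# Matches embedded links of the form [link text](id) with an integer id.
EMBEDDED_LINK = re.compile(r"\[([^\]]+)\]\((\d+)\)")

result = {
    "message": {
        "text": "Use of tainted variable 'expr' (which entered the system "
                "[here](1)) in the insecure function 'eval'."
    },
    "relatedLocations": [{
        "id": 1,
        "message": {"text": "The tainted data entered the system here."},
        "physicalLocation": {
            "artifactLocation": {"uri": "3-Beyond-basics/bad-eval.py"},
            "region": {"startLine": 3},
        },
    }],
}


def resolve_links(result):
    """Map each embedded link in the message to (link text, uri, start line)."""
    by_id = {loc["id"]: loc for loc in result.get("relatedLocations", [])}
    targets = []
    for text, loc_id in EMBEDDED_LINK.findall(result["message"]["text"]):
        physical = by_id[int(loc_id)]["physicalLocation"]
        targets.append((text,
                        physical["artifactLocation"]["uri"],
                        physical["region"]["startLine"]))
    return targets


links = resolve_links(result)
print(links)
```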
 

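2.4. Using defect classification information

Classifying defects is very important both for tools and for the analysis of scan results. A tool can manage its rules by defect category, which makes it easy for users to select the rules they need. Users, in turn, can filter analysis results quickly by defect category when reading a report. A tool can follow an industry standard, such as the widely used Common Weakness Enumeration (CWE), or define its own classification. SARIF supports both.

An example of defect classification: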

 
{
  "version": "2.1.0",
  "runs": [
    {
      "taxonomies": [
        {
          "name": "CWE",
          "version": "3.2",
          "releaseDateUtc": "2019-01-03",
          "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82",
          "informationUri": "https://cwe.mitre.org/data/published/cwe_v3.2.pdf/",
          "downloadUri": "https://cwe.mitre.org/data/xml/cwec_v3.2.xml.zip",
          "organization": "MITRE",
          "shortDescription": {
            "text": "The MITRE Common Weakness Enumeration"
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "name": "Memory Leak",
              "shortDescription": {
                "text": "Missing Release of Memory After Effective Lifetime"
              },
              "defaultConfiguration": {
                "level": "warning"
              }
            }
          ],
          "isComprehensive": false
        }
      ],
      "tool": {
        "driver": {
          "name": "CodeScanner",
          "supportedTaxonomies": [
            {
              "name": "CWE",
              "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
            }
          ],
          "rules": [
            {
              "id": "CA2101",
              "shortDescription": {
                "text": "Failed to release dynamic memory."
              },
              "relationships": [
                {
                  "target": {
                    "id": "401",
                    "guid": "10F28368-3A92-4396-A318-75B9743282F6",
                    "toolComponent": {
                      "name": "CWE",
                      "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
                    }
                  },
                  "kinds": [
                    "superset"
                  ]
                }
              ]
            }
          ]
        }
      },
      "results": [
        {
          "ruleId": "CA2101",
          "message": {
            "text": "Memory allocated in variable 'p' was not released."
          },
          "taxa": [
            {
              "id": "401",
              "guid": "10F28368-3A92-4396-A318-75B9743282F6",
              "toolComponent": {
                "name": "CWE",
                "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
              }
            }
          ]
        }
      ]
    }
  ]
} 
 

2.4.1. Introducing an industry classification standard (runs.taxonomies)

Definition of taxonomies:

 
   "taxonomies": {
    "description": "An array of toolComponent objects relevant to a taxonomy in which results are categorized.",
    "type": "array",
    "minItems": 0,
    "uniqueItems": true,
    "default": [],
    "items": {
      "$ref": "#/definitions/toolComponent"
    }
  }, 
 

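The taxonomies node is an array, so multiple classification standards can be defined. Its items refer to the toolComponent definition under definitions, the same definition used by the tool's scan engine (tool.driver) and tool extensions (tool.extensions) described earlier. The reason for this design is that the engine and the results are strongly related, and sharing the definition keeps their properties consistent.

Defining a standard taxonomy
The example declares the industry classification standard CWE through the runs.taxonomies node. Properties of the taxonomies entry describe the standard. The ones below are only a sample; see the SARIF specification for the details:

name: name of the standard;
version: version;
releaseDateUtc: release date;
guid: unique identifier, so that the standard can be referenced elsewhere;
informationUri: documentation for the standard;
downloadUri: download address;
organization: publishing organization;
shortDescription: short description of the standard.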

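2.4.2. Introducing custom categories (runs.taxonomies.taxa)

taxa is an array. To keep the report small, there is no need to put every category under taxa; listing only the categories relevant to this scan is enough. This is also why the isComprehensive property, which indicates whether the list is complete, defaults to false.

The example uses taxa to introduce one category the tool needs, CWE-401 (memory leak), and identifies it uniquely with guid and id so that rules or defects can refer to it later.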

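2.4.3. Associating the tool with an industry classification standard (tool.driver.supportedTaxonomies)

The tool object is associated with the defined industry classification standards through tool.driver.supportedTaxonomies. The elements of the supportedTaxonomies array are toolComponentReference objects, because a taxonomy is itself a toolComponent object. The toolComponentReference.guid property matches the guid property of a taxonomy object defined in run.taxonomies[].

In the example, supportedTaxonomies contains an entry with the name CWE, which indicates that the tool supports the CWE taxonomy. It references the guid of taxonomies[0], A9282C88-F1FE-4A01-8137-E8D2A037AB82, which ties it to the industry standard CWE.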
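
The guid matching described above lends itself to a simple consistency check. The sketch below is an illustration under assumptions: the function name unresolved_taxonomy_refs is hypothetical, and comparing guids case-insensitively is a choice made here, not something taken from the article.

```python
def unresolved_taxonomy_refs(run):
    """Return supportedTaxonomies entries whose guid matches no run.taxonomies entry."""
    known = {t["guid"].lower() for t in run.get("taxonomies", []) if "guid" in t}
    refs = run["tool"]["driver"].get("supportedTaxonomies", [])
    # Assumption: guids are compared case-insensitively.
    return [ref for ref in refs if ref.get("guid", "").lower() not in known]


run = {
    "taxonomies": [
        {"name": "CWE", "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"},
    ],
    "tool": {"driver": {
        "name": "CodeScanner",
        "supportedTaxonomies": [
            {"name": "CWE", "guid": "A9282C88-F1FE-4A01-8137-E8D2A037AB82"},
        ],
    }},
}
missing = unresolved_taxonomy_refs(run)
print(missing)  # an empty list means every reference resolves
```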

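2.5. Associating rules with defect categories (rule.relationships)

Rules are defined under tool.driver.rules. rules is an array, and each element is a reportingDescriptor object that defines one rule;
Each rule (reportingDescriptor) has a relationships array. Each element is a reportingDescriptorRelationship object that establishes a relationship from this rule to another reportingDescriptor object. The target of the relationship can be a taxon in a taxonomy (as in this example) or another rule in another tool component;
The target property of a relationship (reportingDescriptorRelationship) identifies the target of the relationship. Its value is a reportingDescriptorReference object, which refers to a reportingDescriptor within a toolComponent;
The toolComponent property of the reportingDescriptorReference object is a toolComponentReference object that points to the taxonomy declared in the tool's supportedTaxonomies.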

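In the example, the rule and the defect category are associated as follows (the original article shows this as a figure): rule CA2101 -> relationships[0].target (id 401) -> target.toolComponent (CWE) -> tool.driver.supportedTaxonomies[0] -> run.taxonomies[0] -> taxa[0] (CWE-401).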

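2.5.1. Categories in scan results (result.taxa)

In the scan results (run.results), each result has a taxa property. taxa is an array whose elements are reportingDescriptorReference objects that specify the categories of the defect, in the same way a rule refers to its categories. This also shows that taxa can be omitted from a result, with the defect's category obtained through the rule instead.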
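
The two lookup paths, result.taxa first and the rule's relationships as a fallback, can be sketched as below. The helper names find_taxon and taxa_for_result are hypothetical, and the sketch assumes that toolComponent.guid in a reference carries the guid of the taxonomy, as section 2.4.3 describes.

```python
def find_taxon(run, reference):
    """Resolve a reportingDescriptorReference to (taxonomy name, taxon), or None."""
    component_guid = reference["toolComponent"]["guid"]
    for taxonomy in run.get("taxonomies", []):
        if taxonomy.get("guid") != component_guid:
            continue
        for taxon in taxonomy.get("taxa", []):
            if taxon["id"] == reference["id"]:
                return taxonomy["name"], taxon
    return None


def taxa_for_result(run, result):
    """Use result.taxa when present, otherwise go through the rule's relationships."""
    references = result.get("taxa")
    if not references:
        rules = run["tool"]["driver"].get("rules", [])
        rule = next((r for r in rules if r["id"] == result["ruleId"]), {})
        references = [rel["target"] for rel in rule.get("relationships", [])]
    found = (find_taxon(run, ref) for ref in references)
    # filter(None, ...) drops references that could not be resolved.
    return [f"{name}-{taxon['id']}: {taxon['name']}"
            for name, taxon in filter(None, found)]


CWE_GUID = "A9282C88-F1FE-4A01-8137-E8D2A037AB82"
run = {
    "taxonomies": [{
        "name": "CWE", "guid": CWE_GUID,
        "taxa": [{"id": "401", "guid": "10F28368-3A92-4396-A318-75B9743282F6",
                  "name": "Memory Leak"}],
    }],
    "tool": {"driver": {"name": "CodeScanner", "rules": [{
        "id": "CA2101",
        "relationships": [{
            "target": {"id": "401",
                       "toolComponent": {"name": "CWE", "guid": CWE_GUID}},
            "kinds": ["superset"],
        }],
    }]}},
    # This result carries no taxa of its own, so the rule is consulted.
    "results": [{"ruleId": "CA2101",
                 "message": {"text": "Memory allocated in variable 'p' was not released."}}],
}
labels = taxa_for_result(run, run["results"][0])
print(labels)  # ['CWE-401: Memory Leak']
```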

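2.6. Code flow (codeFlow)

Some tools detect problems by simulating the execution of a program, sometimes across multiple threads of execution. SARIF represents the simulated execution as a set of locations, called a code flow. A SARIF code flow contains one or more thread flows, each of which describes a chronologically ordered sequence of code locations on a single thread of execution.

2.6.1. Code flows of a defect (result.codeFlows)

Because a defect may have more than one code flow, the optional result.codeFlows property is an array of codeFlow objects.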

 
   "result": {
      "description": "A result produced by an analysis tool.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        ... ...
        "codeFlows": {
          "description": "An array of 'codeFlow' objects relevant to the result.",
          "type": "array",
          "minItems": 0,
          "uniqueItems": false,
          "default": [],
          "items": {
            "$ref": "#/definitions/codeFlow"
          }
        },
      }
   } 
 

2.6.2. Thread flows of a code flow (codeFlow.threadFlows)

As the definition of codeFlow shows, each code flow consists of an array of thread flows (threadFlows), and threadFlows is required.

 
   "codeFlow": {
      "description": "A set of threadFlows which together describe a pattern of code execution relevant to detecting a result.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        "message": {
          "description": "A message relevant to the code flow.",
          "$ref": "#/definitions/message"
        },

        "threadFlows": {
          "description": "An array of one or more unique threadFlow objects, each of which describes the progress of a program through a thread of execution.",
          "type": "array",
          "minItems": 1,
          "uniqueItems": false,
          "items": {
            "$ref": "#/definitions/threadFlow"
          }
        },
      },

      "required": [ "threadFlows" ]
    }, 
 

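2.6.3. Thread flow (threadFlow) and thread flow location (threadFlowLocation)

In each thread flow (threadFlow), an array of locations (locations) describes the steps of the tool's analysis of the code.

Definition of threadFlow: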

 
   "threadFlow": {
      "description": "Describes a sequence of code locations that specify a path through a single thread of execution such as an operating system or fiber.",
      "type": "object",
      "additionalProperties": false,
      "properties": {

        "id": {
        ...

        "message": {
        ...  

        "initialState": {
        ...

        "immutableState": {
        ...

        "locations": {
          "description": "A temporally ordered array of 'threadFlowLocation' objects, each of which describes a location visited by the tool while producing the result.",
          "type": "array",
          "minItems": 1,
          "uniqueItems": false,
          "items": {
            "$ref": "#/definitions/threadFlowLocation"
          }
        },

        "properties": {
        ...
      },

      "required": [ "locations" ]
    }, 
 

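Definition of threadFlowLocation:
Each element of locations is a threadFlowLocation, which represents one code location visited by the tool. The position itself is given by its location property, which is of type location. A location can contain both physical and logical location information, so a codeFlow can also represent the analysis flow of a binary.

threadFlowLocation also has a state property, which can be used to store the values of variables and expressions or symbol table information, or to describe a state machine.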

 
   "threadFlowLocation": {
      "description": "A location visited by an analysis tool while simulating or monitoring the execution of a program.",
      "additionalProperties": false,
      "type": "object",
      "properties": {

        "index": {
          "description": "The index within the run threadFlowLocations array.",
        ...
 
        "location": {
          "description": "The code location.",
          "$ref": "#/definitions/location"
        },

        "state": {
          "description": "A dictionary, each of whose keys specifies a variable or expression, the associated value of which represents the variable or expression value. For an annotation of kind 'continuation', for example, this dictionary might hold the current assumed values of a set of global variables.",
          "type": "object",
          "additionalProperties": {
            "$ref": "#/definitions/multiformatMessageString"
          }
        },
        ...
      }
    }, 
 

2.6.4. Code flow example

Reference code:

 
# 3-Beyond-basics/bad-eval-with-code-flow.py

print("Hello, world!")
expr = input("Expression> ")
use_input(expr)

def use_input(raw_input):
   print(eval(raw_input)) 
 

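The code above is an example of code injection in Python.

On line 4, the input is assigned to the variable expr;
On line 5, expr is passed as the first argument to the function use_input;
On line 8, print outputs the result, but the argument is first processed by eval(). Because the input reaches eval without any validation, this can introduce a code injection vulnerability.

This analysis can be expressed in the scan result below, which helps users understand how the problem arises.

Scan result: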

 
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "PythonScanner"
        }
      },
      "results": [
        {
          "ruleId": "PY2335",
          "message": {
            "text": "Use of tainted variable 'raw_input' in the insecure function 'eval'."
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                },
                "region": {
                  "startLine": 8
                }
              }
            }
          ],
          "codeFlows": [
            {
              "message": {
                "text": "Tracing the path from user input to insecure usage."
              },
              "threadFlows": [
                {
                  "locations": [
                    {
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 4
                          }
                        },
                        "message": {
                          "text": "The tainted data enters the system here."
                        }
                      },
                      "state": {
                        "expr": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 0
                    },
                    {
                      "location": {
                        "physicalLocation": {
                          "artifactLocation": {
                            "uri": "3-Beyond-basics/bad-eval-with-code-flow.py"
                          },
                          "region": {
                            "startLine": 8
                          }
                        },
                        "message": {
                          "text": "The tainted data is used insecurely here."
                        }
                      },
                      "state": {
                        "raw_input": {
                          "text": "42"
                        }
                      },
                      "nestingLevel": 1
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
} 
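
To show how a consumer might present such a thread flow, here is a minimal sketch that turns it into indented text, one line per step. The function render_thread_flow is hypothetical. It assumes every step has a physicalLocation with a startLine, and it accepts the step message either under location.message or directly on the step.

```python
def render_thread_flow(thread_flow):
    """Return one text line per threadFlowLocation, indented by nestingLevel."""
    lines = []
    for step in thread_flow["locations"]:
        location = step["location"]
        physical = location["physicalLocation"]
        uri = physical["artifactLocation"]["uri"]
        line_no = physical["region"]["startLine"]
        # Accept the message in either position.
        message = location.get("message") or step.get("message") or {}
        state = ", ".join(f"{name}={value['text']}"
                          for name, value in step.get("state", {}).items())
        indent = "  " * step.get("nestingLevel", 0)
        lines.append(f"{indent}{uri}:{line_no} [{state}] {message.get('text', '')}")
    return lines


URI = "3-Beyond-basics/bad-eval-with-code-flow.py"
thread_flow = {"locations": [
    {"location": {"physicalLocation": {"artifactLocation": {"uri": URI},
                                       "region": {"startLine": 4}},
                  "message": {"text": "The tainted data enters the system here."}},
     "state": {"expr": {"text": "42"}},
     "nestingLevel": 0},
    {"location": {"physicalLocation": {"artifactLocation": {"uri": URI},
                                       "region": {"startLine": 8}},
                  "message": {"text": "The tainted data is used insecurely here."}},
     "state": {"raw_input": {"text": "42"}},
     "nestingLevel": 1},
]}

lines = render_thread_flow(thread_flow)
for line in lines:
    print(line)
```

The second step is indented one level deeper because its nestingLevel is 1, which reflects the call into use_input.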
 

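This is only a simple example. With SARIF's codeFlow, more complex analyses can be represented, helping users understand a problem better and decide on and make a fix quickly.

2.7. Defect fingerprints (fingerprint)

In large software projects, an analysis tool can produce thousands of results in a single run. To handle that many results, defect management needs to record the existing defects as a scan baseline and then work through them. In later scans, the new results are compared with the baseline to determine whether new problems have been introduced. To decide whether a result from a later run is logically the same as a result in the baseline, an algorithm must build a stable identifier from information specific to the result. This identifier is called a fingerprint. Because it captures the characteristics that distinguish one defect from others, it is also called the defect fingerprint.

A defect fingerprint should contain defect information that is relatively stable:

the name of the tool that produced the result;
the rule identifier;
the file system path of the analysis target. This should be a path relative to the project, without the prefix that says where the project is stored, because that location can differ from machine to machine;
defect feature values (partialFingerprints).

Each SARIF scan result (result) provides a group of properties for storing defect fingerprints, so that a defect management system can use them to identify a defect uniquely.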

 
   "result": {
      "description": "A result produced by an analysis tool.",
      "additionalProperties": false,
      "type": "object",
      "properties": {
        ... ...
        "guid": {
          "description": "A stable, unique identifier for the result in the form of a GUID.",
          "type": "string",
          "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
        },

        "correlationGuid": {
          "description": "A stable, unique identifier for the equivalence class of logically identical results to which this result belongs, in the form of a GUID.",
          "type": "string",
          "pattern": "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$"
        },

        "occurrenceCount": {
          "description": "A positive integer specifying the number of times this logically unique result was observed in this run.",
          "type": "integer",
          "minimum": 1
        },

        "partialFingerprints": {
          "description": "A set of strings that contribute to the stable, unique identity of the result.",
          "type": "object",
          "additionalProperties": {
            "type": "string"
          }
        },

        "fingerprints": {
          "description": "A set of strings each of which individually defines a stable, unique identity for the result.",
          "type": "object",
          "additionalProperties": {
            "type": "string"
          }
        },
        ... ...
      }
    } 
 

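In some cases the inherent characteristics of a defect alone are not enough to identify a result uniquely. Additional property values that are strongly tied to the defect then need to be included in the fingerprint calculation so that the resulting fingerprint is unique. This is somewhat like a salt in a cryptographic algorithm, except that the value must be reproducible: the next scan must derive the same input for the same defect and therefore the same fingerprint as before. For example, if a tool checks documents for sensitive words and reports "xxx should not be used in the document.", the word itself can serve as a feature value of the defect.

SARIF provides the partialFingerprints property to hold such feature values so that analysis tools and other components in the SARIF ecosystem can use them. A defect management system can add them to the fingerprint it constructs for each result. In the example above, the tool could set the value of a property in the partialFingerprints object to the forbidden word. The defect management system should include the information in partialFingerprints in its fingerprint calculation.

Only properties that are strongly tied to the characteristics of the defect, and whose values are relatively stable, should be added to partialFingerprints. The line number of the defect, for instance, is not suitable for the fingerprint calculation because it changes often: by the next scan a developer may well have added or removed lines above the problem line, so the same problem is reported on a different line, the computed fingerprint changes, and the comparison shows a difference.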
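
As an illustration of these guidelines, the sketch below builds a fingerprint from the tool name, the rule identifier, the project-relative path, and the partialFingerprints values, and leaves the line number out. It is one possible scheme, not the algorithm of any particular defect management system. The function compute_fingerprint, the rule id DOC0001, the tool name DocScanner, the key forbiddenWord/v1, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def compute_fingerprint(tool_name, result):
    """Hash the stable parts of a result into a hex fingerprint."""
    location = result["locations"][0]["physicalLocation"]
    parts = [
        tool_name,
        result["ruleId"],
        # Assumed to be a project-relative path already.
        location["artifactLocation"]["uri"],
    ]
    # Sort by key so the hash does not depend on property order.
    for key, value in sorted(result.get("partialFingerprints", {}).items()):
        parts.append(f"{key}={value}")
    # The line number is deliberately not part of the input.
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def make_result(word, line):
    """Build a hypothetical 'forbidden word' result for the demonstration."""
    return {
        "ruleId": "DOC0001",
        "locations": [{"physicalLocation": {
            "artifactLocation": {"uri": "docs/guide.md"},
            "region": {"startLine": line}}}],
        "partialFingerprints": {"forbiddenWord/v1": word},
    }


fp_before = compute_fingerprint("DocScanner", make_result("foo", 10))
fp_after = compute_fingerprint("DocScanner", make_result("foo", 14))   # lines were added above
fp_other = compute_fingerprint("DocScanner", make_result("bar", 10))   # a different word
print(fp_before == fp_after, fp_before == fp_other)  # True False
```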

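Although we try to find identifying characteristics for each defect, and add further feature values, it is still hard to design an algorithm that produces a truly stable and unique fingerprint. In the example above, if the same sensitive word occurs several times in one file, the alerts still cannot be told apart. The function name can be added as another input, because function names are relatively stable within a program and help to narrow down where in the file the problem occurs, but the same problem can still occur several times within one function. So however carefully alerts are distinguished, identical defect fingerprints will still occur in real scans.

Fortunately, for practical purposes a fingerprint does not have to be absolutely stable. It only has to be stable enough to reduce the number of results wrongly reported as "new" to a level at which the development team can manage them without undue effort.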
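
A baseline comparison that tolerates identical fingerprints can be sketched by comparing counts instead of sets. This is an illustrative approach, assuming each scan has already been reduced to a list of fingerprint strings; the function new_fingerprints and the short sample values are hypothetical.

```python
from collections import Counter


def new_fingerprints(baseline, current):
    """Return fingerprints that occur more often in current than in baseline."""
    # Counter subtraction keeps only positive differences.
    extra = Counter(current) - Counter(baseline)
    return sorted(extra.elements())


baseline = ["aaa", "aaa", "bbb"]
current = ["aaa", "aaa", "aaa", "bbb", "ccc"]
print(new_fingerprints(baseline, current))  # ['aaa', 'ccc']
```

With a plain set difference, the third occurrence of "aaa" would go unnoticed. Counting occurrences reports it as one additional result even though it cannot be told apart from the other two.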

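3. Summary

SARIF defines a common, standard output format for static scanning tools that can meet the various reporting requirements of such tools;
When static scanning tools are integrated into a DevSecOps platform, SARIF reduces the cost and complexity of aggregating scan results into a common workflow;
SARIF also makes it possible for IDEs to integrate the results of different scanners through a single defect-handling module that covers displaying and fixing defects in the IDE. Tool vendors can then focus on finding problems and spend less effort adapting to each IDE;
SARIF has become an OASIS standard and is supported in the tools of major static analysis vendors such as Microsoft and GrammaTech. U.S. DHS and U.S. NIST also require SARIF as the report format in some evaluations and competitions for static checking tools;
Although SARIF is currently designed mainly for the results of static scanning tools, its general design has allowed some vendors of dynamic analysis tools to apply it successfully as well.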
