Front-End Monitoring Platform Series: JS SDK (Open Source)

Date: 2020-10-03


Author: cjinhuo. Reproduction without permission is prohibited.

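Traditionally, once a front-end project is released to production, error reports reach developers only through screenshots or verbal descriptions from users, and developers then have to reproduce the error from the scenario the user described. This is extremely inefficient, which is why many open-source and commercial front-end monitoring platforms, some of them excellent, have emerged.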

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;" data-id="heading-7"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block; padding-left: 10px;" class="content">Monitoring platforms commonly used in China</span><span style="display: none;" class="suffix"></span></h2>

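sentry: covers error monitoring, error statistics charts, multi-tag filtering and tag statistics, and alert triggering. The whole workflow is mature, but team projects require a paid plan, and the cost grows with the data volume.

fundebug: besides error monitoring, it can also record the screen, that is, everything the user did in the few seconds before the error occurred. The compressed recording is only a few tens of KB, but the tool is somewhat cumbersome to operate.

webfunny: also provides error monitoring and can handle tens of millions of page views per day. Its extra highlights are remote debugging and performance analysis. It can be deployed privately with docker (free), but its business code is encrypted.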

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;" data-id="heading-7"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block; padding-left: 10px;" class="content">Why not pick one of these three platforms, or another one? Why build our own?</span><span style="display: none;" class="suffix"></span></h2>

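First, sentry and fundebug require a significant financial investment, and although webfunny can be deployed privately with docker, its code is not open source, which limits secondary development.
Second, building our own lets us unify all of the company's SDKs into one, including but not limited to the event-tracking SDK and the performance-monitoring SDK.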

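As the figure above shows, a self-built monitoring platform needs three parts:

App monitoring SDK: collects error information and reports it.
Server: receives the error information, processes and persists the data, then notifies the responsible developers according to the alerting rules.
Visualization platform: reads the relevant error information from the data store and renders it so that problems can be located quickly.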

<h1 style="padding: 0px; font-weight: bold; color: black; font-size: 24px; text-align: center; line-height: 60px; margin-top: 10px; margin-bottom: 10px;">
<span style="font-size: 24px; color: #2db7f5; border-bottom: 2px solid #2db7f5;" class="content">Monitoring SDK</span>
</h1>
<h2 style="margin-top: 30px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;" data-id="heading-7"><span style="display: none;" class="prefix"></span><span style="font-size: 20px; color: #2db7f5; display: inline-block;" class="content">Overall code architecture</span><span style="display: none;" class="suffix"></span></h2>

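The overall architecture uses the publish-subscribe pattern so that features can be added in later iterations. Most of the processing logic lives in the HandleEvents file. The benefit of this design is that inserting a hook or adding a feature only requires adding one more function to the event-handling callback.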
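
To make the publish-subscribe idea concrete, here is a minimal sketch. `triggerHandlers` is the name used in this post; `subscribeEvent` and the `handlers` registry are hypothetical names chosen for illustration, and the real HandleEvents code may look different.

```javascript
// Hypothetical registry: event type -> list of subscriber callbacks.
const handlers = {};

// Subscribe a callback to an event type (assumed helper name).
function subscribeEvent(type, callback) {
  (handlers[type] = handlers[type] || []).push(callback);
}

// Publish an event: call every subscriber for that type. Each call is wrapped
// in try/catch so one failing callback cannot stop the others or break the page.
function triggerHandlers(type, data) {
  (handlers[type] || []).forEach((callback) => {
    try {
      callback(data);
    } catch (err) {
      // A real SDK would probably log this through its own debug channel.
    }
  });
}
```

Adding a hook then amounts to one more `subscribeEvent(type, fn)` call for the same event type.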

<h2 style="margin-top: 30px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;" data-id="heading-7"><span style="display: none;" class="prefix"></span><span style="font-size: 20px; color: #2db7f5; display: inline-block; padding-left: 10px;" class="content">Collecting web error information</span><span style="display: none;" class="suffix"></span></h2>

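In general, error information is obtained by overriding native JS APIs. For ajax requests, for example, the request information is intercepted by overriding xhr and fetch. So the first thing to write is a reusable helper that makes overriding a native method easy.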
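
Such a helper could look roughly like the sketch below. The name `replaceOld` and its signature are illustrative assumptions, not necessarily the SDK's actual API; a plain object stands in for a native prototype so the snippet runs anywhere.

```javascript
// Replace source[name] with the function returned by wrap(original).
// Does nothing if the target is missing or is not a function.
function replaceOld(source, name, wrap) {
  if (!source || typeof source[name] !== 'function') return;
  const original = source[name];
  source[name] = wrap(original);
}

// Usage: a plain object stands in for something like XMLHttpRequest.prototype.
const api = { greet(name) { return 'hello ' + name; } };
const calls = [];
replaceOld(api, 'greet', (original) => function (...args) {
  calls.push(args);                   // collect information first
  return original.apply(this, args);  // then keep the original behaviour
});
```

The wrapper always calls the original, so the page behaves the same as before; the SDK only observes.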

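Third-party request libraries are all built on top of xhr or fetch, so overriding these two is enough to capture information about every request, and the status value tells whether a request succeeded. Overriding xhr is a representative example.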
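
A minimal sketch of such an xhr override follows. `mito_xhr` is the field name used in this post, but its shape here is assumed; `patchXhr`, `onRequest` and `sdkUrl` are hypothetical names. The prototype is passed in as a parameter so the function can also be exercised outside a browser.

```javascript
// Patch open/send on an XHR-like prototype.
// onRequest receives the collected info once the request has finished;
// sdkUrl is the SDK's own reporting endpoint, whose requests are skipped.
function patchXhr(proto, onRequest, sdkUrl) {
  const originalOpen = proto.open;
  const originalSend = proto.send;

  proto.open = function (method, url, ...rest) {
    // Remember method and URL on the instance for later use in send().
    this.mito_xhr = { method: String(method).toUpperCase(), url: String(url) };
    return originalOpen.call(this, method, url, ...rest);
  };

  proto.send = function (...args) {
    const info = this.mito_xhr;
    if (info) {
      info.startTime = Date.now();
      // loadend fires after success, error, timeout and abort alike.
      this.addEventListener('loadend', () => {
        if (sdkUrl && info.url.indexOf(sdkUrl) === 0) return; // SDK's own report
        info.status = this.status;
        info.elapsedTime = Date.now() - info.startTime;
        onRequest(info);
      });
    }
    return originalSend.apply(this, args);
  };
}
```

In a browser this would be called as `patchXhr(XMLHttpRequest.prototype, callback, reportUrl)`, with the callback publishing the event.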

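Besides capturing the request information, the xhr override does one more thing: if the request was sent by the SDK itself, its information is not collected. To publish the event, call triggerHandlers(EVENTTYPES.XHR, this.mito_xhr). fetch is overridden in a similar way.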

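On cross-origin failures and timeouts: in both cases the response body and response headers are empty and status equals 0, so the two are hard to tell apart. In practice, however, most requests in a typical project are non-simple requests, so the browser sends an OPTIONS preflight before the actual request. If the request is blocked as cross-origin, the failure usually comes back within a few tens of milliseconds, so that duration is used as the threshold for distinguishing cross-origin failures from timeouts (a request to a nonexistent endpoint will also be classified as cross-origin).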
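
The heuristic above can be sketched as a small classifier. The 100 ms threshold and the label names are assumptions for illustration; the post only says that a blocked preflight returns within "a few tens of milliseconds", so the value should be tuned against real traffic.

```javascript
// Assumed threshold in milliseconds; not a value taken from the SDK.
const CROSS_ORIGIN_THRESHOLD_MS = 100;

// Classify a finished request from its status and how long it took.
function classifyRequest(status, elapsedTime, threshold = CROSS_ORIGIN_THRESHOLD_MS) {
  if (status === 0) {
    // No response at all: a fast failure is treated as cross-origin
    // (or a nonexistent endpoint), a slow one as a timeout.
    return elapsedTime <= threshold ? 'cross-origin' : 'timeout';
  }
  return status >= 400 ? 'http-error' : 'ok';
}
```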

<h3 style="margin-top: 20px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;" data-id="heading-7"><span style="display: none;" class="prefix"></span><span style="font-size: 16px; color: #2db7f5; display: inline-block; padding-left: 10px; border-left: 4px solid #2db7f5;" class="content">JS code errors and resource errors</span><span style="display: none;" class="suffix"></span></h3>

Listen for the error event on window:

window.addEventListener('error',function(e){
  // Extract the error information and publish the event: triggerHandlers
}, true)

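Resource errors

If e.target.localName has a value, the event is a resource error, and the details are extracted in handleErrors.

Code errors

If the check above is false, the event is a code error. The callback exposes the file, line number and other details of the failing code, which can then be resolved with the source-map npm package plus the sourceMap file to recover the real location of the error in the original source.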
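
The branch between resource errors and code errors can be sketched as follows. `parseErrorEvent` and the shape of the returned object are assumptions for illustration; `message`, `filename`, `lineno` and `colno` are standard ErrorEvent fields. Source-map resolution is left out, since it normally runs on the server side.

```javascript
// Turn the event received by the window 'error' listener into a plain record.
function parseErrorEvent(e) {
  const target = e.target;
  if (target && target.localName) {
    // An element such as <img>, <script> or <link> failed to load.
    return {
      type: 'resource',
      tag: target.localName,
      url: target.src || target.href || '',
    };
  }
  // Otherwise the target is window itself: a runtime code error.
  return {
    type: 'code',
    message: e.message,
    file: e.filename,
    line: e.lineno,
    column: e.colno,
  };
}
```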

<h3 style="margin-top: 20px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;" data-id="heading-7"><span style="display: none;" class="prefix"></span><span style="font-size: 16px; color: #2db7f5; display: inline-block; padding-left: 10px; border-left: 4px solid #2db7f5;" class="content">Listening for unhandledrejection</span><span style="display: none;" class="suffix"></span></h3>

When a Promise is rejected and has no rejection handler, the unhandledrejection event is fired.
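
A listener for this event might look like the sketch below. `listenUnhandledRejection` and the record shape are hypothetical; `event.reason` is the standard field carrying the rejection value. The target is a parameter, which in a browser would be `window`.

```javascript
// Register an unhandledrejection listener on the given event target.
function listenUnhandledRejection(target, onRejection) {
  target.addEventListener('unhandledrejection', (event) => {
    const reason = event.reason;
    const isError = reason instanceof Error;
    onRejection({
      type: 'unhandledrejection',
      // The reason can be any value, not only an Error instance.
      message: isError ? reason.message : String(reason),
      stack: isError ? reason.stack : undefined,
    });
  });
}
```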

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block;" class="content">Collecting user behavior information</span><span style="display: none;" class="suffix"></span></h2>

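Collecting error information alone already makes errors faster to locate, but combining it with user behavior is the icing on the cake and raises the efficiency another notch. As the figure below shows, you can clearly see what the user did: which page they entered => which button they clicked => which request was triggered.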

<h3 style="margin-top: 20px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;" data-id="heading-7"><span style="display: none;" class="prefix"></span><span style="font-size: 16px; color: #2db7f5; display: inline-block; padding-left: 10px; border-left: 4px solid #2db7f5;" class="content">DOM event information</span><span style="display: none;" class="suffix"></span></h3>

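DOM events cover many types: click, input, dblclick (double click), and so on. One approach is to listen for the click event directly on window (note that the third argument is true):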

window.addEventListener('click',function(e){
  // Throttle to guard against events firing too quickly
  // Publish the event: triggerHandlers
}, true)
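
The throttling mentioned in the comment can be sketched like this. The helper and its injectable `now` clock are illustrative assumptions; the SDK's own throttle and interval may differ.

```javascript
// Return a wrapper that runs fn at most once per `wait` milliseconds.
// Calls arriving inside the window are dropped, not queued.
// `now` is injectable so the behaviour can be checked without real timers.
function throttle(fn, wait, now = Date.now) {
  let last = -Infinity;
  return function (...args) {
    const current = now();
    if (current - last < wait) return undefined;
    last = current;
    return fn.apply(this, args);
  };
}

// In a browser (the interval is an arbitrary example value):
// window.addEventListener('click', throttle(handleClick, 300), true)
```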

The other approach is to override window.addEventListener and intercept the DOM listeners that developers register.

<h3 style="margin-top: 20px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;" data-id="heading-7"><span style="display: none;" class="prefix"></span><span style="font-size: 16px; color: #2db7f5; display: inline-block; padding-left: 10px; border-left: 4px solid #2db7f5;" class="content">Route change information</span><span style="display: none;" class="suffix"></span></h3>

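Single-page applications have two kinds of route changes: hashchange and history.

history

When the browser supports history mode, routes are changed through two methods, pushState and replaceState, and calling them does not trigger the onpopstate callback. So there are three things to handle: pushState, replaceState, and the popstate event.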
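
A sketch of handling all three is shown below. `patchHistory`, `getUrl` and the `{ from, to }` record are hypothetical names; the history object and event target are passed in so the logic is not tied to the browser globals.

```javascript
// Wrap pushState/replaceState and listen for popstate, reporting every
// change of URL as { from, to }.
function patchHistory(history, target, getUrl, onRouteChange) {
  let lastUrl = getUrl();
  const notify = () => {
    const to = getUrl();
    const from = lastUrl;
    lastUrl = to;
    if (from !== to) onRouteChange({ from, to });
  };

  ['pushState', 'replaceState'].forEach((name) => {
    const original = history[name];
    history[name] = function (...args) {
      const result = original.apply(this, args); // keep the native behaviour
      notify();
      return result;
    };
  });

  // Back/forward navigation does not go through the two methods above.
  target.addEventListener('popstate', notify);
}

// In a browser:
// patchHistory(window.history, window, () => location.href, callback)
```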

hashchange

When the browser only supports hashchange, the hashchange event has to be hooked instead.
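
One simple way to do this is a listener, sketched below, rather than overriding `onhashchange`. `listenHashChange` is a hypothetical name; `oldURL` and `newURL` are standard HashChangeEvent fields.

```javascript
// Report every hash change as { from, to }.
function listenHashChange(target, onRouteChange) {
  target.addEventListener('hashchange', (event) => {
    onRouteChange({ from: event.oldURL, to: event.newURL });
  });
}

// In a browser: listenHashChange(window, callback)
```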

<h3 style="margin-top: 20px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;" data-id="heading-7"><span style="display: none;" class="prefix"></span><span style="font-size: 16px; color: #2db7f5; display: inline-block; padding-left: 10px; border-left: 4px solid #2db7f5;" class="content">console information</span><span style="display: none;" class="suffix"></span></h3>

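Normally a production environment should not contain console calls, so why collect console output? First, in practice production or staging builds may still contain them. Second, the SDK is often run in a test environment for debugging. So console output is collected after all, but an option passed at initialization tells the SDK whether to monitor console.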
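
Wrapping the console methods could look like the following sketch. `patchConsole`, the default level list and the record shape are assumptions; a real SDK would call it only when the initialization option enables console collection.

```javascript
// Wrap the chosen console methods so each call is reported before
// being forwarded to the original method.
function patchConsole(target, onLog, levels = ['log', 'info', 'warn', 'error']) {
  levels.forEach((level) => {
    const original = target[level];
    if (typeof original !== 'function') return;
    target[level] = function (...args) {
      onLog({ level, args });
      return original.apply(target, args); // output still reaches the console
    };
  });
}

// In a browser: patchConsole(window.console, callback)
```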

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block;" class="content">Collecting framework-level errors</span><span style="display: none;" class="suffix"></span></h2>

Vue 2.6 officially provides two callbacks for reporting problems: Vue.config.errorHandler and Vue.config.warnHandler.
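
Hooking these two callbacks might look like the sketch below. The handler signatures `(err, vm, info)` and `(msg, vm, trace)` follow the Vue 2 documentation; `installVueHandlers`, `report` and the record shape are hypothetical. Note that, per the Vue 2 docs, warnHandler only works during development.

```javascript
// Install error and warning handlers on a Vue 2 constructor.
function installVueHandlers(Vue, report) {
  const previous = Vue.config.errorHandler;

  Vue.config.errorHandler = function (err, vm, info) {
    report({
      type: 'vue-error',
      message: err && err.message,
      stack: err && err.stack,
      info, // Vue-specific hint, e.g. which lifecycle hook failed
    });
    // Keep any handler the application had already installed.
    if (typeof previous === 'function') previous.call(this, err, vm, info);
  };

  Vue.config.warnHandler = function (msg, vm, trace) {
    report({ type: 'vue-warn', message: msg, trace });
  };
}
```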

<h3 style="margin-top: 20px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="font-size: 16px; color: #2db7f5; display: inline-block; padding-left: 10px; border-left: 4px solid #2db7f5;" class="content">React</span><span style="display: none;" class="suffix"></span></h3>

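React 16.13 provides the componentDidCatch lifecycle hook for receiving error information, so we can create an ErrorBoundary class that extends React.Component and declare componentDidCatch on it to get the error details. (React error collection is not implemented in the SDK yet; this is only a rough outline based on the official documentation.)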
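
A simplified version could look like this sketch, written without JSX. `createErrorBoundary`, `report` and the `fallback` prop are hypothetical; `componentDidCatch(error, info)` and `getDerivedStateFromError` are documented React class-component APIs. React is passed in as a parameter here; an application would normally import it.

```javascript
// Build an ErrorBoundary class bound to the given React and report function.
function createErrorBoundary(React, report) {
  return class ErrorBoundary extends React.Component {
    constructor(props) {
      super(props);
      this.state = { hasError: false };
    }

    // Switch to the fallback UI on the next render.
    static getDerivedStateFromError() {
      return { hasError: true };
    }

    // Called with the thrown error and the component stack.
    componentDidCatch(error, errorInfo) {
      report({
        type: 'react-error',
        message: error.message,
        stack: error.stack,
        componentStack: errorInfo.componentStack,
      });
    }

    render() {
      if (this.state.hasError) return this.props.fallback || null;
      return this.props.children;
    }
  };
}
```

The application then wraps its root component in this boundary so that render-time errors from the tree below reach `report`.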

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block;" class="content">Custom error reporting</span><span style="display: none;" class="suffix"></span></h2>

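Everything collected so far is web-side code errors, request errors, framework-level errors and the like. There is one more kind: business errors. When the user clicks pay, for example, the server may return 200 while the response body carries an error message, and this kind of error has to be reported manually. Since it is reported manually, the SDK needs to expose a global function for developers to call: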

import MITO from 'mitojs'
MITO.log({
  info: 'Payment failed: insufficient balance',
  tag: 'business'
})

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block;" class="content">Breadcrumb collection</span><span style="display: none;" class="suffix"></span></h2>

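Each of the collectors above ends with a breadcrumb.push(data) call, which records the user's trail of actions. The default length is 20 and can be configured at initialization, but it is best not to exceed 100: too many entries take up too much memory, which is unfriendly to the page.

In each event type's callback, the types are also merged into categories: user clicks and route changes, for example, both count as user behavior. This lets developers filter out irrelevant entries and pinpoint the ones they need.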
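
A bounded breadcrumb stack with category mapping can be sketched as follows. The class layout, the `CATEGORY_BY_TYPE` table and its category names are assumptions for illustration; only `breadcrumb.push(data)` and the default length of 20 come from the text above.

```javascript
// Assumed mapping from raw event type to a coarser category.
const CATEGORY_BY_TYPE = {
  click: 'user',
  route: 'user',
  xhr: 'http',
  fetch: 'http',
  error: 'exception',
  console: 'debug',
};

class Breadcrumb {
  constructor(maxBreadcrumbs = 20) {
    this.maxBreadcrumbs = maxBreadcrumbs;
    this.stack = [];
  }

  // Add an entry, dropping the oldest one when the stack is full.
  push(data) {
    const item = {
      ...data,
      category: CATEGORY_BY_TYPE[data.type] || 'unknown',
      time: Date.now(),
    };
    if (this.stack.length >= this.maxBreadcrumbs) this.stack.shift();
    this.stack.push(item);
  }

  // Return a copy so callers cannot mutate the internal stack.
  getStack() {
    return this.stack.slice();
  }
}
```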

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block;" class="content">Error id generation</span><span style="display: none;" class="suffix"></span></h2>

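Every error event carries a lot of information, and errors with different information should get different ids as far as possible. The approach here is to recursively sort the keys of each error object by a fixed rule, then compute a hashCode over the object's values to obtain an errorId.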
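
One possible reading of this scheme is sketched below: keys are sorted alphabetically, and a 32-bit string hash (the familiar `hash * 31 + charCode` form) is taken over the JSON serialization of the sorted object. Both choices are assumptions; the SDK's actual sort rule and hash function may differ.

```javascript
// Recursively rebuild the value with object keys in alphabetical order,
// so that key order no longer affects the serialized form.
function sortKeys(value) {
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value && typeof value === 'object') {
    return Object.keys(value).sort().reduce((acc, key) => {
      acc[key] = sortKeys(value[key]);
      return acc;
    }, {});
  }
  return value;
}

// 32-bit string hash; `| 0` keeps the running value in int32 range.
function hashCode(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash * 31 + str.charCodeAt(i)) | 0;
  }
  return hash;
}

function createErrorId(errorInfo) {
  return hashCode(JSON.stringify(sortKeys(errorInfo)));
}
```

With this approach two reports with the same fields in a different order map to the same id, which is what makes grouping on the server straightforward.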

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block;" class="content">Reporting error information</span><span style="display: none;" class="suffix"></span></h2>

Once the SDK has all the information about an error, it needs to report it to the server. There are several ways to do so.

<h3 style="margin-top: 20px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="font-size: 16px; color: #2db7f5; display: inline-block; padding-left: 10px; border-left: 4px solid #2db7f5;" class="content">Reporting via xhr</span><span style="display: none;" class="suffix"></span></h3>

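When reporting via xhr, an asynchronous request may be lost if the user navigates to a new page or closes the page, while a synchronous request makes the page stutter.

sentry currently sends via xhr, but before sending it pushes the request into a request buffer it maintains, _buffer, to mitigate the problem of too many concurrent requests.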

<h3 style="margin-top: 20px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="font-size: 16px; color: #2db7f5; display: inline-block; padding-left: 10px; border-left: 4px solid #2db7f5;" class="content">Sending the request as an Image</span><span style="display: none;" class="suffix"></span></h3>

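Characteristics:

No cross-origin problems.
After the GET request is sent, there is no data to fetch or process.
The server does not need to send any data back.
It does not carry the current domain's cookies, does not block page loading or hurt the user experience, and only needs a new Image object.
The image requested is typically a 1x1 GIF, which is the smallest compared with BMP/PNG and saves about 41% / 35% of network resources.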
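
Image-based reporting can be sketched as below. `reportByImage`, the `data` query parameter and the injectable constructor are assumptions made so the snippet also runs outside a browser; in a page the global `Image` is used.

```javascript
// Send data as a GET request by assigning a URL to an image's src.
// Returns the image object, or null when no Image constructor exists.
function reportByImage(url, data, ImageCtor) {
  const Ctor = ImageCtor || (typeof Image !== 'undefined' ? Image : null);
  if (!Ctor) return null;
  const img = new Ctor();
  const separator = url.indexOf('?') === -1 ? '?' : '&';
  // The payload travels in the query string, so it must stay small.
  img.src = url + separator + 'data=' + encodeURIComponent(JSON.stringify(data));
  return img;
}
```

Because everything is carried in the URL, this approach suits small payloads; large error reports may exceed URL length limits.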

<h3 style="margin-top: 20px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="font-size: 16px; color: #2db7f5; display: inline-block; padding-left: 10px; border-left: 4px solid #2db7f5;" class="content">Navigator.sendBeacon</span><span style="display: none;" class="suffix"></span></h3>

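MDN: it can be used to asynchronously transfer a small amount of data over HTTP to a web server. Analytics and diagnostics code typically issues a synchronous XMLHttpRequest in an unload or beforeunload handler to send its data. A synchronous XMLHttpRequest forces the user agent to delay unloading the document and makes the next navigation appear later, and there is nothing the next page can do about this poor loading performance.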

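Characteristics:

The request is asynchronous and uses POST.
The request is placed in a browser task queue, detached from the current page, so it blocks neither the unloading of the current page nor the loading of the next one, which gives a better user experience.
You can only tell whether the request was queued, not whether it was sent successfully.
The Beacon API provides no response callback, so the backend should preferably omit the response body.
Browser compatibility is not great.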
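
Given the compatibility caveat, a common pattern is to try sendBeacon first and fall back to another transport. The sketch below assumes a hypothetical `report` function with an injectable navigator and fallback; `sendBeacon(url, data)` returning a boolean is the standard behaviour.

```javascript
// Try navigator.sendBeacon; use the fallback when it is missing or
// when the browser refuses to queue the request.
// Returns the name of the transport that was used.
function report(url, data, nav, fallback) {
  const body = JSON.stringify(data);
  if (nav && typeof nav.sendBeacon === 'function') {
    // true only means "queued", not "delivered".
    const queued = nav.sendBeacon(url, body);
    if (queued) return 'beacon';
  }
  fallback(url, body);
  return 'fallback';
}

// In a browser: report(reportUrl, payload, navigator, sendByXhr)
```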

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block;" class="content">Unique user identifier</span><span style="display: none;" class="suffix"></span></h2>

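To make it easy to count users, every report carries a unique identifier, trackerId. There are two ways to generate it:

If you report via ajax and the server finds no trackerId field in the cookie, the server generates one and writes it to the client's cookie with Set-Cookie.
Generate it in the SDK: before each report, check whether localStorage contains a trackerId. If it does, send it along with the error information; if not, generate one and store it in localStorage.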
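
The second, SDK-side option can be sketched as follows. `getTrackerId`, `generateId` and the id format are assumptions; the storage object is passed in, which in a browser would be `window.localStorage`.

```javascript
// Illustrative id: timestamp plus a random suffix, both in base 36.
function generateId() {
  return Date.now().toString(36) + '-' + Math.random().toString(36).slice(2, 10);
}

// Return the stored trackerId, creating and persisting one if absent.
function getTrackerId(storage, generate = generateId) {
  let id = storage.getItem('trackerId');
  if (!id) {
    id = generate();
    storage.setItem('trackerId', id);
  }
  return id;
}
```

An id built this way is unique enough for counting users but is not cryptographically random, so it should not be used for anything security-sensitive.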

<h2 style="margin-top: 25px; margin-bottom: 15px; padding: 0px; font-weight: bold; color: black; font-size: 20px;"><span style="display: none;" class="prefix"></span><span style="color: #2db7f5; display: inline-block;" class="content">SDK summary</span><span style="display: none;" class="suffix"></span></h2>

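Subscribe to events => override native APIs => a native API is triggered (the event is published) => get the error information => extract the useful parts => report to the server

The SDK is open source: mitojs. The next article will cover the design of the server-side table structure, how to query error events by multiple tags within milliseconds across tens of millions of rows, and a better alerting mechanism for notifying developers.

If you are interested, follow along; more articles are on the way!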