SLS Machine Learning Best Practices: Log Clustering + Anomaly Alerting

0. Links to Articles in This Series



1. What Tools Do We Have at Hand?

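Our team has long focused on extracting more value from logs. On top of the existing real-time log query, SLS added the following capabilities for DevOps this year:

  • Context query
  • Real-time Tail and intelligent clustering, to make troubleshooting more efficient
  • A range of anomaly detection and forecasting functions for time series data, for smarter checks and predictions
  • Visualization of data analysis results
  • Powerful alert configuration and notification, with follow-up actions triggered by calling a webhook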

 

Today we focus on how intelligent log clustering and anomaly alerting work together to detect anomalies and raise alerts more effectively.

2. Platform Experiment

2.1 Experiment Data

The data is a raw Sys Log dataset with the log clustering service enabled. The screenshot below shows its status:

 

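Adjusting the setting in red box 1 of the screenshot below changes the result shown in red box 2, but the finest-grained patterns do not change. In other words, sub-pattern results are stable and unique, so we can use a sub-pattern's signature to find the corresponding raw log entries.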

 

2.2 Generating a Time Series for the Sub-pattern

Suppose we want to monitor this sub-pattern:

msg:vm-111932.tc su: pam_unix(*:session): session closed for user root
Corresponding signature_id: __log_signature__: 1814836459146662485

With the raw logs for this pattern retrieved, we can look at a histogram of their count along the time axis:

 

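The figure above shows that logs of this pattern are unevenly distributed, and some time windows contain none at all. Counting directly per time window gives the following time series: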

__log_signature__: 1814836459146662485 |  
select 
    date_trunc('minute', __time__) as time, 
    COUNT(*) as num 
from log GROUP BY time order by time ASC limit 10000

 

 

In the figure above, the series is not continuous in time, so we need to fill in the missing points:
__log_signature__: 1814836459146662485 | 
select 
    time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
    avg(num) as num 
from  ( 
    select 
        __time__ - __time__ % 60 as time, 
        COUNT(*) as num 
    from log GROUP BY time order by time desc ) 
GROUP by time order by time ASC limit 10000
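To make the gap-filling step concrete, here is a minimal Python sketch of the same idea outside SLS. It is not the implementation of time_series; the function name fill_minute_counts and the sample timestamps are hypothetical and serve only as an illustration. It buckets events by minute the same way as `__time__ - __time__ % 60`, then emits one point per minute and pads empty minutes with 0.

```python
from collections import Counter


def fill_minute_counts(timestamps, start, end):
    """Count events per minute and pad empty minutes with 0.

    timestamps: iterable of Unix timestamps in seconds.
    start, end: inclusive bounds of the window, in Unix seconds.
    """
    # Same bucketing as `__time__ - __time__ % 60` in the query.
    counts = Counter(ts - ts % 60 for ts in timestamps)
    first = start - start % 60
    last = end - end % 60
    # One point per minute; minutes without any log line get 0.
    return [(minute, counts.get(minute, 0))
            for minute in range(first, last + 60, 60)]


# Hypothetical event times: three in the first minute, one in the fourth.
events = [0, 10, 59, 185]
series = fill_minute_counts(events, 0, 200)
print(series)
```

Padding with 0 rather than dropping the empty minutes matters here because the detection step that follows expects evenly spaced points.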

 

 

2.3 Anomaly Detection on the Time Series

Use the time series anomaly detection function ts_predicate_arma:

__log_signature__: 1814836459146662485 | 
select 
    ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') 
from  ( 
    select 
        time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
        avg(num) as num 
    from  ( 
        select 
            __time__ - __time__ % 60 as time, 
            COUNT(*) as num 
        from log GROUP BY time order by time desc ) 
    GROUP by time order by time ASC ) limit 10000
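For readers who want a feel for what a prediction band does, the sketch below uses a deliberately simpler stand-in: a moving average with a band of k standard deviations. This is not ARMA and not how ts_predicate_arma works internally; the function name band_check, the window size, and the value of k are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def band_check(values, window=5, k=3.0):
    """Return (value, pred, upper, lower, is_anomaly) per point after warm-up.

    The prediction is the mean of the previous `window` points, and the band
    is pred +/- k population standard deviations of those same points.
    """
    rows = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        pred = mean(history)
        spread = k * pstdev(history)
        upper, lower = pred + spread, pred - spread
        is_anomaly = not (lower <= values[i] <= upper)
        rows.append((values[i], pred, upper, lower, is_anomaly))
    return rows


# Hypothetical per-minute counts with a spike at the end.
counts = [2, 3, 2, 3, 2, 3, 40]
rows = band_check(counts)
for row in rows:
    print(row)
```

The output has the same general shape as the columns used later in this article (observed value, prediction, upper bound, lower bound, and an anomaly indicator), which is why the alerting queries compare src against up.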

 

 

2.4 How to Set Up Alerts

  • Unpack the result of the machine learning function
__log_signature__: 1814836459146662485 | 
select 
    t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
from  ( 
    select 
        ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
    from  ( 
        select 
            time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
            avg(num) as num 
        from  ( 
            select 
                __time__ - __time__ % 60 as time, 
                COUNT(*) as num 
            from log GROUP BY time order by time desc ) 
        GROUP by time order by time ASC )) , unnest(res) as t(t1)
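The unpacking step can be pictured as turning a list of six-element rows into named records. The Python sketch below mirrors the column names from the query; the Point type, the unpack helper, and the sample values are hypothetical. Note that the query indexes from 1 (t1[1] is unixtime), while Python indexes from 0.

```python
from typing import NamedTuple


class Point(NamedTuple):
    unixtime: float
    src: float    # observed value
    pred: float   # predicted value
    up: float     # upper bound
    lower: float  # lower bound
    prob: float   # anomaly indicator


def unpack(res):
    """Turn 6-element rows into named records, like unnest(res) as t(t1)."""
    return [Point(*row) for row in res]


# Made-up values, only to show the shape of the data.
res = [
    [1546300800.0, 3.0, 2.8, 4.1, 1.5, 0.0],
    [1546300860.0, 9.0, 3.0, 4.3, 1.7, 1.0],
]
points = unpack(res)
print(points[1].src, points[1].up)
```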

 

 

  • Alert on the results from the last two minutes
__log_signature__: 1814836459146662485 | 
select 
    unixtime, src, pred, up, lower, prob 
from  ( 
    select 
        t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
    from  ( 
        select 
            ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
        from  ( 
            select 
                time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, 
                avg(num) as num 
            from  ( 
                select 
                    __time__ - __time__ % 60 as time, COUNT(*) as num 
                from log GROUP BY time order by time desc ) 
            GROUP by time order by time ASC )) , unnest(res) as t(t1) ) 
    where is_nan(src) = false order by unixtime desc limit 2

 

 

  • Alert on upward spikes, with a fallback strategy
__log_signature__: 1814836459146662485 | 
select 
    sum(prob) as sumProb, max(src) as srcMax, max(up) as upMax 
from ( 
    select 
        unixtime, src, pred, up, lower, prob 
    from  ( 
        select 
            t1[1] as unixtime, t1[2] as src, t1[3] as pred, t1[4] as up, t1[5] as lower, t1[6] as prob 
        from  ( 
            select 
                ts_predicate_arma(to_unixtime(time), num, 5, 1, 1, 1, 'avg') as res 
            from  ( 
                select 
                    time_series(time, '1m', '%Y-%m-%d %H:%i:%s', '0') as time, avg(num) as num 
                from  ( 
                    select 
                        __time__ - __time__ % 60 as time, COUNT(*) as num 
                    from log GROUP BY time order by time desc ) 
                GROUP by time order by time ASC )) , unnest(res) as t(t1) ) 
        where is_nan(src) = false order by unixtime desc limit 2 )
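The query above reduces the last two observed points to three numbers: sumProb, srcMax, and upMax. The Python sketch below shows one way a trigger condition could combine them. The condition itself is an assumption made for illustration, since the actual alert settings are not given in the text; should_alert and hard_limit are hypothetical names.

```python
import math


def should_alert(points, hard_limit):
    """Decide whether to alert from (unixtime, src, pred, up, lower, prob) rows.

    Assumed rule: alert if an anomaly was flagged and the observed value
    exceeds the upper bound, or if the observed value exceeds a fixed
    hard_limit regardless of the model (the fallback).
    """
    # Keep only rows with an observed value, like `is_nan(src) = false`.
    observed = [p for p in points if not math.isnan(p[1])]
    # Latest two rows, like `order by unixtime desc limit 2`.
    latest = sorted(observed, key=lambda p: p[0], reverse=True)[:2]
    if not latest:
        return False
    sum_prob = sum(p[5] for p in latest)
    src_max = max(p[1] for p in latest)
    up_max = max(p[3] for p in latest)
    rising_anomaly = sum_prob > 0 and src_max > up_max
    fallback = src_max > hard_limit
    return rising_anomaly or fallback


# Made-up rows: a spike in the second minute, then a row with no observation.
spiky = [
    (0, 3.0, 2.8, 4.1, 1.5, 0.0),
    (60, 9.0, 3.0, 4.3, 1.7, 1.0),
    (120, float("nan"), 3.1, 4.4, 1.8, 0.0),
]
quiet = [
    (0, 3.0, 2.8, 4.1, 1.5, 0.0),
    (60, 3.5, 3.0, 4.3, 1.7, 0.0),
]
print(should_alert(spiky, hard_limit=100))
print(should_alert(quiet, hard_limit=100))
```

Comparing src against up restricts the alert to upward spikes, and the fixed limit still fires if the model fails to flag a value that is clearly too high.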

 

 

The specific alert settings are as follows:

 


3. Shameless Plug

3.1 Going Further with Logs

Demos of the various Log Service features: Log Service overview, demo logs

 

 


This article is original content from the Yunqi Community and may not be reproduced without permission.
