理解Erlang/OTP Supervisor

http://www.cnblogs.com/me-sa/archive/2012/01/10/erlang0030.htmlhtml

Supervisors are used to build an hierarchical process structure called a supervision tree, a nice way to structure a fault tolerant application.
                                                                                                                                                                                     --Erlang/OTP Doc数据库

     Supervisor的基本思想就是经过创建层级结构实现错误隔离和管理,具体方法是经过重启的方式保持子进程一直活着.若是supervisor是进程树的一部分,它会被它的supervisor自动终止,当它的supervisor让它shutdown的时候,它会按照子进程启动顺序的逆序终止其全部的子进程,最后终止掉本身.重启的目的是让系统回归到一个稳定的状态,回归稳定状态后再出现异常能够进行重试,若是初始化都不稳定,后续的监控-重启策略意义不大.换句话说,Application初始化的阶段要有可靠性的保障,初始化阶段可能读取配置文件或者从数据库加载恢复数据,哪怕执行时间长一点都等待同步执行完.若是application依赖非本地数据库或外部服务就能够采起更快的异步启动,由于这种服务在正常使用过程当中也常常出情况,早一点仍是晚一点启动没有什么关系.app

   [Erlang 0025]理解Erlang/OTP - Application以log4erl项目为学习了Erlang/OTP application,咱们说到application在start的方法中启动了log4erl的顶层监控树.今天咱们继续跟进,看log4erl的监控树是怎么构建起来的,并作实验看supervisor如何经过重启恢复服务的.使用application:start(log4erl).启动起来以后的进程树:异步

下面是log4erl_sup文件的start_link方法,supervisor:start_link方法的执行是同步的,直到全部的子进程都启动了才会返回. supervisor:start_link会使用回调函数init/1.ide

复制代码
start_link(Default_logger) ->
R = supervisor:start_link({local, ?MODULE}, ?MODULE, []),
%log4erl:start_link(Default_logger),
add_logger(Default_logger),
?LOG2("Result in supervisor is ~p~n",[R]),
R.

%%回调的方法init/1
init([]) ->
{ok,
{
{one_for_one,3,10},
[]
}
}.
复制代码
  log4erl的顶层监控树的初始化至关简单仅仅定义了重启策略(RestartStrategy)和最大重启频率(maximum restart frequency):{one_for_one,3,10}.
 {one_for_one,3,10}表达的语义是{How, Max, Within}:在多长时间内(Within)重启了几回(Max),如何重启(HOW 重启策略);设计最大重启频率是为了不反复重启进入死循环,一旦超出了此阈值,supervisor进程会结束掉本身以及它全部的子进程,并经过进程树传递退出消息,更上层的supervisor就会采起适当的措施,要么重启终止的supervisor要么本身也终止掉.可能比较纠结这几个值怎么配置,多数资料上都会告诉你"如何配置彻底取决于你的应用程序".这个仍是有经验值的,生成环境的经验值是一小时内重启4次,也能够参考一些和你应用相似的开源项目看看它们是怎么配置的.若是填写的是{one_for_one,0,1}就是不容许重启,下面的示例中能够看到YAWS项目采用了这样的策略.
 
下面几个开源项目顶层supervisor的init方法:
复制代码
%%rabbit_sup.erl  来自大名鼎鼎的rabbitmq
init([]) ->
{ok, {{one_for_all, 0, 1}, []}}.


%%yaws_sup.erl Yaws项目 - Yet Another Web Server
init([]) ->

ChildSpecs = child_specs(),
%% 0, 1 means that we never want supervisor restarts
{ok,{{one_for_all, 0, 1}, ChildSpecs}}.


%%ejabberd_sup ejabberd项目
init([]) ->
Hooks =
{ejabberd_hooks,
{ejabberd_hooks, start_link, []},
%%......................... 省略代码
{ok, {{one_for_one, 10, 1},
[Hooks,
GlobalRouter,
Cluster,
..................
Listener]}}.
复制代码
 
重启策略
 
 one_for_one : 把子进程当成各自独立的,一个进程出现问题其它进程不会受到崩溃的进程的影响.该子进程死掉,只有这个进程会被重启
 one_for_all : 若是子进程终止,全部其它子进程也都会被终止,而后全部进程都会被重启.
 rest_for_one:若是一个子进程终止,在这个进程启动以后启动的进程都会被终止掉.而后终止掉的进程和连带关闭的进程都会被重启.
 simple_one_for_one 是one_for_one的简化版 ,全部子进程都动态添加同一种进程的实例

 one-for-one维护了一个按照启动顺序排序的子进程列表,而simple_one_for_one 因为全部的子进程都是一样的(相同的MFA),使用的是字典来维护子进程信息;
Note: one of the big differences between one_for_one and simple_one_for_one is that one_for_one holds a list of all the children it has (and had, if you don't clear it), started in order, while simple_one_for_one holds a single definition for all its children and works using a dict to hold its data. Basically, when a process crashes, the simple_one_for_one supervisor will be much faster when you have a large number of children.
 
Note: it is important to note that  simple_one_for_one  children are  not respecting this rule with the  Shutdown time. In the case of  simple_one_for_one, the supervisor will just exit and it will be left to each of the workers to terminate on their own, after their supervisor is gone.
 
For the most part, writing a  simple_one_for_one supervisor is similar to writing any other type of supervisor, except for one thing. The argument list in the  {M,F,A} tuple is not the whole thing, but is going to be appended to what you call it with when you do supervisor:start_child(Sup, Args). That's right,  supervisor:start_child/2 changes API. So instead of doing  supervisor:start_child(Sup, Spec), which would call  erlang:apply(M,F,A), we now have supervisor:start_child(Sup, Args), which calls  erlang:apply(M,F,Args++A).

在log4erl_sup.erl的start_link中启动了顶层supervisor以后,添加了一个默认的logger: add_logger(Default_logger),
复制代码
add_logger(Name) when is_atom(Name) ->
N = atom_to_list(Name),
add_logger(N);
add_logger(Name) when is_list(Name) ->
C1 = {Name,
{log_manager, start_link ,[Name]},
permanent,
10000,
worker,
[log_manager]},

?LOG2("Adding ~p to ~p~n",[C1, ?MODULE]),
supervisor:start_child(?MODULE, C1).
复制代码
添加的logger是log4erl_sup的子进程,子进程启动和监控的方式经过child specification来指定.
 C1 =  {Name,{log_manager, start_link ,[Name]},permanent,10000,worker,[log_manager]}
 
C1的六个数据项分别为: {ID, StartEntery, Restart, Shutdown, Type, Modules}:
ID :supervisor 用来在内部区分specification的,因此只要子进程规格说明之间不重复就能够.
Start : 启动参数{M,F,A}
Restart : 这个进程遇到错误以后是否重启
                permanent:遇到任何错误致使进程终止就会重启
                temporary:进程永远都不会被重启
                transient: 只有进程异常终止的时候会被重启
Shutdown 进程如何被干掉,这里是使用整型值2000的意思是,进程在被强制干掉以前有2000毫秒的时间料理后事自行终止.
              实际过程是supervisor给子进程发送一个exit(Pid,shutdown)而后等待exit信号返回,在指定时间没有返回则将子进程使用exit(Child,kill)
             这里的参数还有 brutal_kill 意思是进程立刻就会被干掉
             infinity :当一个子进程是supervisor那么就要用infinity,意思是给supervisor足够的时间进行重启.
Type 这里只有两个值:supervisor worker ; 只要没有实现supervisor behavior的进程都是worker;
                    能够经过supervisor的层级结构来精细化对进程的控制.这个值主要做用是告知监控进程它的子进程是supervisor仍是worker
Modules 是进程依赖的模块,这个信息只有在代码热更新的时候才会被用到:标注了哪些模块须要按照什么顺序进行更新;一般这里只须要列出进程依赖的主模块. 若是子进程是supervisor gen_server gen_fsm Module名是回调模块的名称,这时Modules的值是只有一个元素的列表,元素就是回调模块的名称;若是子进程是gen_event Modules的值是 dynamic;关于dynamic参数余锋有一篇专门的分析:Erlang supervisor规格的dynamic行为分析  http://blog.yufeng.info/archives/1455
 
Modules is a list of one element, the name of the callback module used by the child behavior. The exception to that is when you have callback modules whose identity you do not know beforehand (such as event handlers in an event manager). In this case, the value of  Modules should be  dynamic so that the whole OTP system knows who to contact when using more advanced features, such as  releases.
 
 
实际应用中log4erl中的logger会根据业务逻辑添加多个,咱们也不是直接经过application:start(log4erl).而是调用  log4erl:conf(log4erl.conf)这个方法简单的封装了内部逻辑,实际调用的是 log4erl_conf:conf(File).咱们这里定义一个简单的log4erl.conf文件,使用log4erl:conf(log4erl.conf).启动以后,咱们看看它的进程树是什么样的:
%%log4erl.conf文件 内容我作了简单的缩排
%%mod
logger default_logger{
     file_appender default_app{
    dir = "./log", level = debug, file = default_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
     }
}

%%mail mod
logger mail_logger{
     file_appender mail_app{
     dir = "./log", level = debug, file = mail_log, type = size, max = 1000000, suffix = log, rotation = 50, format = ' %d %h:%m:%s.%i %l%n'
     }
}
对应的进程树是这样的,进程之间的红线表示link关系:

咱们沿着调用关系,逐步跟进代码:函数

复制代码
%==== File : log4erl_conf =======

%%log4erl_conf:conf(File).
conf(File) ->
application:start(log4erl), %%启动log4erl
Tree = parse(leex(File)), %%解析配置文件
traverse(Tree). %%遍历配置项构造监控树

%%跟进遍历的逻辑,对于每一条配置执行的是element/1方法
traverse([]) ->
ok;
traverse([H|Tree]) ->
element(H),
traverse(Tree).

%%对于咱们自定义的logger走的是{logger, Logger, Appenders}逻辑
element({cutoff_level, CutoffLevel}) ->
log_filter_codegen:set_cutoff_level(CutoffLevel);
element({default_logger, Appenders}) ->
appenders(Appenders);
element({logger, Logger, Appenders}) ->
log4erl:add_logger(Logger),
appenders(Logger, Appenders).

%==== File : log4erl =======
%%继续跟进咱们走到log4erl:add_logger/1
add_logger(Logger) ->
try_msg({add_logger, Logger}).

%%try_msg 是的添加了异常捕获的通用方法
try_msg(Msg) ->
try
handle_call(Msg)
catch
exit:{noproc, _M} ->
io:format("log4erl has not been initialized yet. To do so, please run~n"),
io:format("> application:start(log4erl).~n"),
{error, log4erl_not_started};
E:M ->
?LOG2("Error message received by log4erl is ~p:~p~n",[E, M]),
{E, M}
end.

%%handle_call的代码片断
handle_call({add_logger, Logger}) ->
log_manager:add_logger(Logger);

%==== File : log_manager =======
%%逻辑转到log_manager的add_logger(Logger)
%%最终调用的是log4erl_sup:add_logger(Logger).这个咱们上面已经分析过了
add_logger(Logger) ->
log4erl_sup:add_logger(Logger).

%%element方法在添加loger以后会添加appender
appenders([]) ->
ok;
appenders([H|Apps]) ->
appender(H),
appenders(Apps).

appenders(_, []) ->
ok;
appenders(Logger, [H|Apps]) ->
appender(Logger, H),
appenders(Logger, Apps).

appender({appender, App, Name, Conf}) ->
log4erl:add_appender({App, Name}, {conf, Conf}).

appender(Logger, {appender, App, Name, Conf}) ->
log4erl:add_appender(Logger, {App, Name}, {conf, Conf}).


%==== File : log4erl =======
%% Appender = {Appender, Name}
add_appender(Logger, Appender, Conf) ->
try_msg({add_appender, Logger, Appender, Conf}).

handle_call({add_appender, Logger, Appender, Conf}) ->
log_manager:add_appender(Logger, Appender, Conf);

%==== File : log_manager =======
add_appender(Logger, {Appender, Name} , Conf) ->
?LOG2("add_appender ~p with name ~p to ~p with Conf ~p ~n",[Appender, Name, Logger, Conf]),
log4erl_sup:add_guard(Logger, Appender, Name, Conf).

%==== File : log4erl_sup =======
add_guard(Logger, Appender, Name, Conf) ->
C = {Name,
{logger_guard, start_link ,[Logger, Appender, Name, Conf]},
permanent,
10000,
worker,
[logger_guard]},
?LOG2("Adding ~p to ~p~n",[C, ?MODULE]),
supervisor:start_child(?MODULE, C).

%==== File : logger_guard =======
start_link(Logger, Appender, Name, Conf) ->
%?LOG2("starting guard for logger ~p~n",[Logger]),
{ok, Pid} = gen_server:start_link(?MODULE, [Appender, Name], []),
case add_sup_handler(Pid, Logger, Conf) of
{error, E} ->
gen_server:call(Pid, stop),
{error, E};
_R ->
{ok, Pid}
end.

add_sup_handler(G_pid, Logger, Conf) ->
?LOG("add_sup()~n"),
gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).

handle_call({add_sup_handler, Logger, Conf}, _From, [{appender, Appender, Name}] = State) ->
?LOG2("Adding handler ~p with name ~p for ~p From ~p~n",[Appender, Name, Logger, _From]),
try
Res = gen_event:add_sup_handler(Logger, {Appender, Name}, Conf),
{reply, Res, State}
catch
E:R ->
{reply, {error, {E,R}}, State}
end;
复制代码

gen_event:add_sup_handler会创建EventManager与Event Handler之间的link的关系,因此咱们修改一下,注释掉这段,看看监控树是什么样子:学习

add_sup_handler(G_pid, Logger, Conf) ->

%    ?LOG("add_sup()~n"),
%    gen_server:call(G_pid, {add_sup_handler, Logger, Conf}).
  ok.ui

注释掉以后能够看到logger和guard之间的link关系就不存在了.

 
kill进程的实验
 
后面咱们会用各类状况杀掉进程,看这个进程树对异常的处理状况;咱们的实验步骤:
1.发送退出消息Reason:some_reason给default_logger
2.发送退出消息Reason:kill 给default_logger
3.发送退出消息Reason:some_reason给logger_guard
4.发送退出消息Reason:some_reason给log4erl_sup
5.发送退出消息Reason:kill 给log4erl_sup
复制代码
3> whereis(default_logger).
<0.45.0>
4> exit(whereis(default_logger),some_reason).
true
5> whereis(default_logger).
<0.45.0>
6> exit(whereis(default_logger),some_reason). %%因为gen_event默认process_flag(trap_exit, true),因此some_reason的退出消息并无把它干掉
true
7> whereis(default_logger).
<0.45.0>
8> exit(whereis(default_logger),kill). %%向进程发送强制退出消息,
true

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === %首先可以看到log4erl报出的子进程终止的报告
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.45.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === %log4erl_sup重建default_logger,新进程pid是<0.69.0>
supervisor: {local,log4erl_sup}
started: [{pid,<0.69.0>},
{name,"default_logger"},
{mfargs,{log_manager,start_link,["default_logger"]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=SUPERVISOR REPORT==== 10-Jan-2012::10:35:21 === %default_logger退出消息转变成为killed继续广播给link的进程,对应的logger_guard终止
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: killed
Offender: [{pid,<0.46.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf, [{dir,"./log"},{level,debug},{file,default_log},{type,size},
{max,1000000},{suffix,log}, {rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

=PROGRESS REPORT==== 10-Jan-2012::10:35:21 === %logger_guard 重建
supervisor: {local,log4erl_sup}
started: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug}, {file,default_log},{type,size},
{max,1000000}, {suffix,log},{rotation,50},
{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]
9> whereis(default_logger).
<0.69.0>
10> is_process_alive(pid(0,70,0)). %这是新启动的logger_guard进程
true
11> exit(pid(0,70,0),some_reason). %向进程发送一个退出消息
true

=SUPERVISOR REPORT==== 10-Jan-2012::11:07:51 ===
Supervisor: {local,log4erl_sup}
Context: child_terminated
Reason: some_reason
Offender: [{pid,<0.70.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},
{child_type,worker}]

12>
=PROGRESS REPORT==== 10-Jan-2012::11:07:51 ===
supervisor: {local,log4erl_sup}
started: [{pid,<0.76.0>},
{name,default_app},
{mfargs,
{logger_guard,start_link,
[default_logger,file_appender,default_app,
{conf,
[{dir,"./log"},{level,debug},{file,default_log},{type,size},{max,1000000},
{suffix,log},{rotation,50},{format," %d %h:%m:%s.%i %l%n"}]}]}},
{restart_type,permanent},
{shutdown,10000},{child_type,worker}]
12> is_process_alive(pid(0,70,0)).
false
13> whereis(default_logger). %退出消息广播对default_logger没有影响
<0.69.0>
14> whereis(log4erl_sup).
<0.44.0>
15> exit(whereis(log4erl_sup),some_reason). % Supervisor 初始化的时候也会设置 process_flag(trap_exit, true),
true
16> whereis(log4erl_sup).
<0.44.0>
17> exit(whereis(log4erl_sup),kill). %杀掉log4erl_sup 应用程序中止
true

=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.69.0>
registered_name: default_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.76.0>,killed}]
links: [#Port<0.1891>,#Port<0.1885>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 720
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:23 ===
crasher:
initial call: gen_event:init_it/6
pid: <0.47.0>
registered_name: mail_logger
exception exit: killed
in function gen_event:terminate_server/4
ancestors: [log4erl_sup,<0.43.0>]
messages: [{'EXIT',<0.48.0>,killed}]
links: [#Port<0.546>]
dictionary: []
trap_exit: true
status: running
heap_size: 377
stack_size: 24
reductions: 411
neighbours:
18>
=CRASH REPORT==== 10-Jan-2012::13:26:25 ===
crasher:
initial call: application_master:init/4
pid: <0.42.0>
registered_name: []
exception exit: killed
in function application_master:terminate/2
ancestors: [<0.41.0>]
messages: []
links: [<0.6.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 610
stack_size: 24
reductions: 1555
neighbours:
18>
=INFO REPORT==== 10-Jan-2012::13:26:25 ===
application: log4erl
exited: killed
type: temporary
18>
复制代码


 最后再贴一次log4erl项目的地址: http://code.google.com/p/log4erl/,建议下载下来代码本身动手作一下上面的实验.this

相关文章
相关标签/搜索