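Because of an operator mistake, I scheduled a job in erlcron that was more than 3 months in the future. From the next day on, the daily reset stopped running, and several other scheduled jobs stopped as well. I was stumped: why would they suddenly stop for no apparent reason?
Going through the logs, I found the following errors: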
2016-03-22 16:54:32.014 [error] gen_server ecrn_control terminated with reason: no case clause matching {ok,[<0.14123.1577>,<0.13079.1576>,<0.25254.1569>,<0.13402.1577>,...]} in ecrn_control:internal_cancel/1 line 111
2016-03-22 16:54:32.015 [error] CRASH REPORT Process ecrn_control with 0 neighbours exited with reason: no case clause matching {ok,[<0.14123.1577>,<0.13079.1576>,<0.25254.1569>,<0.13402.1577>,...]} in ecrn_control:internal_cancel/1 line 111 in gen_server:terminate/6 line 744
Wow, ecrn_control itself had crashed. What happened?
Here is the code that failed:
internal_cancel(AlarmRef) ->
    case ecrn_reg:get(AlarmRef) of
        undefined ->
            undefined;
        {ok, [Pid]} ->
            ecrn_agent:cancel(Pid)
    end.
The call to ecrn_reg:get(AlarmRef) returned {ok, List}, and the list held far more than one Pid. Evidently, scheduling that 3-months-plus job had left stale entries registered in ecrn_reg.
To reproduce the problem, first schedule three normal jobs:
erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).
erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).
erlcron:cron({{once, 1000}, {io, fwrite, ["Hello, world!~n"]}}).
Opening observer:start() shows the following process tree:
Next, schedule a job 4294968 seconds in the future:
erlcron:cron({{once, 4294968}, {io, fwrite, ["Hello, world!~n"]}}).
And that is game over. Look at the flood of crash reports:
22:49:16.818 [error] CRASH REPORT Process <0.5822.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.818 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5822.64> exit with reason timeout_value in context child_terminated
22:49:16.819 [error] CRASH REPORT Process <0.5701.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5701.64> exit with reason timeout_value in context child_terminated
22:49:16.821 [error] CRASH REPORT Process <0.6237.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.6237.64> exit with reason timeout_value in context child_terminated
22:49:16.821 [error] CRASH REPORT Process <0.5862.64> with 0 neighbours crashed with reason: timeout_value in gen_server:loop/6 line 358
22:49:16.821 [error] Supervisor ecrn_cron_sup had child ecrn_agent started with ecrn_agent:start_link(#Ref<0.0.11.11209>, {{once,4294968},{io,fwrite,["Hello, world!~n"]}}) at <0.5862.64> exit with reason timeout_value in context child_terminated
... (25 such entries in total)
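Now check the process count again:
Whoa, why are the original ecrn_agent processes gone too?
After erlcron tried 25 times to start this job, that is, after ecrn_agent crashed 25 times, the ecrn_agent processes for the three healthy jobs had disappeared as well. In other words, not only did my new job fail to register, the jobs that had been running fine were lost too.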
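Look at the Pids in the crash log again: every one of them is different. From that we can infer that the multiple Pids returned by ecrn_reg:get(AlarmRef) in the original error are the 25 Pids created by the failed attempts to start this job. In other words, although the ecrn_agent processes crashed, ecrn_reg still kept their Pids. So when these jobs were later cancelled, the value returned by ecrn_reg:get(AlarmRef) matched no clause in internal_cancel(AlarmRef).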
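The mismatch described above can be illustrated with a small sketch of a more tolerant cancel routine. This is not erlcron's code: the module name cancel_sketch and the function cancel_all/2 are hypothetical, the first argument only mimics the shape of what ecrn_reg:get/1 returned in the logs, and CancelFun is a stand-in for ecrn_agent:cancel/1.

```erlang
-module(cancel_sketch).
-export([cancel_all/2]).

%% LookupResult mimics the shapes seen for ecrn_reg:get/1 in this post:
%% undefined, or {ok, ListOfPids} where the list may hold stale Pids.
%% CancelFun is a hypothetical stand-in for ecrn_agent:cancel/1.
cancel_all(undefined, _CancelFun) ->
    undefined;
cancel_all({ok, Pids}, CancelFun) when is_list(Pids) ->
    %% Keep only local Pids that are still alive; dead entries are skipped.
    Alive = [P || P <- Pids,
                  is_pid(P),
                  node(P) =:= node(),
                  is_process_alive(P)],
    lists:foreach(CancelFun, Alive),
    {ok, length(Alive)}.
```

Accepting a list of any length means a registry holding leftover Pids no longer crashes the caller with a case clause error, although it does not remove the stale entries themselves.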
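Why does a job scheduled 4294968 seconds out crash? Many people will recognize the number: 2^32 = 4294967296, and 4294968 seconds is 4294968000 milliseconds, which is just over 2^32. That is, erlcron cannot handle a job scheduled 2^32 milliseconds (about 49.7 days) or more into the future.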
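The arithmetic can be checked with a few lines of Erlang. This is only an illustrative sketch: the module timeout_math and its function names are made up for this post, and the limit 16#FFFFFFFF is the maximum receive..after timeout given in the Erlang reference manual.

```erlang
-module(timeout_math).
-export([max_timeout/0, millis/1, fits/1]).

%% Largest timeout accepted by receive..after (2^32 - 1 milliseconds).
max_timeout() -> 16#FFFFFFFF.

%% Convert a delay in seconds to milliseconds.
millis(Seconds) -> Seconds * 1000.

%% true when a delay given in seconds fits in a receive..after timeout.
fits(Seconds) -> millis(Seconds) =< max_timeout().
```

With these definitions, timeout_math:fits(4294967) is true while timeout_math:fits(4294968) is false, which matches the job that triggered the crash.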
Looking at the source of gen_server:loop, here is the code that triggers the crash:
loop(Parent, Name, State, Mod, hibernate, Debug) ->
    proc_lib:hibernate(?MODULE,wake_hib,[Parent, Name, State, Mod, Debug]);
loop(Parent, Name, State, Mod, Time, Debug) ->
    Msg = receive
              Input ->
                  Input
          after Time ->
                  timeout
          end,
    decode_msg(Msg, Parent, Name, State, Mod, Time, Debug, false).
Line 358, where the crash happens, is a receive expression. In other words, receive does not accept a timeout of 2^32 or larger.
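I tested this myself: if the value after `after` in a receive is greater than or equal to 2^32, you do get a "bad receive timeout value" error. The official documentation states explicitly that the value must fit in 32 bits.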
ExprT is to evaluate to an integer. The highest allowed value is 16#FFFFFFFF, that is, the value must fit in 32 bits. receive..after works exactly as receive, except that if no matching message has arrived within ExprT milliseconds, then BodyT is evaluated instead. The return value of BodyT then becomes the return value of the receive..after expression.
Quoted from: http://erlang.org/doc/reference_manual/expressions.html
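The limit is easy to observe in isolation. The following is a minimal sketch, not taken from erlcron or OTP: the module receive_limit and the function try_timeout/1 are names invented here. It runs a receive..after with the given timeout and turns the timeout_value error into a tagged tuple.

```erlang
-module(receive_limit).
-export([try_timeout/1]).

%% Run receive..after with the given timeout in milliseconds.
%% Returns ok once the timeout elapses, or {error, timeout_value}
%% when the runtime rejects the value. The receive has no message
%% clauses, so the caller's mailbox is left untouched.
%% Note: a valid but large value really does block for that long.
try_timeout(Millis) ->
    try
        receive
        after Millis ->
            ok
        end
    catch
        error:timeout_value ->
            {error, timeout_value}
    end.
```

Calling receive_limit:try_timeout(0) returns ok immediately, while receive_limit:try_timeout(16#FFFFFFFF + 1) returns {error, timeout_value}, the same reason that appears in the gen_server crash reports.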
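Back to erlcron. During ecrn_agent:start_link, ecrn_agent:init calls ecrn_reg:register(JobRef, self()) and then returns {ok, NewState, Millis} to gen_server. If Millis is 2^32 or more, gen_server:loop exits abnormally with timeout_value. Because the registration has already happened by then, the Pid of the dead process stays behind in ecrn_reg.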
%% @private
init([JobRef, Job]) ->
    State = #state{job=Job,
                   alarm_ref=JobRef},
    {DateTime, Actual} = ecrn_control:datetime(),
    NewState = set_internal_time(State, DateTime, Actual),
    case until_next_milliseconds(NewState, Job) of
        {ok, Millis} when is_integer(Millis) ->
            ecrn_reg:register(JobRef, self()),
            {ok, NewState, Millis};
        {error, _} ->
            {stop, normal}
    end.
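One way a long delay could be kept within the limit is to wait in chunks. The sketch below is an assumption for illustration only, not how erlcron handles it: the module chunked_wait and the function next_wait/1 are hypothetical. It splits a remaining delay into one legal timeout plus whatever is left over.

```erlang
-module(chunked_wait).
-export([next_wait/1]).

%% Largest timeout accepted by receive..after.
-define(MAX_TIMEOUT, 16#FFFFFFFF).

%% Split a remaining delay (in milliseconds) into a timeout that
%% receive..after accepts and the remainder still to wait afterwards.
next_wait(RemainingMillis)
  when is_integer(RemainingMillis), RemainingMillis >= 0 ->
    Wait = min(RemainingMillis, ?MAX_TIMEOUT),
    {Wait, RemainingMillis - Wait}.
```

A gen_server using this idea would return Wait as its timeout, keep the remainder in its state, and on each timeout message either wait again or run the job once the remainder reaches zero. For the delay from this post, chunked_wait:next_wait(4294968000) returns {4294967295, 705}.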
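Falling into this pit was a bit frustrating. Strictly speaking, the root cause is neither erlcron nor gen_server: it is that Erlang's own receive does not support timeouts of 2^32 or more. Digging further down, you find that the layer below is no longer written in Erlang.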
-module(prim_eval).

%% This module is simply a stub which abstract code gets included in the result
%% of compilation of prim_eval.S, to keep Dialyzer happy.

-export(['receive'/2]).

-spec 'receive'(fun((term()) -> nomatch | T), timeout()) -> T.
'receive'(_, _) ->
    erlang:nif_error(stub).
I hope this saves someone else the same trouble.