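A service node was running under a fairly low load (2000 database accesses per second), with a pool of 100 worker processes and max_overflow set to 100. Its performance suddenly dropped, and it could handle only 1500 database accesses per second. Request latency rose from a few ms to several hundred ms, then gradually recovered.
Step by step, the problem was narrowed down to the checkout path of the poolboy process pool used for mongodb: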
    handle_call({checkout, CRef, Block}, {FromPid, _} = From, State) ->
        #state{supervisor = Sup,
               workers = Workers,
               monitors = Monitors,
               overflow = Overflow,
               max_overflow = MaxOverflow} = State,
        case Workers of
            [Pid | Left] ->
                MRef = erlang:monitor(process, FromPid),
                true = ets:insert(Monitors, {Pid, CRef, MRef}),
                {reply, Pid, State#state{workers = Left}};
            [] when MaxOverflow > 0, Overflow < MaxOverflow ->
                {Pid, MRef} = new_worker(Sup, FromPid),
                true = ets:insert(Monitors, {Pid, CRef, MRef}),
                {reply, Pid, State#state{overflow = Overflow + 1}};
            [] when Block =:= false ->
                {reply, full, State};
            [] ->
                MRef = erlang:monitor(process, FromPid),
                Waiting = queue:in({From, CRef, MRef}, State#state.waiting),
                {noreply, State#state{waiting = Waiting}}
        end;
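As the code shows, when max_overflow is nonzero, a momentary overload creates new workers. Each of these workers connects to mongodb, which takes 1-2 ms, and that creation cost is paid inside the master process, blocking it.
On checkin, the overflow worker is destroyed again. Connections are therefore created and destroyed over and over, and all of that work runs in the master process. Every request, not just the overflow ones, ends up waiting behind the master process while it creates and tears down connections, and qps collapses in an avalanche.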
    handle_checkin(Pid, State) ->
        #state{supervisor = Sup,
               waiting = Waiting,
               monitors = Monitors,
               overflow = Overflow,
               strategy = Strategy} = State,
        case queue:out(Waiting) of
            {{value, {From, CRef, MRef}}, Left} ->
                true = ets:insert(Monitors, {Pid, CRef, MRef}),
                gen_server:reply(From, Pid),
                State#state{waiting = Left};
            {empty, Empty} when Overflow > 0 ->
                ok = dismiss_worker(Sup, Pid),
                State#state{waiting = Empty, overflow = Overflow - 1};
            {empty, Empty} ->
                Workers = case Strategy of
                    lifo -> [Pid | State#state.workers];
                    fifo -> State#state.workers ++ [Pid]
                end,
                State#state{workers = Workers, waiting = Empty, overflow = 0}
        end.
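To make the blocking effect concrete, here is a minimal sketch, not poolboy itself. It is a toy gen_server whose handle_call sleeps for a few milliseconds, standing in for a worker that connects to the database during checkout. The module name slow_master_demo and the function run/2 are made up for illustration.

```erlang
-module(slow_master_demo).
-behaviour(gen_server).

-export([run/2]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Hypothetical demo, not part of poolboy.
%% Starts a server whose handle_call sleeps DelayMs (a stand-in for
%% opening a connection), fires Callers concurrent calls at it, and
%% returns the elapsed wall-clock time in milliseconds.
run(Callers, DelayMs) ->
    {ok, Server} = gen_server:start_link(?MODULE, DelayMs, []),
    Parent = self(),
    Start = erlang:monotonic_time(millisecond),
    Pids = [spawn_link(fun() ->
                ok = gen_server:call(Server, checkout, infinity),
                Parent ! {done, self()}
            end) || _ <- lists:seq(1, Callers)],
    %% Wait until every caller has received its reply.
    [receive {done, Pid} -> ok end || Pid <- Pids],
    Elapsed = erlang:monotonic_time(millisecond) - Start,
    ok = gen_server:stop(Server),
    Elapsed.

init(DelayMs) ->
    {ok, DelayMs}.

%% The sleep runs inside the server process, so calls are handled
%% strictly one after another even though the callers are concurrent.
handle_call(checkout, _From, DelayMs) ->
    timer:sleep(DelayMs),
    {reply, ok, DelayMs}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Calling slow_master_demo:run(100, 2) should take roughly 100 * 2 ms or longer, because the 100 calls queue up behind one another in the single server process. The same reasoning applies to a pool master that creates connections synchronously: at 1-2 ms per creation it can complete at most about 500 to 1000 of them per second, and every other checkout and checkin waits in the same mailbox.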
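Don't use poolboy's max_overflow. If creating or destroying a child process carries a non-trivial cost, it can easily block the poolboy master process, and frequent worker creation/destruction then leads to an avalanche.
Every time I track down a bug, the cause looks obvious in hindsight, yet the hunt itself takes real effort. I can't publish the monitoring data on a personal blog, so much of the reasoning had to be left out. I hope the conclusion is still useful to you.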
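A pool definition that follows this advice might look like the fragment below. It is only a sketch: mongo_pool and my_mongo_worker are placeholder names, and the size of 100 is simply the worker count mentioned earlier. With overflow disabled, the fixed size probably needs to be chosen to cover peak concurrency.

```erlang
%% Hypothetical pool definition; mongo_pool and my_mongo_worker are
%% placeholder names, not taken from the original service.
PoolArgs = [{name, {local, mongo_pool}},
            {worker_module, my_mongo_worker},
            {size, 100},          %% fixed number of long-lived workers
            {max_overflow, 0}],   %% never create temporary overflow workers
WorkerArgs = [],
ChildSpec = poolboy:child_spec(mongo_pool, PoolArgs, WorkerArgs).
```

With max_overflow at 0, a checkout on an exhausted pool either queues the caller or returns full for a non-blocking checkout, as the last two branches of the checkout code do. It never opens a new connection inside the master process.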