Stalls Caused by GCD Usage in an iOS App

Date: 2019-11-10


I have recently been investigating the various kinds of stalls that occur in iOS apps and how to fix them.

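Stalls in iOS apps are probably more common than most people imagine, especially in the flagship apps of large companies. One reason is that business features keep piling up: product teams rarely coordinate with one another, everyone is busy adding features, and system resources become a bottleneck. The other reason is that old devices are replaced slowly. iOS devices are very durable: plenty of iPhone 4S units are still in service, the iPhone 6, a known problem device, is still widely owned, and devices older than the iPhone 6s are estimated to account for as much as 40% of the installed base.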

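So if you add a stall-detection tool to a production app, you will find that stalls occur surprisingly often. Detecting and fixing them is not simple, though, mainly because they are hard to reproduce on development devices.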

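I previously wrote an article on monitoring main-thread stalls. The mainstream approach today seems to be to observe run loop activity callbacks, check whether the interval between callbacks exceeds a threshold, and, if it does, record the call stacks of all threads in the app.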
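The run loop approach described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the implementation from that earlier article: `StartStallMonitor`, `gLastActivity`, and the queue label are hypothetical names, ARC is assumed, and `onStall` is only a placeholder for whatever actually captures thread call stacks (typically a crash-reporting library).

```objc
#import <Foundation/Foundation.h>
#import <stdatomic.h>

// Last run loop activity observed on the main thread.
// CFRunLoopActivity is an unsigned long option set.
static _Atomic(unsigned long) gLastActivity;

// Hypothetical helper, not a system API. Calls `onStall` on a background
// thread whenever the main run loop spends more than `threshold` seconds
// in a working phase without reaching its next activity callback.
static void StartStallMonitor(NSTimeInterval threshold, void (^onStall)(void)) {
    dispatch_semaphore_t tick = dispatch_semaphore_create(0);

    // Signal the semaphore on every run loop activity of the main thread.
    CFRunLoopObserverRef observer = CFRunLoopObserverCreateWithHandler(
        kCFAllocatorDefault, kCFRunLoopAllActivities, true /* repeats */, 0,
        ^(CFRunLoopObserverRef obs, CFRunLoopActivity activity) {
            atomic_store(&gLastActivity, (unsigned long)activity);
            dispatch_semaphore_signal(tick);
        });
    CFRunLoopAddObserver(CFRunLoopGetMain(), observer, kCFRunLoopCommonModes);
    CFRelease(observer); // the run loop keeps its own reference

    // Note: this loop parks one thread for the lifetime of the app.
    dispatch_queue_t monitor =
        dispatch_queue_create("com.example.stall-monitor", DISPATCH_QUEUE_SERIAL);
    dispatch_async(monitor, ^{
        int64_t limit = (int64_t)(threshold * NSEC_PER_SEC);
        for (;;) {
            long timedOut = dispatch_semaphore_wait(
                tick, dispatch_time(DISPATCH_TIME_NOW, limit));
            if (timedOut == 0) continue; // the run loop made progress in time

            // A timeout while the loop is asleep (BeforeWaiting) is just idleness.
            // Only the phases in which work is being processed count as a stall.
            unsigned long activity = atomic_load(&gLastActivity);
            if (activity == kCFRunLoopBeforeSources ||
                activity == kCFRunLoopAfterWaiting) {
                onStall(); // fires again every `threshold` while the stall lasts
            }
        }
    });
}
```

A caller would start it once at launch, for example `StartStallMonitor(0.25, ^{ /* capture call stacks here */ });`. The threshold value is a judgment call rather than a recommendation.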

A while ago I came across the following call stack in the stall logs reported to our backend:

> 0 libsystem_kernel.dylib __workq_kernreturn
> 1 libsystem_pthread.dylib _pthread_workqueue_addthreads
> 2 libdispatch.dylib _dispatch_queue_wakeup_global_slow
> 3 libdispatch.dylib _dispatch_queue_wakeup_with_qos_slow
> 4 libdispatch.dylib dispatch_async

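In other words, the stall occurred inside dispatch_async. By my understanding of GCD at the time, dispatch_async should never be able to stall: its main job, as I understood it, is to take a worker thread from the system thread pool and run the block on that thread.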

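Yet this call stack really did show up, and in a sizable number of samples. The final call (frame 0) is clearly a kernel call. Judging by the function names, GCD was probably trying to obtain a thread from the pool, found every existing thread busy, and asked the kernel to create a new one. But is a thread-creation kernel call really that slow? Slow enough to stall the main thread? With these questions in mind I searched through a lot of material, and the most relevant thing I found was this article: http://newosxbook.com/articles/GCD.html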

It contains this passage:

This isn’t due to 10.9’s GCD being different - rather, it demonstrates the true asynchronous nature of GCD: The main thread has yet to return from requesting the worker (which it does by pthread_workqueue_addthreads_np, as I’ll describe later), and already the worker thread has spawned and is mid execution, possibly on another CPU core. The exact state of the main thread with respect to the worker is largely unpredictable.

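As I read it, the author suggests that the thread GCD obtains may be one that is busy with another task, and that the main thread has to wait for that busy thread to return before it can continue. I have my doubts about this.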

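Having run out of other options, I decided to spend one of my precious TSIs (Technical Support Incidents) and ask Apple's engineers directly. It is worth mentioning that requesting technical support from Apple is a valuable and entirely practical option: each developer account gets two TSIs per year, and it is a pity to let them go unused.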

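After I sent in the question, I received a reply from an engineer on Apple's kernel team. I am sharing a condensed version of the exchange here in Q&A form: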

Q: looks like even if it’s async dispatching, the main thread still has to wait for the other thread to return, during which time, the other thread happen to be in mid execution of sth. this confuses me, what exactly is the main thread waiting for?

Why does the main thread have to wait for dispatch_async to return, and what exactly is it waiting for?

A: It’s hard to say with just a user space backtrace. Frame 0 has clearly sent the current thread into the kernel, and this specific kernel call is /way/ too complex to analyse from outside [1].

The answer cannot be derived from a user-space call stack alone; the possible kernel states are too complex.

Q: I know it’s suggested that we create limited amount of serial queue，and use target queue probably. but what could happen if we don’t follow that rule?

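Apple has long recommended that when you create your own serial GCD queues you keep their number under control and preferably set a target queue, or problems will follow. I had always been curious what those problems actually are, so I took this opportunity to ask.

A: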

* On macOS, where the system is happier to over commit, you end up with a thread explosion.  That in turn can lead to problems running out of memory, running out of Mach ports, and so on.

* On iOS, which is not happy about over committing, you find that the latency between a block being queued and it running can skyrocket.  This can, in turn, have knock-on effects.  For example, the last time I looked at a problem like this I found that `NSOperationQueue` was dispatching blocks to the global queue for internal maintenance tasks, so when one subsystem within the app consumed all the dispatch worker threads other subsystems would just stall horribly.

Note: In the context of dispatch, an “over commit” is where the system had to allocate more threads to a queue then there are CPU cores.  In theory this should never be necessary because work you dispatch to a queue should never block waiting for resources.  In practice it’s unavoidable because, at a minimum, the work you queue can end up blocking on the VM subsystem.

Despite this, it’s still best to structure your code to avoid the need for over committing, especially when the over commit doesn’t buy you anything.  For example, code like this:

group = dispatch_group_create();
for (url in urlsToFetch) {
    dispatch_group_enter(group);
    dispatch_async(dispatch_get_global_queue(…), ^{
        … fetch `url` synchronously …
        dispatch_group_leave(group);
    });
}
dispatch_group_wait(group, …);

is horrible because it ties up 10 dispatch worker threads for a very long time without any benefit.  And while this is an extreme example — from dispatch’s perspective, networking is /really/ slow — there are less extreme examples that are similarly problematic.  From dispatch’s perspective, even the disk drive is slow (-:
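The following sketch is not part of the engineer's reply. It illustrates one way the pattern above could be restructured so that no dispatch worker thread sits blocked on the network: the requests go through the asynchronous `NSURLSession` API, and `dispatch_group_notify` replaces `dispatch_group_wait`. `FetchAll` and the queue label are hypothetical names, and ARC is assumed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: fetches every URL without parking a worker thread
// per request, then calls `completion` exactly once on `queue`.
// URLs that fail to load are simply absent from the result dictionary.
static void FetchAll(NSArray<NSURL *> *urls,
                     dispatch_queue_t queue,
                     void (^completion)(NSDictionary<NSURL *, NSData *> *results)) {
    dispatch_group_t group = dispatch_group_create();
    NSMutableDictionary<NSURL *, NSData *> *results = [NSMutableDictionary dictionary];
    // A private serial queue guards `results`, whichever thread the
    // completion handlers happen to run on.
    dispatch_queue_t guard =
        dispatch_queue_create("com.example.fetchall.results", DISPATCH_QUEUE_SERIAL);

    for (NSURL *url in urls) {
        dispatch_group_enter(group);
        NSURLSessionDataTask *task = [[NSURLSession sharedSession]
            dataTaskWithURL:url
          completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
              dispatch_async(guard, ^{
                  if (data) results[url] = data;
                  dispatch_group_leave(group);
              });
          }];
        [task resume]; // returns immediately; no thread waits on the request
    }

    // Runs after every request has finished. Nothing blocks in the meantime.
    dispatch_group_notify(group, guard, ^{
        NSDictionary<NSURL *, NSData *> *snapshot = [results copy];
        dispatch_async(queue, ^{ completion(snapshot); });
    });
}
```

The design choice here is simply to let the networking stack do the waiting: while the requests are in flight, the app holds no dispatch worker threads for them.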

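This reply is quite interesting. Anyone who has read the GCD source will know that every GCD queue created with default settings has a priority, and that each priority is actually backed by two root queues: for example, default-priority and default-priority-overcommit. As far as I can tell from the libdispatch source, work submitted to a global concurrent queue goes to the regular root queue, while serial queues you create target the overcommit root queue by default, which is allowed to request a thread even when every CPU core is already busy.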

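On macOS, GCD is willing to over commit, which means dispatch_async can keep causing new threads to be created. Once the system is over committed, the excess threads compete for CPU time according to their priority.

On iOS, GCD keeps over commit in check: if a priority level is already over committed, tasks queued later have to wait. CPU resources on mobile devices are relatively scarce, so this design makes sense.

So if you create too many serial queues on iOS, tasks submitted later may be left waiting for a long time. This is why we need to strictly control the number of queues and their hierarchy. Ideally, each subsystem in the app is allotted a fixed number of queues with fixed priorities, which avoids code failing to run on time because of a thread explosion.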

Q：I know the system watchdog can kill an app if the main thread is taking too long to respond. I also heard rumors that there are two other cases that may gets your app killed by watchdog. the first is too many new threads are being created like by random usage of dispatching work to global concurrent queue? the second case is if CPU has been kept too busy like 100% for too long, watchdog kills app too?

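I took the opportunity to ask why the system watchdog force-kills apps. Rumor has long had it that, besides the main thread being unresponsive for too long, creating too many threads or running the CPU at full load for too long can also get an app killed.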

A：I’m not aware of any specific watchdog check along those lines, but it’s not hard to imagine that the above-mentioned knock-on effects might jam up your app sufficiently for the watchdog to kill it for other reasons. Running the CPU for too long generates a crash report but it doesn’t actually kill the app. It’s essentially a ‘warning’ crash report about the problem.

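Creating too many threads does not directly cause the watchdog to kill the app, but it can keep the main thread from being serviced in time, so the app ends up being killed for other reasons. A CPU that stays overloaded for a long time does not get the app killed either, but the system generates a report to warn the developer. I have indeed seen quite a few of these 'this is not a crash' crash logs.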

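There were a few more questions and answers that are not directly related to my question, so I have left them out. To finish, here is one more reply that I found interesting. Before reading it, think about the following yourself: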

dispatch_async(myQueue, ^{
    // line A
});
// line B

Which runs first, line A or line B?

Consider a snippet like this:

dispatch_async(myQueue, ^{
    // line A
});
// line B

there’s clearly a race condition between lines A and B, that is, between the `dispatch_async` returning and the block running on the queue.  This can pan out in multiple ways, including:

* If `myQueue` (which we’re assuming is a serial queue) is busy, A has to wait so B will definitely run before A.

* If `myQueue` is empty, there’s no idle CPU, and `myQueue` has a higher priority then the thread that called `dispatch_async`, you could imagine the kernel switching the CPU to `myQueue` so that it can run A.

* The thread that called `dispatch_async` could run out of its time quantum after scheduling B on `myQueue` but before returning from `dispatch_async`, which again results in A running before B.

* If `myQueue` is empty and there’s an idle CPU, A and B could end up running simultaneously.
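The practical consequence of the list above is that code must not depend on the relative timing of A and B. As an illustration that is not part of the reply: when B has to run after A, one option is to chain B onto the same serial queue from inside A's block. `RunInOrder` and the queue label are hypothetical, and the semaphore exists only so that this demo can return its log to the caller.

```objc
#import <Foundation/Foundation.h>

// Hypothetical demo: records the order in which "A" and "B" run.
static NSArray<NSString *> *RunInOrder(void) {
    dispatch_queue_t myQueue =
        dispatch_queue_create("com.example.ordered", DISPATCH_QUEUE_SERIAL);
    NSMutableArray<NSString *> *log = [NSMutableArray array];
    dispatch_semaphore_t done = dispatch_semaphore_create(0);

    dispatch_async(myQueue, ^{
        [log addObject:@"A"];
        // B is only submitted once A has run, so it cannot race with A.
        dispatch_async(myQueue, ^{
            [log addObject:@"B"];
            dispatch_semaphore_signal(done);
        });
    });

    // Demo only: real code on the main thread should not block like this.
    dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
    return [log copy];
}
```

Because both blocks run on one serial queue and B is enqueued from within A, the order is fixed by construction rather than by scheduling luck.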

The Answer

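In the end I did not get the precise answer I was hoping for. As the reply says, there are many possible situations, and they are too complex for the kernel's state to be inferred from a single user-space call stack. Still, I was able to sort out some valuable information: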

Takeaway 1

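iOS is itself a resource scheduling and allocation system. CPU, disk I/O, VM and so on are all scarce resources, and they affect one another. A main-thread stall may look like a CPU bottleneck, but the kernel may actually be busy scheduling other resources: heavy disk reads and writes, or large amounts of memory allocation and reclamation, can each make the simple thread-creation kernel call below stall:

libsystem_kernel.dylib __workq_kernreturn

So the only remedy is to analyze each thread's call stack yourself and work out, from the user scenario, which system resources are being consumed at that moment. By going through recently committed code, I did find that some very time-consuming disk I/O tasks had been added (on background threads, no less), which is what produced this seemingly unrelated call stack. After the change was reverted, the stall alerts disappeared.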

Takeaway 2

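Existing stall-detection tools can only dump call stacks once the timeout has already occurred, but a timeout may be the combined result of tasks A, B and C. A and B may be the truly expensive tasks, while C is cheap but happens to run last, so C gets blamed and A and B never appear in the reported logs. I have not yet thought of a particularly good solution. Clearly, libsystem_kernel.dylib __workq_kernreturn is one such inexpensive task C.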

Takeaway 3

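When creating queues with GCD, or more generally when an app runs background work through GCD, it is best to have a queue-usage policy that every team working on the app follows, so that nobody creates too many threads and causes unexpected thread starvation and code that cannot run on time. This is hard, especially in large companies where a team can easily run to a hundred people or more.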
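One possible shape for such a policy, sketched under assumptions: each subsystem owns a single serial "root" queue with a fixed QoS, and every other queue in that subsystem targets it, so the subsystem never runs on more than one thread at a time. `MakeSubsystemRoot`, `MakeChildQueue`, and the labels are hypothetical names. `dispatch_queue_create_with_target` requires iOS 10 / macOS 10.12 or later.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: the single serial root queue of one subsystem.
static dispatch_queue_t MakeSubsystemRoot(const char *label, qos_class_t qos) {
    dispatch_queue_attr_t attr =
        dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_SERIAL, qos, 0);
    return dispatch_queue_create(label, attr);
}

// Hypothetical helper: a child queue whose blocks ultimately execute on
// `root`. Child queues keep their own FIFO order but add no extra threads.
static dispatch_queue_t MakeChildQueue(const char *label, dispatch_queue_t root) {
    return dispatch_queue_create_with_target(label, DISPATCH_QUEUE_SERIAL, root);
}
```

A networking subsystem might then create its queues like this:

```objc
dispatch_queue_t networkRoot =
    MakeSubsystemRoot("com.example.network", QOS_CLASS_UTILITY);
dispatch_queue_t parserQueue =
    MakeChildQueue("com.example.network.parser", networkRoot);
dispatch_queue_t cacheQueue =
    MakeChildQueue("com.example.network.cache", networkRoot);
```

Blocks submitted to `parserQueue` and `cacheQueue` are serialized through `networkRoot`, so however many child queues the team adds, the subsystem's thread usage stays bounded.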

https://mp.weixin.qq.com/s?__biz=MzI5MjEzNzA1MA==&mid=2650264622&idx=1&sn=245f8f13c28a33a7cca7724943972f9f&mid=4245286622720958