从源码角度看traces.txt是如何生成的

简介

原文请看个人博客:从源码角度看traces.txt是如何生成的java

traces.txt 位于安卓系统下/data/anr目录下,当系统中有应用出现ANR时,framework会在弹出Dialog的同时dump出当前各线程的堆栈状况,方便开发者分析出ANR的root cause。android

ANR是Application Not Responsing 的简称,简而言之,就是安卓系统内置提示用户应用界面没有反应的机制,是用来避免应用界面一直卡顿,增长系统用户友好度的一种方式。c++

形成ANR的缘由不少,都是由于在主线程执行任务过久阻塞了界面更新致使的,主要有如下几类:缓存

  1. Broadcast Timeout: 前台广播执行超过10s, 后台广播执行超过60s (要注意的是,只有串行的广播才有超时机制,并行的广播并不会超时,也就是说,若是广播是动态注册的,直接调用sendBroadcast触发,若是主线程Looper中排在后面的Message不会触发超时机制,那么即时这个广播是前台广播,系统也永远不会弹出框提示用户超时了)
  2. Service Timeout: 前台服务之星超过20s, 后台服务之星超过200s
  3. Provider Timeout: 内容提供者,publish超过10s
  4. Input Timeout: 按键触摸事件分发超过5s

可以形成ANR的前提是任务是在主线程上执行的,执行什么样的任务主要有如下几点:bash

  1. 执行耗时任务太久,如文件读取或存储,网络访问获取文件太耗时
  2. 线程被阻塞太久,或者干脆出现了死锁
  3. 线程饥饿,如Binder线程总共16个,Binder主线程占有一个,剩余的15个工做线程都被占满
  4. CPU饥饿,负载值过大,虽然代码正常额但任务一直没有来得及执行

那么回到traces.txt文件,它到底包含了什么信息能够帮助开发者找到ANR问题的根源呢。网络

在这篇文章,我先从一份traces.txt的日志实例开始解析,而后再经过追踪源码来解释traces.txt是如何生成的。app

我是基于Android 5.0原生代码进行分析的,最新的一些代码逻辑放到了AppErrors类中了,有些小差异,可是大致的逻辑和步骤没有变化,不会影响阅读。ide

日志实例

// 触发ANR的应用pid及触发时间
----- pid 30307 at 2017-09-03 14:51:14 -----
Cmd line: com.example.androidyue.bitmapdemo

JNI: CheckJNI is off; workarounds are off; pins=0; globals=272

DALVIK THREADS:
(mutexes: tll=0 tsl=0 tscl=0 ghl=0)

// JAVA 态线程状态

// 线程名称; 优先级; 线程id; 线程状态
"main" prio=5 tid=1 TIMED_WAIT
  // 线程组; 线程suspend计数; 线程debug suspend计数; 同辈线程对象; 当前线程对象
  | group="main" sCount=1 dsCount=0 obj=0x416eaf18 self=0x416d8650
  // 线程id; 进程优先级; 调度者优先级; 调度者组名; 线程缓存句柄
  | sysTid=30307 nice=0 sched=0/0 cgrp=apps handle=1074565528
  // native线程状态; 调度者状态; 执行用户代码的总时间; 执行系统代码的总时间; 任务cpu使用数量
  | state=S schedstat=( 0 0 0 ) utm=5 stm=4 core=3
  // 线程调用栈
  at java.lang.VMThread.sleep(Native Method)
  at java.lang.Thread.sleep(Thread.java:1044)
  at java.lang.Thread.sleep(Thread.java:1026)
  at com.example.androidyue.bitmapdemo.MainActivity$1.run(MainActivity.java:27)
  at android.app.Activity.runOnUiThread(Activity.java:4794)
  at com.example.androidyue.bitmapdemo.MainActivity.onResume(MainActivity.java:33)
  at android.app.Instrumentation.callActivityOnResume(Instrumentation.java:1282)
  at android.app.Activity.performResume(Activity.java:5405)

// Native态线程状态

"Binder_5" prio=5 tid=32 NATIVE
  | group="main" sCount=1 dsCount=0 obj=0x41dd6570 self=0x678fa458
  | sysTid=29473 nice=0 sched=0/0 cgrp=apps handle=1732666432
  | state=S schedstat=( 578000 10600000 3 ) utm=0 stm=0 core=0
  #00 pc 00016fe4 /system/lib/libc.so (__ioctl+8)
  #01 pc 0002a97d /system/lib/libc.so (ioctl+16)
  #02 pc 00016ba1 /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132)
  #03 pc 00017363 /system/lib/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+154)
  #04 pc 0001b15d /system/lib/libbinder.so
  #05 pc 00011267 /system/lib/libutils.so (android::Thread::_threadLoop(void*)+114)
  #06 pc 0004679f /system/lib/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+66)
  #07 pc 00010dcd /system/lib/libutils.so
  #08 pc 0000e3d8 /system/lib/libc.so (__thread_entry+72)
  #09 pc 0000dac4 /system/lib/libc.so (pthread_create+160)
  at dalvik.system.NativeStart.run(Native Method)复制代码

关键词对照

关键词 解释
prio 优先级
tid 线程id
group 组名
sCount 挂起次数
dsCount 调试下 挂起计数
obj 同辈线程对象
self 当前线程对象
sysTid 线程id
nice 进程优先级
sched 调度者优先级
cgrp 调度者组名
handle 线程缓存句柄
state native线程状态
schedstat 调度者状态
utm 用户态CPU时间
stm 内核态CPU时间
core 运行所在核

线程状态对照

JAVA状态 CPP状态 解释
TERMINATED ZOMBIE 线程死亡
RUNNABLE RUNNING/RUNNABLE 可运行或正在运行
TIMED_WAITING TIMED_WAIT 执行了带有超时参数的wait, sleep或join函数
BLOCKED MONITOR 线程阻塞,等待获取对象锁
WAITING WAIT 执行了无超时参数的wait
NEW INITIALIZING 新建,正在初始化中
NEW STARTING 新建,正在启动
RUNNABLE NATIVE 正在执行JNI本地函数
WAITING VMWAIT 正在等待VM资源
RUNNABLE SUSPENDED 线程暂停,一般因为GC或者DEBUG被挂起

源码分析 (虚拟机)

我将从代码的最底层开始分析,再慢慢到触发点的顶端,先从cpp开始函数

Thread::DumpState

thread.ccoop

void Thread::DumpState(std::ostream& os, const Thread* thread, pid_t tid) {
  // 组名
  std::string group_name;
  // 优先级
  int priority;
  // 是不是daemon线程
  bool is_daemon = false;
  // 拿到当前线程对象指针
  Thread* self = Thread::Current();

  if (thread != nullptr) {
    ...
  }


  if (gAborting == 0 && self != nullptr && thread != nullptr && thread->tlsPtr_.opeer != nullptr) {
    ScopedObjectAccessUnchecked soa(self);
    priority = soa.DecodeField(WellKnownClasses::java_lang_Thread_priority)
        ->GetInt(thread->tlsPtr_.opeer);
    is_daemon = soa.DecodeField(WellKnownClasses::java_lang_Thread_daemon)
        ->GetBoolean(thread->tlsPtr_.opeer);

    ...

    if (thread_group != nullptr) {
      ...
      group_name = (group_name_string != nullptr) ? group_name_string->ToModifiedUtf8() : "<null>";
    }
  } else {
    priority = GetNativePriority();
  }

  std::string scheduler_group_name(GetSchedulerGroupName(tid));
  if (scheduler_group_name.empty()) {
    scheduler_group_name = "default";
  }

  // 这里开始输出线程状态
  if (thread != nullptr) {
    // 线程名输出
    os << '"' << *thread->tlsPtr_.name << '"';
    if (is_daemon) {
      os << " daemon";
    }
    // 优先级; ID; 状态
    os << " prio=" << priority
       << " tid=" << thread->GetThreadId()
       << " " << thread->GetState();
    if (thread->IsStillStarting()) {
      os << " (still starting up)";
    }
    os << "\n";
  } else {
    os << '"' << ::art::GetThreadName(tid) << '"'
       << " prio=" << priority
       << " (not attached)\n";
  }

  // 线程组; 线程suspend计数; 线程debug suspend计数; 同辈线程对象; 当前线程对象
  if (thread != nullptr) {
    MutexLock mu(self, *Locks::thread_suspend_count_lock_);
    os << "  | group=\"" << group_name << "\""
       << " sCount=" << thread->tls32_.suspend_count
       << " dsCount=" << thread->tls32_.debug_suspend_count
       << " obj=" << reinterpret_cast<void*>(thread->tlsPtr_.opeer)
       << " self=" << reinterpret_cast<const void*>(thread) << "\n";
  }

  // 线程id; 进程优先级; 调度者组名
  os << "  | sysTid=" << tid
     << " nice=" << getpriority(PRIO_PROCESS, tid)
     << " cgrp=" << scheduler_group_name;

  // 调度者优先级; 缓存句柄
  if (thread != nullptr) {
    int policy;
    sched_param sp;
    CHECK_PTHREAD_CALL(pthread_getschedparam, (thread->tlsPtr_.pthread_self, &policy, &sp),
                       __FUNCTION__);
    os << " sched=" << policy << "/" << sp.sched_priority
       << " handle=" << reinterpret_cast<void*>(thread->tlsPtr_.pthread_self);
  }
  os << "\n";

  // Grab the scheduler stats for this thread.
  std::string scheduler_stats;
  if (ReadFileToString(StringPrintf("/proc/self/task/%d/schedstat", tid), &scheduler_stats)) {
    scheduler_stats.resize(scheduler_stats.size() - 1);  // Lose the trailing '\n'.
  } else {
    scheduler_stats = "0 0 0";
  }

  char native_thread_state = '?';
  int utime = 0;
  int stime = 0;
  int task_cpu = 0;
  GetTaskStats(tid, &native_thread_state, &utime, &stime, &task_cpu);

  // native线程状态; 调度者状态; 执行用户代码的总时间; 执行系统代码的总时间; 任务cpu使用数量
  os << "  | state=" << native_thread_state
     << " schedstat=( " << scheduler_stats << " )"
     << " utm=" << utime
     << " stm=" << stime
     << " core=" << task_cpu
     << " HZ=" << sysconf(_SC_CLK_TCK) << "\n";
  if (thread != nullptr) {
    // dump 线程当前栈
    os << "  | stack=" << reinterpret_cast<void*>(thread->tlsPtr_.stack_begin) << "-"
        << reinterpret_cast<void*>(thread->tlsPtr_.stack_end) << " stackSize="
        << PrettySize(thread->tlsPtr_.stack_size) << "\n";
    // dump 线程锁信息
    os << "  | held mutexes=";
    for (size_t i = 0; i < kLockLevelCount; ++i) {
      if (i != kMonitorLock) {
        BaseMutex* mutex = thread->GetHeldMutex(static_cast<LockLevel>(i));
        if (mutex != nullptr) {
          os << " \"" << mutex->GetName() << "\"";
          if (mutex->IsReaderWriterMutex()) {
            ReaderWriterMutex* rw_mutex = down_cast<ReaderWriterMutex*>(mutex);
            if (rw_mutex->GetExclusiveOwnerTid() == static_cast<uint64_t>(tid)) {
              os << "(exclusive held)";
            } else {
              os << "(shared held)";
            }
          }
        }
      }
    }
    os << "\n";
  }
}复制代码

GetTaskStats

utils.cc

void GetTaskStats(pid_t tid, char* state, int* utime, int* stime, int* task_cpu) {
  *utime = *stime = *task_cpu = 0;
  std::string stats;
  // 从stat节点获取数据
  if (!ReadFileToString(StringPrintf("/proc/self/task/%d/stat", tid), &stats)) {
    return;
  }
  // Skip the command, which may contain spaces.
  stats = stats.substr(stats.find(')') + 2);
  // Extract the three fields we care about.
  std::vector<std::string> fields;
  Split(stats, ' ', &fields);
  *state = fields[0][0];
  *utime = strtoull(fields[11].c_str(), nullptr, 10);
  *stime = strtoull(fields[12].c_str(), nullptr, 10);
  *task_cpu = strtoull(fields[36].c_str(), nullptr, 10);
}复制代码

DumpUnattachedThread

thread_list.cc

// 读取该线程中的全部状态信息:JAVA态,Native态,Kernel态
static void DumpUnattachedThread(std::ostream& os, pid_t tid) NO_THREAD_SAFETY_ANALYSIS {
  // [接上]
  Thread::DumpState(os, nullptr, tid);
  DumpKernelStack(os, tid, " kernel: ", false);

  if (false) {
    DumpNativeStack(os, tid, " native: ");
  }
  os << "\n";
}

void ThreadList::DumpUnattachedThreads(std::ostream& os) {
  DIR* d = opendir("/proc/self/task");
  if (!d) {
    return;
  }

  Thread* self = Thread::Current();
  dirent* e;
  // 获取当前进程中全部线程,dump这些线程的信息
  while ((e = readdir(d)) != nullptr) {
    char* end;
    pid_t tid = strtol(e->d_name, &end, 10);
    if (!*end) {
      bool contains;
      {
        MutexLock mu(self, *Locks::thread_list_lock_);
        contains = Contains(tid);
      }
      if (!contains) {
        // [接上]
        DumpUnattachedThread(os, tid);
      }
    }
  }
  closedir(d);
}

bool ThreadList::Contains(Thread* thread) {
  return find(list_.begin(), list_.end(), thread) != list_.end();
}复制代码

DumpForSigQuit

thread_list.cc

void ThreadList::DumpForSigQuit(std::ostream& os) {
  ...
  DumpUnattachedThreads(os);
}复制代码

runtime.cc

void Runtime::DumpForSigQuit(std::ostream& os) {
  GetClassLinker()->DumpForSigQuit(os);
  GetInternTable()->DumpForSigQuit(os);
  GetJavaVM()->DumpForSigQuit(os);
  GetHeap()->DumpForSigQuit(os);
  TrackedAllocators::Dump(os);
  os << "\n";

  // [接上]
  // dump 线程列表
  thread_list_->DumpForSigQuit(os);
  BaseMutex::DumpAll(os);
}复制代码

HandleSigQuit

singal_catcher.cc

void SignalCatcher::HandleSigQuit() {
  Runtime* runtime = Runtime::Current();
  std::ostringstream os;
  // 输出进程id以及当前dump时间
  os << "\n"
      << "----- pid " << getpid() << " at " << GetIsoDate() << " -----\n";

  DumpCmdLine(os);

  // 输出系统信息
  std::string fingerprint = runtime->GetFingerprint();
  os << "Build fingerprint: '" << (fingerprint.empty() ? "unknown" : fingerprint) << "'\n";
  os << "ABI: '" << GetInstructionSetString(runtime->GetInstructionSet()) << "'\n";

  os << "Build type: " << (kIsDebugBuild ? "debug" : "optimized") << "\n";

  // [接上]
  runtime->DumpForSigQuit(os);

  ...
  os << "----- end " << getpid() << " -----\n";
  Output(os.str());
}复制代码

SignalCatcher::Run

void* SignalCatcher::Run(void* arg) {
...
  // Set up mask with signals we want to handle.
  SignalSet signals;
  signals.Add(SIGQUIT);
  signals.Add(SIGUSR1);

  while (true) {
    // 等待信号
    int signal_number = signal_catcher->WaitForSignal(self, signals);
...

    switch (signal_number) {
    case SIGQUIT:
      // [接上]
      signal_catcher->HandleSigQuit();
      break;
    case SIGUSR1:
      signal_catcher->HandleSigUsr1();
      break;
    default:
      LOG(ERROR) << "Unexpected signal %d" << signal_number;
      break;
    }
  }
}

int SignalCatcher::WaitForSignal(Thread* self, SignalSet& signals) {
...
  int signal_number = signals.Wait();
...
  return signal_number;
}复制代码

源码分析 (framework)

可见,android framework经过发送信号3来通知虚拟机来进行dump到traces.txt文件中的

dumpStackTraces

ActivityManagerService.java

private static void dumpStackTraces(String tracesPath, ArrayList<Integer> firstPids, ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
    // 当trace写完后,监听并触发onEvent方法
    FileObserver observer = new FileObserver(tracesPath, FileObserver.CLOSE_WRITE) {
        @Override
        public synchronized void onEvent(int event, String path) {
            notify();
        }
    };

    try {
        // 开始监听
        observer.startWatching();

        // 首先收集firstPids链表里的进程信息
        if (firstPids != null) {
            try {
                int num = firstPids.size();
                for (int i = 0; i < num; i++) {
                    synchronized (observer) {
                        // 发送信号3来触发
                        Process.sendSignal(firstPids.get(i), Process.SIGNAL_QUIT);
                        observer.wait(200);  // Wait for write-close, give up after 200msec
                    }
                }
            } catch (InterruptedException e) {
                Slog.wtf(TAG, e);
            }
        }

        // 收集native进程信息
        ...

        // 最后收集CPU信息
        ...
    } finally {
        observer.stopWatching();
    }
}复制代码
public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids, ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
    // 获取trace文件路径
    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
...

    File tracesFile = new File(tracesPath);
    try {
...
        FileUtils.setPermissions(tracesDir.getPath(), 0775, -1, -1);  // drwxrwxr-x

        if (clearTraces && tracesFile.exists()) tracesFile.delete();
        tracesFile.createNewFile();
        FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1); // -rw-rw-rw-
    } catch (IOException e) {
        Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesPath, e);
        return null;
    }

    // [接上]
    dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
    return tracesFile;
}复制代码

appNotResponding

ActivityManagerService.java

final void appNotResponding(ProcessRecord app, ActivityRecord activity, ActivityRecord parent, boolean aboveSystem, final String annotation) {
    ArrayList<Integer> firstPids = new ArrayList<Integer>(5);
    SparseArray<Boolean> lastPids = new SparseArray<Boolean>(20);
...
    final ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);

    // [接上]
    File tracesFile = dumpStackTraces(true, firstPids, processCpuTracker, lastPids,
            NATIVE_STACKS_OF_INTEREST);
...
    Slog.e(TAG, info.toString());
    if (tracesFile == null) {
        // There is no trace file, so dump (only) the alleged culprit's threads to the log
        Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
    }

    addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
            cpuInfo, tracesFile, null);
...
}复制代码

触发ANR的方式上面我已经列举过了,这里我以触发Broadcast Timeout时进行分析

AppNotResponding.run

BroadcastQueue.java

private final class AppNotResponding implements Runnable {
    private final ProcessRecord mApp;
    private final String mAnnotation;

    public AppNotResponding(ProcessRecord app, String annotation) {
        mApp = app;
        mAnnotation = annotation;
    }

    @Override
    public void run() {
        // [接上]
        mService.appNotResponding(mApp, null, null, false, mAnnotation);
    }
}复制代码

broadcastTimeoutLocked

BroadcastQueue.java

final void broadcastTimeoutLocked(boolean fromMsg) {
...
    String anrMessage = null;
...
    if (app != null) {
        anrMessage = "Broadcast of " + r.intent.toString();
    }
...
    if (anrMessage != null) {
        // [接上]
        mHandler.post(new AppNotResponding(app, anrMessage));
    }
}复制代码

BroadcastHandler.handleMessage

private final class BroadcastHandler extends Handler {
    public BroadcastHandler(Looper looper) {
        super(looper, null, true);
    }

    @Override
    public void handleMessage(Message msg) {
        switch (msg.what) {
...
            case BROADCAST_TIMEOUT_MSG: {
                synchronized (mService) {
                    // [接上]
                    broadcastTimeoutLocked(true);
                }
            } break;
...
        }
    }
};复制代码

setBroadcastTimeoutLocked

BroadcastQueue.java

final void setBroadcastTimeoutLocked(long timeoutTime) {
    if (! mPendingBroadcastTimeoutMessage) {
        Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this);
        mHandler.sendMessageAtTime(msg, timeoutTime);
        mPendingBroadcastTimeoutMessage = true;
    }
}

final void processNextBroadcast(boolean fromMsg) {
    if (! mPendingBroadcastTimeoutMessage) {
        long timeoutTime = r.receiverTime + mTimeoutPeriod;
        // [接上]
        setBroadcastTimeoutLocked(timeoutTime);
    }
}复制代码

总结

下面总结一下触发广播ANR后输出traces.txt的过程:

  1. 有序广播在注册后发送
  2. AMS设置广播超时时间,经过Handler进行延迟设定
  3. 广播超时触发超时机制
  4. Framework层发送信号给虚拟机
  5. 虚拟机dump出特定进程全部线程信息并存储到traces.txt文件中
相关文章
相关标签/搜索