This is part two of our PLCrashReporter series. In this post, we will examine how crashes are created and learn more about specific crash types.
A crash handler has a three-phase life cycle:
Next, let’s look closer at the first phase. How does one prepare to intercept crashes? To answer that, we need to look at how crashes are propagated. On Apple platforms, there are two pathways via which application crashes flow: POSIX signals and Mach exceptions.
First up are POSIX signals. When an illegal instruction or a request for termination occurs, the kernel sends a POSIX signal to the offending thread. These signals have a shortlist of usual suspects:
SIGSEGV – Memory errors
SIGILL – Illegal instructions
SIGABRT – Usually when the process itself calls abort()
SIGKILL – For example when you issue the killall -9 command in a shell. That “9” is the value of SIGKILL.
Once that signal is delivered, the OS terminates the process. Between when the signal is sent and the process is terminated, any signal handlers that were registered for the process are given an opportunity to respond. SIGKILL is the exception: it cannot be caught, so no handler runs for it.
Apple’s operating systems run atop a Darwin kernel. Darwin is a descendant of the Mach kernel. Mach kernels, much like Darwin, use exception messages, rather than POSIX signals, to communicate about unexpected errors in the program flow. Mach exceptions are messages sent over IPC ports, which can be subscribed to by any interested observer with sufficient permissions for the process they’re interested in.
On Darwin, Mach exception messages are actually the underlying mechanism beneath the implementation of POSIX signals. Darwin registers a Mach exception handler that reflects Mach exceptions into POSIX signals. This is why, for example, when you look at a memory access error crash, the description of the crash includes EXC_BAD_ACCESS (SIGSEGV), the Mach exception and the POSIX signal respectively.
It is possible for your app to register its own Mach exception server, but a thorough exception server implementation requires use of undocumented or private API and is fraught with even more peril than writing your own POSIX signal handler.
When writing a custom crash logger, you have to decide which of these mechanisms you will use to intercept crashes. While this might seem like a purely academic choice, there is at least one salient edge case: POSIX signal handlers are run on the crashed thread, not a separate thread, thus using the same stack as the crashed thread. If that thread encountered a stack overflow, there will by definition be no available space on top of the stack for your signal handler to be executed, and you will be unable to capture the crash. A Mach exception handler is immune to this edge case because your handler — or, more accurately, your exception server — is listening for exceptions on a dedicated thread, which likely has enough room on its call stack to execute crash-handling code.
PLCrashReporter can use either POSIX signals or Mach exceptions, but the authors strenuously recommend against using a Mach exception server in production code.
Within your crash-handling code, there are truly profound limitations on what APIs you are able to use. There is almost nothing available to you. There is a shortlist of what are called “async-signal-safe” APIs that can be used within a signal handler. Because the heap could be corrupted, they can only use stack memory. Fortunately, they include essential file-system functions like open(2) and write(2), which is how a custom crash handler is able to save a crash log to disk.
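As a rough sketch of what saving to disk can look like with nothing but those calls, the code below writes a fixed piece of text to a file using only open(2), write(2), and close(2). The function names write_all and save_crash_marker are hypothetical, and a real crash log is of course far more involved than a single line of text.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writes len bytes to fd, retrying on partial writes and EINTR.
 * Returns 0 on success, -1 on error. Uses only write(2). */
static int write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: saves text_len bytes of text to path.
 * The length is passed in so no string functions are needed, and
 * errno is restored so the interrupted code does not see it change. */
static int save_crash_marker(const char *path, const char *text,
                             size_t text_len) {
    int saved_errno = errno;
    int result = -1;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        result = write_all(fd, text, text_len);
        close(fd);
    }
    errno = saved_errno;
    return result;
}
```

Passing the length explicitly and keeping every buffer on the stack or in static storage keeps the helper within the constraints described above.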
The “async” part is misleading from the perspective of a practicing iOS or macOS developer. It doesn’t mean they’re safe for concurrent access from multiple threads. Rather, an API is considered async-signal-safe if it is guaranteed to be fully re-entrant. A crash could occur at any moment during program execution, including somewhere within a call to a function that your crash handler might need to call itself! If your crash handler then called that same function during the course of handling the crash, and that function isn’t async-signal-safe, your crash handler might deadlock, leading to a lost crash log. You need async-signal-safe turtles all the way down.
Some additional things you cannot do because they aren’t async-signal-safe:
You cannot allocate memory on the heap, because malloc isn’t async-signal-safe, and also because the heap might be corrupted. Any memory your crash handler needs to perform its duties must be allocated during app launch when the handler is first registered. Its memory budget is thus fixed and predetermined, even though a crash log could contain any amount of information. Consider how many megabytes of data could be in a single crash log if there are deep call stacks and hundreds of libraries.
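A fixed memory budget can be sketched as a buffer reserved in static storage, with appends that fail rather than grow. The example below also formats a number as hexadecimal by hand, on the assumption that snprintf should be avoided inside a handler. All names and the 4096-byte size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed budget, reserved before any crash can happen. */
enum { CRASH_BUFFER_SIZE = 4096 };
static char g_crash_buffer[CRASH_BUFFER_SIZE];
static size_t g_crash_buffer_used = 0;

/* Appends len bytes. Returns 0 on success, or -1 without writing
 * anything if the fixed budget would be exceeded. */
static int crash_buffer_append(const char *bytes, size_t len) {
    if (len > CRASH_BUFFER_SIZE - g_crash_buffer_used) {
        return -1;
    }
    for (size_t i = 0; i < len; i++) {
        g_crash_buffer[g_crash_buffer_used + i] = bytes[i];
    }
    g_crash_buffer_used += len;
    return 0;
}

/* Appends value as "0x" followed by 16 hex digits, using only
 * stack memory and no library formatting functions. */
static int crash_buffer_append_hex(uint64_t value) {
    static const char digits[] = "0123456789abcdef";
    char text[18];
    text[0] = '0';
    text[1] = 'x';
    for (int i = 0; i < 16; i++) {
        text[2 + i] = digits[(value >> (60 - 4 * i)) & 0xf];
    }
    return crash_buffer_append(text, sizeof(text));
}
```

When the budget runs out, the append fails and the report is truncated, which is the trade-off a predetermined allocation forces.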
Putting this all together, writing a proper crash reporter requires:
By now, you should be sufficiently terrified of writing your own crash logger. If you aren’t, you’re braver than most, or you weren’t paying attention. The rest of us are fortunate that there are existing implementations.
A reliable, well-maintained, open-source library for capturing crashes on Apple platforms is PLCrashReporter. It has changed hands a few times as its ownership hopped from Plausible Labs to HockeyApp to Microsoft, but the fundamental design of the library remains the same.
The next post in our series will walk you through adding the PLCrashReporter library to an application so that you can obtain crash logs directly on your device without having to resort to a third-party service.
[2]: Landon Fuller, the primary author of PLCrashReporter, gives an example of how allowing program execution to continue after the crash can corrupt or destroy user data in the section “Failure Case: Async-Safety and Data Corruption” of Reliable Crash Reporting.