The Universal Troubleshooting Process
Last week I walked through finding and fixing My Favorite Bug. Observant readers may have noticed a multi-step process outlined by some <H3> tags. Get the Attitude, Do Corrective Maintenance, and so on. What was that?
It’s called the Universal Troubleshooting Process. The UTP was invented and copyrighted by Steve Litt in the ’90s. Prior to doing software, Steve used to fix stereo equipment, like when people would bring in tube amps that didn’t work any more. (Does anyone remember tube amps? Does anyone actually remember repairing electronics?) When faced with a problem of unknown cause and complexity, you can flail around until you stumble on a solution, or you can methodically close in on what’s going wrong. Turns out that this kind of process can be applied to just about anything. Fixing stereos. Fixing cars. Fixing software.
I don’t use the UTP for every bug I sit down to fix. Most day-to-day debugging is simple enough, or localized enough (hey! double-check the new code!) that a full-blown process isn’t necessary. But if I know I’m faced with a potentially big, time-critical fix, I crank up the UTP.
Here’s my interpretation of the process.
Step 1 – Prepare
Back when I learned the UTP, this step was called “Get the Attitude.” This is where you psych yourself up before diving into the problem, with a simple mantra like “yeah, I can do this.” The cause of any problem can be found. It might take a while. You might not be able to fix it due to technical or business reasons, but you can at least identify what’s going wrong.
If you don’t have the attitude right now, say you’re recovering from an all-night Jaegermeister bender or you’re about to start a week-long offsite in Bermuda, you might want to postpone starting your debugging session. If it’s something I know will be big and nasty, I might procrastinate a bit and clean up my desk (which is usually a disaster area) so I won’t be working in a jumbled, chaotic environment.
Step 2 – Make a Damage Control Plan
This is where you figure out what can go wrong while you’re fixing things, and make a plan on how to mitigate it. If you’re going to be troubleshooting a live production database, you’ll want to make sure you have backups, and your backups are good in case you accidentally trash things. Or, even better, see if you can reproduce the system with a staging database that you can easily repopulate. If you’re dealing with a physical system, make sure you’re not endangering yourself or those around you. If you’re dealing with delicate electronics, make sure you’re not going to be able to short it to the mains.
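Here is a minimal sketch of the "make sure your backups are good" idea, assuming the thing you are protecting is a single file. The names `backup_and_verify`, `sha256_of`, and `app.db` are made up for illustration, and a plain file copy is only a stand-in: a live database would need its own dump or snapshot tool.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and refuse to continue if the copy differs."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = backup_dir / (source.name + ".bak")
    shutil.copy2(source, copy)
    if sha256_of(source) != sha256_of(copy):
        raise RuntimeError(f"backup of {source} does not match the original")
    return copy


# Demonstration on a throwaway file (hypothetical stand-in for real data).
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "app.db"
    data_file.write_bytes(b"pretend this is precious data")
    backup = backup_and_verify(data_file, Path(tmp) / "backups")
    backup_ok = backup.read_bytes() == data_file.read_bytes()
```

The point is the order of operations: take the backup, prove it is good, and only then start poking at the real thing.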
I apply the idea of “damage” control to business decisions as well. Are there any business factors that affect how long this fix can take, such as the unveiling of the product at a conference? At what point does the cost of my time hit diminishing returns relative to the severity of the bug?
Step 3 – Get a complete and accurate symptom description
Programmers break into laughter when they hear this step’s description because it never happens, at least to start out with. You frequently start out with a bug report written by Pakleds: “The App crashes :-(”. After some back-and-forth, you finally isolate the problem to “The app crashes when I command-triple-click on a paragraph that has mixed bold and italic text.” The first description is pretty much impossible to work from. You can probably zero in on the offending code with the second one.
I like using a screen recorder, such as ScreenFlow or QuickTime’s recorder, when I make bug reports. There are a couple of advantages to submitting a video of the problem in action. The most important is that you have actual evidence of the system malfunctioning. This makes it harder to dismiss it as the user smoking their socks.
A screen recording gives the developer looking at the video much more information than a simple text description. There may be some clues in the user’s behavior – are they clicking frantically, or are they relatively lethargic? A fast clicker might hit a race condition that a normal clicker might not. Maybe the user is in an unexpected mode without realizing it. There may be some UI oddities like “huh, why is that icon highlighted in the corner right now?” that could give some insight into the problem.
Screen recording isn’t just for desktop Macs. A tool like Reflector will show an iOS screen on the desktop where it can be recorded. It won’t capture all aspects of the app’s display (such as locations of touches, particularly fast animations, or parallax effects) but chances are you’ll have more useful data than a simple text bug report.
Step 4 – Reproduce the symptom
Once you can reliably reproduce a bug, it’s dead. If it’s intermittent, don’t give up. You may eventually be able to recharacterize the problem into something reproducible.
Very important: be consistent with your test data and your reproduce-the-bug steps! I have encountered anomalous software behavior that was caused by several bugs conspiring together. You could be dividing your attention amongst distinct problems if you’re lackadaisical with your test data, and ultimately make no headway on solving your problem.
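One way to keep yourself honest about test data is to generate it from a fixed seed and run the same steps many times, so an intermittent bug turns into a number you can compare before and after a change. This is only a sketch: `failure_rate` and `flaky_scenario` are hypothetical names, and the "bug" here is faked with arithmetic.

```python
import random
from typing import Callable


def failure_rate(run_once: Callable[[random.Random], bool], seed: int, attempts: int) -> float:
    """Run the same scenario `attempts` times and return the fraction of runs that failed."""
    rng = random.Random(seed)  # fixed seed: identical test data on every invocation
    failures = sum(1 for _ in range(attempts) if not run_once(rng))
    return failures / attempts


def flaky_scenario(rng: random.Random) -> bool:
    """Hypothetical stand-in for your repro steps; returns True when the run succeeds."""
    payload_size = rng.randint(1, 100)
    return payload_size % 10 != 0  # pretend sizes divisible by 10 trigger the bug


# Same seed, same steps: the two measurements are identical, so any change
# you see after editing code comes from the code and not from the data.
before = failure_rate(flaky_scenario, seed=1234, attempts=200)
again = failure_rate(flaky_scenario, seed=1234, attempts=200)
```

If the rate moves when you change only one thing, you have learned something. If you changed the data and the code at the same time, you have learned nothing.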
Step 5 – Do appropriate Corrective Maintenance
This is the stuff you do to prevent yourself from feeling really stupid after you fix the problem. Electrical problems in a car? Make sure the battery terminals are clean and the battery holds a charge. Acceleration issues? Make sure the floor mat isn’t under the gas pedal. (Yes, this happened to me before.) Amplifier on the fritz? Make sure all the tubes are seated. Database performance problems? Make sure the database has been vacuumed and analyzed recently.
For Mac and iOS software, run the Xcode static analyzer, especially if you’re being called into someone else’s project to help. Fix your warnings.
Check the hardware, especially cables to peripheral devices. I had written some code to pull data off of a SCSI DAT tape drive for a contracting client. I got a call a couple of months later: “hey, your code is corrupting data. Come in and fix it now please.” One of my friends there said “check the cable”. “Oh, no, it must be a software problem, I blame myself in all things.” After a couple of hours of impossible results from the device, I decided to check the cable. Bent pins.
Don’t forget the user defaults. There may be settings in your NSUserDefaults that prevent you from reproducing the bug. SwitchUp (formerly called RooSwitch) lets you swap around user default settings. You can do development as a user who has never run your app before. You can have a set of preferences typical of a power user that you can swap in to do more testing. And you can also have your own preferred settings for when you use your own app.
Step 6 – Narrow it down to the root causes
This is where most of us start the debugging process. We get a bug report and then set breakpoints and run the app. Or maybe we throw some caveman debugging at it. Or maybe both at the same time.
Narrowing things down to the root cause frequently involves binary search. Divide and Conquer. It’s hard to hide from an O(log N) algorithm. Come up with experiments that let you implicate (or exonerate) large swaths of code. Source code control is your friend during this process. Feel free to hack and slash. Comment out entire functions, or maybe have an early return with a constant value. Replace a library with mock objects. Randomly change ++s to --s. There are no rules. Your task is to come up with testable ideas and run experiments. Then use the data from the experiments to zero in on the problem. Sometimes inspiration will strike. That’s awesome. If not, you have to keep grinding away at it.
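The binary search idea can be sketched in a few lines. This assumes you have an ordered list of candidates (revisions, config changes, input sizes), that the last one is known to be bad, and that once things go bad they stay bad. The names `first_bad` and `symptom_present` are hypothetical; `git bisect` automates the same search across commits.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the earliest candidate for which is_bad() is true.

    Assumes the last candidate is bad and that badness never goes away again.
    """
    lo, hi = 0, len(candidates) - 1  # invariant: candidates[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid        # the problem was introduced at mid or earlier
        else:
            lo = mid + 1    # everything up to mid is exonerated
    return candidates[lo]


# Hypothetical history: revisions 0..99, with the bug introduced at revision 37.
revisions = list(range(100))
checks = []


def symptom_present(rev: int) -> bool:
    """Stand-in for 'check out this revision and run the repro steps'."""
    checks.append(rev)
    return rev >= 37


culprit = first_bad(revisions, symptom_present)
print(f"first bad revision: {culprit}, experiments run: {len(checks)}")
```

A hundred revisions take at most seven experiments. Each experiment exonerates half of what is left, which is why it pays to make each one cheap and unambiguous.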
Crashes are embarrassing, but frequently you’ll have a nice stack trace pointing right at the problem. I much prefer to track those down than some flaky intermittent bug.
Step 7 – Repair or replace the defective component
You’ve now found the root cause by providence or brute force. Now it’s time to make the fix. Solder in a new capacitor. Replace an engine’s oxygen sensor. Fix the parameters to a UITableView call. This is when you revert-out all the changes you made, create a new feature branch in git, and make your fix.
There’s a big difference between this step and the prior one. Before, you didn’t understand what was wrong, so all the code changes you were making probably had massive collateral damage with the rest of the application. Now you know what’s going on, so you should understand the fix and its impact on the rest of the codebase.
You might have heard the advice “You must understand what you’re doing when debugging.” That applies to this step. Not to Step 6. I did myself a big disservice early in my career by embracing the idea of “If I hack during debugging, I’m unprofessional!” Needless to say I was paralyzed as a debugger. I don’t understand what’s wrong! If I understood it’d be fixed by now! The idea that you must understand things before you change code only applies to this step. Feel free to hack, slash, and completely have no idea what’s going on while you’re struggling to get to the point of understanding. That’s actually half the fun.
This is the point in time where you make the code changes and get your code review.
Step 8 – Test
Did the symptom go away? Did you cause any new problems? Make sure you haven’t done any harm to the system. An automated test suite is a nice thing to have at this stage.
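If you do have an automated suite, the cheapest thing to add at this stage is a test that pins down the symptom you just fixed, plus one that confirms the ordinary case still works. A minimal sketch using Python’s `unittest`, where `merge_style_runs` is a hypothetical function standing in for the repaired code:

```python
import unittest


def merge_style_runs(runs):
    """Hypothetical repaired function: join (text, style) runs into one plain string.

    The imagined bug: an older version blew up when adjacent runs had different styles.
    """
    return "".join(text for text, _style in runs)


class MixedStyleRegressionTest(unittest.TestCase):
    def test_mixed_bold_and_italic_no_longer_crashes(self):
        # The exact input from the (imaginary) bug report.
        runs = [("Hello ", "bold"), ("world", "italic")]
        self.assertEqual(merge_style_runs(runs), "Hello world")

    def test_single_style_still_works(self):
        # Guard against the fix breaking the case that used to work.
        self.assertEqual(merge_style_runs([("plain", "regular")]), "plain")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(MixedStyleRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The first test answers “did the symptom go away?” and the second answers “did you cause any new problems?”, at least for the cases you thought to write down.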
Step 9 – Take Pride in your solution
This is the weird one. You could easily stop with the previous step and consider yourself successful. But we are creatures that crave feedback. This is a situation where good feedback is definitely deserved. You’ve reduced the entropy in the universe by fixing this bug. Gloat over it. Bask in it. Remember my friend Jeff with the Christmas lights? That is taking pride in a solution.
Now is also a time of reflection. What worked well? Maybe keeping a log really helped to keep the work on-track. Do more of that next time. What didn’t work well? Maybe a notification spy created way too much volume to wade through and burned a lot of time. This is the time to self-evaluate and figure out what you need to do to level up.
Step 10 – Prevent Future Occurrences of this problem
Now that you’ve fixed-and-basked, spend a little thought time on how to prevent this problem from happening again. You’ve just spent a chunk of your life, which you’ll never get back, solving this problem. You don’t want to solve it again. Repeat bugs are embarrassing. And they’re embarrassing too.
What can help? Crank up the warning level? Run the static analyzer more? Would test-driven development and more automated tests have caught it? Educate fellow developers on dark corners of the language?
Be careful of over-compensating here. It’s really easy in an organization to add new requirements to existing processes which address newly found kinds of bugs. Eventually your release process becomes a 57 page contract requiring buy-in from 18 different teams, and everything grinds to a halt because nobody wants to release any new software any more.
Go forth and Troubleshoot
If you want to learn more about the Universal Troubleshooting Process, visit Steve Litt’s site at Troubleshooters.com. There are ebooks that describe the process in much more detail than the high-level overview I’ve given here. Steve offers on-site training and consulting if you want to learn about this stuff directly from its creator.
As you can probably tell, I’m a fan of the process: it codifies a way of thinking about, and approaching, the job of finding and fixing problems.