Insert Error Handling Here

[Image: Blue Screen of Death on a public monitor]

Programming textbooks, documentation, and example code all tend to gloss over error handling.

There are legitimate reasons for this: error handling code takes up a lot of space, and that can fill up a textbook. Since it can take up more space than the “straight-path” code, it can distract from the goal of teaching that straight path. So we end up with example code with comments like:

// add error checking here

As a result of this, many software developers never learn good error handling techniques.

See?

A few years back, I went in for LASIK eye surgery. That’s where a computer-directed laser burns your eye into the proper shape so you can see better (or something like that; the details aren’t important). When researching my options, I read advice that repeated the importance of having a good doctor in charge. In almost all cases, the computer-directed laser does a fine job and the doctor is mostly a bystander. But on the off chance that something goes wrong, you want someone who can step in and set things right.

That’s the importance of error handling.

You may test your software on crystal clean data on a super-high-speed internet connection, with well-behaved users, on a device with tons of storage space.

But in the real world, in the wild wild west where software actually gets used, those things may not be the case. Phones run out of storage; internet connections go out or slow down; users tap the wrong buttons, forget their passwords. Some even intentionally break things. Data gets corrupted, and is often wrong and incomplete.

Your code needs to handle all these things.

The test of a robust program is not what happens when all conditions are perfect, but how it handles those situations when something goes wrong. Excuse the dramatic title, but in this sense error handling is programming: you can’t do good programming without good error handling.

General techniques

It’s hard to write general advice about error handling, because the appropriate behavior varies so much between products. But it’s a good idea to start at the top level.

How do you want to present errors to the user? It’s better if your program can resolve the issue without involving the user at all, but of course that’s not always possible. Therefore, take some time to think about how to let the user know that something has gone wrong. The user experience for error-cases is often an afterthought, but that is when your users’ sense of goodwill is at its most vulnerable, so you should make sure to reassure them.

Differentiate between technical error descriptions (like “invalid JSON response”) and user-facing messages (like “there was a problem getting your information”). You need both. You may even want a marketing or UX team to write the user-facing messages. Get them translated into whatever languages you support. Make the messages professional, not cutesy. (Saying “oopsy!” to an already annoyed user is not going to go well.)

Show the user-friendly errors to the user, and write the technical details to a log file or service.
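Here is a minimal Python sketch of that split. The names AppError, report, and load_profile are made up for illustration, and the logger name is an assumption. The idea is that one error object carries both messages, so the technical text goes to the log and only the friendly text reaches the screen.

```python
import json
import logging

logger = logging.getLogger("app.errors")  # assumed logger name


class AppError(Exception):
    """An error that carries a technical description and a user-facing message."""

    def __init__(self, technical: str, user_message: str):
        super().__init__(technical)
        self.technical = technical
        self.user_message = user_message


def report(error: AppError) -> str:
    """Write the technical details to the log and return the text to show the user."""
    logger.error("AppError: %s", error.technical)
    return error.user_message


def load_profile(raw_response: str):
    """Hypothetical loader: parse a response body, raising AppError if it is not JSON."""
    try:
        return json.loads(raw_response)
    except json.JSONDecodeError as exc:
        raise AppError(
            technical=f"invalid JSON response: {exc}; body={raw_response!r}",
            user_message="There was a problem getting your information.",
        ) from exc
```

A caller would wrap load_profile in a try block, pass any AppError to report, and display the returned string. In a real product the user message would more likely be a key into a translation table than a hard-coded English sentence.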

The technical error report should include as much information as possible: anything that could help you track down what went wrong. For example, include the full text of a network response, if feasible. In pre-release builds, you could even include source files and line numbers.

It’s useful to have a general logging system in place too, so you can look back in time to see what led up to the error.

One word of caution: be careful not to create a security risk with your logging. Avoid storing sensitive data such as credit card numbers, passwords, or access tokens in a log file or crash log.
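One way to reduce that risk is to scrub known sensitive fields before anything reaches the log. The sketch below is only an illustration: the field names in SENSITIVE_KEYS are assumptions, and it handles nested dictionaries but not lists or free-form strings, which a real redaction layer would also need to cover.

```python
SENSITIVE_KEYS = {"password", "card_number", "access_token"}  # assumed field names


def redact(payload: dict) -> dict:
    """Return a copy of payload with sensitive values masked, leaving the input untouched."""
    cleaned = {}
    for key, value in payload.items():
        if str(key).lower() in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, dict):
            cleaned[key] = redact(value)  # recurse into nested objects
        else:
            cleaned[key] = value
    return cleaned
```

You would call redact on a request or response body just before logging it, so the original data is still available to the code that needs it.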

Recover if possible

As a user, there are few things more frustrating than an error message like “something went wrong,” with no explanation given and no way to continue your task. Defensive programming means you anticipate certain classes of error and plan ways to recover from them.

A good example of this is a network connection failure. In a mobile app especially, you should expect your network calls to fail occasionally. In most of my mobile front-end code, I automatically retry a certain number of times with an exponential back-off.

You don’t have to be that sophisticated, but a simple “retry” button can go a long way towards improving the user experience.
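For the automatic version, a retry loop with exponential back-off can be quite small. This is a sketch rather than my production code: the function name, the default of four attempts, and the half-second starting delay are all assumptions, and it only retries on Python's built-in ConnectionError.

```python
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ConnectionError, wait and try again, doubling the wait each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller show an error or a retry button
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

The sleep parameter is there so tests can pass in a fake and run instantly. When the final attempt fails, the exception propagates, which is the point where a manual “retry” button makes sense.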

Validate external data

It should go without saying that your app should not crash on unexpected data. Invalid input, an incorrect response from an API, corrupted filesystem data: you should handle all of these properly.

A good rule of thumb is that nothing that comes from outside your code should be trusted unless you validate it. Think of the validation step as a firewall between the untrusted outside world and your nice safe environment full of trustworthy data.

Handle validation errors as best you can. In the worst case, you may lose some data. Best case, you can retry and get a better result.
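As an illustration of that firewall, here is a Python sketch that turns an untrusted JSON string into a typed record or raises a single, predictable error. The UserRecord fields and their limits are invented for the example.

```python
import json
from dataclasses import dataclass


class ValidationError(Exception):
    """Raised when external data does not match what the program expects."""


@dataclass
class UserRecord:  # hypothetical record; fields and limits are assumptions
    name: str
    age: int


def parse_user(raw) -> UserRecord:
    """Validate untrusted input; everything returned from here can be trusted."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError) as exc:
        raise ValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValidationError("expected a JSON object")
    name = data.get("name")
    age = data.get("age")
    if not isinstance(name, str) or not (1 <= len(name) <= 100):
        raise ValidationError("name must be a string of 1 to 100 characters")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, int) or not (0 <= age <= 150):
        raise ValidationError("age must be an integer from 0 to 150")
    return UserRecord(name=name, age=age)
```

Code past this point works with UserRecord and never touches the raw string, so there is exactly one place to catch ValidationError and decide whether to retry, skip the record, or tell the user.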

Crash during development if it’s a programming bug

In a released product, recovering from errors is a good goal. During development, though, you often want to crash immediately when you hit a programming bug, so that you find out about the problem as soon as it happens.

Fail early and fail mindfully. Most popular languages allow for assertions: a way to trigger fatal errors in a development setting but let things pass in released code. Assertions are a valuable tool for catching programming errors. They also serve to document programming assumptions.

As a simple example, if you have a function that can only handle integers within a certain range, include a line at the beginning of the function like so:

assert(x > 0); // Results are undefined for non-positive numbers.

A simple comment can help with debugging if the assertion fails.
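The same idea in Python, as a small sketch with a made-up function: the assert statement takes an optional message, and Python skips assert statements entirely when run with the -O flag, which matches the “fatal in development, ignored in release” behavior described above.

```python
def average_speed(distance_km: float, hours: float) -> float:
    """Hypothetical helper used only to illustrate an assertion."""
    # Programming assumption: callers never pass a non-positive duration.
    assert hours > 0, f"hours must be positive, got {hours!r}"
    return distance_km / hours
```

Because the check disappears under -O, assertions like this are for catching bugs in your own code, not for validating data that comes from outside.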

Test bad inputs

Make sure your test routine involves bad data. Test numbers that are too high and too low. Test numeric inputs that are not valid numbers. Test bad JSON, missing parameters, strings that are too long or too short.

In other words, try to break your code. When I do this type of testing, I almost always find some plausible error case that I hadn’t properly handled.
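A table of bad inputs makes this kind of testing cheap to repeat. In the sketch below, parse_quantity is a hypothetical function under test, and the list of inputs is only a starting point; the check is that every bad value is rejected with a clean ValueError rather than being accepted or causing some other crash.

```python
def parse_quantity(text, minimum=1, maximum=99):
    """Hypothetical function under test: parse a quantity typed by a user."""
    if not isinstance(text, str):
        raise ValueError("quantity must be text")
    stripped = text.strip()
    if not stripped.isascii() or not stripped.isdigit():
        raise ValueError("quantity must be a whole number")
    if len(stripped) > 9:
        raise ValueError("quantity has too many digits")
    value = int(stripped)
    if not minimum <= value <= maximum:
        raise ValueError(f"quantity must be between {minimum} and {maximum}")
    return value


# Too low, too high, not a number, empty, absurdly long, wrong type, non-ASCII digit.
BAD_INPUTS = ["", "   ", "abc", "-1", "0", "100", "1e3", "3.5", "NaN",
              "9" * 10_000, None, "\u0663"]


def run_bad_input_tests():
    """Return the inputs that were not rejected cleanly; an empty list means all passed."""
    failures = []
    for bad in BAD_INPUTS:
        try:
            parse_quantity(bad)
        except ValueError:
            continue  # rejected cleanly, as intended
        except Exception as exc:  # any other exception type is a handling gap
            failures.append((bad, repr(exc)))
        else:
            failures.append((bad, "accepted"))
    return failures
```

The last entry is an Arabic-Indic digit, which Python's int() would happily convert to 3. That is the sort of plausible case that tends to surface only when you go looking for it.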

Decide on an appropriate level of robustness

The amount of error handling you would use on a competitive mine-sweeper game is different from what you would use on, say, an actual mine-sweeper. You may choose not to handle certain classes of errors, and as long as you are aware of the tradeoff you’re making, that is a perfectly valid decision.

Weigh the risk of the error happening versus the cost of adding complexity to your code. Some factors to consider include:

  • How likely is the error?
  • How catastrophic would the consequences of the error be (at one extreme, physical injury; at the other, a minor glitch in video playback)?
  • How hard is the error to recover from?

For example, I often choose not to handle out-of-memory errors, because they can be extremely hard to recover from and don’t come up very often, even on smartphones.

And More

Of course, there is much more to error handling and the writing of robust software than I can cover in a short blog post. Hopefully this got you thinking more about this important topic.