I have run large, complex software systems in production. And if I could give people one piece of advice, it's that there are no easy answers. Here are a handful of examples of things that won't solve all your problems:
I could go on for a while.
But then, once you have become deeply pessimistic and paranoid, you finally build a system that runs flawlessly for 4 years. It never needs bug fixes. It never needs attention. It just sits in a corner, doing its job perfectly. In fact, people forget how it works. Then they forget that it's there. People move on. Compiler versions get upgraded. Your CI system gets replaced. Management goes through two different wiki initiatives, losing information each time.
And then one day, someone deprecates an old version of TLS, and the ancient system stops being able to talk to some API. And then the world burns.
So if your model is "a single unwrap shouldn't bring you down, because you should obviously have been doing A, B and C elsewhere", then you're probably just trading off different kinds of disasters.
A better model is "We'll fix potential failures at every possible level. And hopefully, when the shit finally hits the fan, at least one of those levels might hold." So you have staging and monitoring and fallback systems and extensive testing and a chaos monkey and documentation and API "fuses" and back-pressure and load-shedding and proofs and paranoid code reviews and incremental rollout and root cause analysis. And so you fail less, and less, and less. But one day, the fact that you wrote an actual, sensible behavior for 21 items and tested it? That will be what prevents some ludicrous cascading failure.
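To make that last point concrete, here is a minimal sketch of what "an actual, sensible behavior for too many items, plus a test" might look like in Rust. Everything here is invented for illustration (the names, the limit of 20), not CloudFlare's actual code: the idea is just that the oversized case gets a deliberate, tested outcome instead of a panic.

```rust
// Illustrative sketch only, not CloudFlare's actual code: the "too many items"
// case gets a deliberate, tested behavior instead of a panic.

const MAX_FEATURES: usize = 20; // hypothetical capacity limit

/// Keep the first MAX_FEATURES entries and report how many were dropped,
/// so the caller can log or alert instead of crashing.
fn load_features(config: &[String]) -> (Vec<String>, usize) {
    let kept: Vec<String> = config.iter().take(MAX_FEATURES).cloned().collect();
    let dropped = config.len().saturating_sub(MAX_FEATURES);
    (kept, dropped)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn oversized_config_is_truncated_not_fatal() {
        let config: Vec<String> = (0..21).map(|i| format!("feature-{i}")).collect();
        let (kept, dropped) = load_features(&config);
        assert_eq!(kept.len(), MAX_FEATURES);
        assert_eq!(dropped, 1);
    }
}
```

Whether truncating is the right behavior is beside the point; what matters is that the 21-item case has *some* defined, tested behavior.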
TL;DR: It's nice that you have multiply redundant horseback messengers. But still, check the nails in their horseshoes regularly, because for want of a nail, an exciting new complex failure mode was discovered, and the battle was still lost.
This has multiple implications for AI alignment, and few of them are good news.
As I read it, the interesting part isn't the unwrap operation. The interesting part is that a new version was being pushed out incrementally, and the infrastructure components that read from the new version were failing consistently, but awareness of the failure did not propagate back into the component doing the push.
The point of an incremental rollout is to allow a fraction of traffic to be affected by the new version, as a test of whether the new version is actually good. For that to work, failures caused by the new version have to be detected, and that awareness has to propagate back to the system doing the rollout and cause the rollout to stop.
Or you may be doing a rollout of a new abcserver, and the monitoring system tracks that the abcserver itself is successfully starting and taking queries ... but doesn't notice that the new abcserver's responses are causing the jklserver (somewhere else in your infra) to crash.
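As a rough sketch of the feedback loop this implies (every name below is made up; `downstream_error_rate` stands in for whatever monitoring signal covers the things that *consume* the new version's output, the jklservers of the world, not just whether the new binary starts and takes queries):

```rust
// Minimal sketch of a rollout loop that halts on a bad downstream signal.
// All names are invented for illustration.

fn downstream_error_rate(_percent_rolled_out: u8) -> f64 {
    0.0 // placeholder: in reality, a query against your monitoring system
}

fn incremental_rollout(stages: &[u8], error_budget: f64) -> Result<(), String> {
    for &percent in stages {
        // ... push the new version to `percent` of traffic here ...
        let observed = downstream_error_rate(percent);
        if observed > error_budget {
            // The whole point: a bad downstream signal must stop the push
            // (and ideally trigger a rollback), not just page someone later.
            return Err(format!(
                "halting rollout at {percent}%: downstream error rate {observed} exceeds budget {error_budget}"
            ));
        }
    }
    Ok(())
}

fn main() {
    match incremental_rollout(&[1, 5, 25, 100], 0.01) {
        Ok(()) => println!("rollout completed"),
        Err(reason) => println!("rolled back: {reason}"),
    }
}
```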
One possible rule is, "If it looks like there's a global sitewide outage happening, look up all the currently in-progress and recently-completed rollouts. Send an alert to all of those teams. Everyone who's responsible for a recent rollout gets to ponder deep in their hearts whether there's any possibility that their rollout could be breaking the world, and decide whether to roll back."
Another possible rule is, "You can afford to have an agent (currently, probably a human) watching over each rollout; an agent who cares about whether the whole site is crashing hard, and who gets curious about whether their rollout might be responsible."
CloudFlare recently had an incident where some code expected that a list would never contain more than 20 items, and then it was presented with a list of more than 20 items. Internet commenters rushed to point out that the problem was that the code was written in Rust, or that the source code had the word unwrap[1] in it. A surprising number of people argued that they should have just "handled" this error.
I think this is wrong, and it completely misses how software is made robust.
To make software robust, you need the overall system to continue working despite the failure of any component, not a way to prevent any component from failing (since that's impossible).
You can't always fix errors
When CloudFlare's code reached the infamous unwrap, the damage had already been done. The array was pre-allocated and the config was longer than the array, so what could they do in an error handler? Log something and then still fail? Return an error instead of panicking so the failure is more aesthetically pleasing? Dynamically resize the array and cause an out-of-memory error somewhere else in the code?
There's really no path to fixing this error without the benefit of hindsight[2].
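To make those options concrete, here is a small illustrative sketch (invented types and numbers, not CloudFlare's code). Once the input is bigger than the pre-allocated buffer, every way of "handling" the error at this point is just a different flavor of failing:

```rust
// Illustrative sketch: by the time the length check fails, the choices are
// log-and-fail, return an error, or reallocate and move the problem elsewhere.

const CAPACITY: usize = 20; // hypothetical pre-allocated size

struct TooManyItems {
    got: usize,
    max: usize,
}

fn fill_preallocated(items: &[u32], buf: &mut [u32; CAPACITY]) -> Result<(), TooManyItems> {
    if items.len() > CAPACITY {
        return Err(TooManyItems { got: items.len(), max: CAPACITY });
    }
    buf[..items.len()].copy_from_slice(items);
    Ok(())
}

fn main() {
    let oversized: Vec<u32> = (0..21).collect();
    let mut buf = [0u32; CAPACITY];

    // Option A: propagate the error. Prettier than a panic, but the request
    // still can't be served correctly.
    if let Err(e) = fill_preallocated(&oversized, &mut buf) {
        eprintln!("config rejected: got {} items, max {}", e.got, e.max);
    }

    // Option B: what an unwrap amounts to -- turn the same condition into a panic.
    // fill_preallocated(&oversized, &mut buf).unwrap();
}
```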
How do you make robust software then?
If you can't guarantee that your code will succeed, then how can you make software robust? The only option is to make the whole system work even if failures occur.
Some options include:
In the case of the CloudFlare bug, this would look like:
The problem here wasn't that "Rust isn't safe lol"; it was that the overall system couldn't handle a mistake in the code.
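For illustration only, here is one shape that "the overall system handles it" could take, with invented names and a deliberately crude fallback: catch the failure at the config-reload boundary and keep serving with the last configuration that was known to work. This is a sketch of the general pattern, not a claim about what CloudFlare's system should specifically do.

```rust
// Illustrative only; names are invented and the real fix depends on the system.
// The idea: a bug in the config-loading code gets absorbed at the reload
// boundary, and the service keeps running on the last known-good configuration.

use std::panic;

#[derive(Clone)]
struct BotConfig {
    features: Vec<String>,
}

/// Stand-in for the buggy parser: panics on oversized input, as an unwrap would.
fn parse_new_config(raw: &[String]) -> BotConfig {
    assert!(raw.len() <= 20, "more features than the pre-allocated limit");
    BotConfig { features: raw.to_vec() }
}

/// System-level handling: if loading the new config fails for any reason,
/// keep serving with the last config that was known to work.
/// (The default panic hook will still print the panic message to stderr.)
fn reload_config(raw: &[String], last_known_good: &BotConfig) -> BotConfig {
    match panic::catch_unwind(|| parse_new_config(raw)) {
        Ok(new_config) => new_config,
        Err(_) => {
            eprintln!("new config rejected; continuing with last known-good config");
            last_known_good.clone()
        }
    }
}

fn main() {
    let good = BotConfig { features: vec!["baseline".to_string()] };
    let oversized: Vec<String> = (0..21).map(|i| format!("feature-{i}")).collect();

    let active = reload_config(&oversized, &good);
    println!("serving with {} feature(s)", active.features.len());
}
```

A fallback like this only helps if it is genuinely independent of the thing that is failing, which is the point the footnotes below make about redundant implementations and rewrites that inherit the old code's bugs.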
[1] unwrap is a type of assertion in Rust.
[2] Obviously one way to handle the "array is too small" error is to have a larger array, but assuming programmers will never make a mistake is not one of the ways to write robust software.
[3] Like how flight computers work. Importantly, the redundant software needs to be written by a different team and in different ways to reduce the risk that they'll have the same bugs.
[4] Again, these need to be independent implementations. Since this was a rewrite, using the old version as a backup was an option, but it sounds like the new version was based on the old version, so it had the same bug.