I have run large, complex software systems in production. And if I could give people one piece of advice, it's that there are no easy answers. Here are a handful of examples of things that won't solve all your problems:
I could go on for a while.
But then, once you have become deeply pessimistic and paranoid, you finally build a system that runs flawlessly for 4 years. It never needs bug fixes. It never needs attention. It just sits in a corner, doing its job perfectly. In fact, people forget how it works. Then they forget that it's there. People move on. Compiler versions get upgraded. Your CI system gets replaced. Management goes through two different wiki initiatives, losing information each time.
And then one day, someone deprecates an old version of TLS, and the ancient system stops being able to talk to some API. And then the world burns.
So if your model is "a single unwrap shouldn't bring you down, because you should obviously have been doing A, B and C elsewhere", then you're probably just trading off different kinds of disasters.
A better model is "We'll fix potential failures at every possible level. And hopefully, when the shit finally hits the fan, at least one of those levels might hold." So you have staging and monitoring and fallback systems and extensive testing and a chaos monkey and documentation and API "fuses" and back-pressure and load-shedding and proofs and paranoid code reviews and incremental rollout and root cause analysis. And so you fail less, and less, and less. But one day, the fact that you wrote an actual, sensible behavior for 21 items and tested it? That will be what prevents some ludicrous cascading failure.
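To make that last point concrete, here is a minimal sketch of what "an actual, sensible behavior for too many items, plus a test" might look like in Rust. Everything here is invented for illustration (the names, the limit of 20), not CloudFlare's actual code: the idea is just that the oversized case gets a deliberate, tested outcome instead of a panic.

```rust
// Illustrative sketch only, not CloudFlare's actual code: the "too many items"
// case gets a deliberate, tested behavior instead of a panic.

const MAX_FEATURES: usize = 20; // hypothetical capacity limit

/// Keep the first MAX_FEATURES entries and report how many were dropped,
/// so the caller can log or alert instead of crashing.
fn load_features(config: &[String]) -> (Vec<String>, usize) {
    let kept: Vec<String> = config.iter().take(MAX_FEATURES).cloned().collect();
    let dropped = config.len().saturating_sub(MAX_FEATURES);
    (kept, dropped)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn oversized_config_is_truncated_not_fatal() {
        let config: Vec<String> = (0..21).map(|i| format!("feature-{i}")).collect();
        let (kept, dropped) = load_features(&config);
        assert_eq!(kept.len(), MAX_FEATURES);
        assert_eq!(dropped, 1);
    }
}
```

Whether truncating is the right behavior is beside the point; what matters is that the 21-item case has *some* defined, tested behavior.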
TL;DR: It's nice that you have multiply redundant horseback messengers. But still, check the nails in their horseshoes regularly, because for want of a nail, an exciting new complex failure mode was discovered, and the battle was still lost.
This has multiple implications for AI alignment, and few of them are good news.
As I read it, the interesting part isn't the unwrap operation. The interesting part is that a new version was being pushed out incrementally, and the infrastructure components that read from the new version were failing consistently, but awareness of the failure did not propagate back into the component doing the push.
The point of an incremental rollout is to allow a fraction of traffic to be affected by the new version, as a test of whether the new version is actually good. For that to work, failures caused by the new version have to be detected, and that awareness has to propagate back to the system doing the rollout and cause the rollout to stop.
Or you may be doing a rollout of a new abcserver, and the monitoring system tracks that the abcserver itself is successfully starting and taking queries ... but doesn't notice that the new abcserver's responses are causing the jklserver (somewhere else in your infra) to crash.
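As a rough sketch of the feedback loop this implies (every name below is made up; `downstream_error_rate` stands in for whatever monitoring signal covers the things that *consume* the new version's output, the jklservers of the world, not just whether the new binary starts and takes queries):

```rust
// Minimal sketch of a rollout loop that halts on a bad downstream signal.
// All names are invented for illustration.

fn downstream_error_rate(_percent_rolled_out: u8) -> f64 {
    0.0 // placeholder: in reality, a query against your monitoring system
}

fn incremental_rollout(stages: &[u8], error_budget: f64) -> Result<(), String> {
    for &percent in stages {
        // ... push the new version to `percent` of traffic here ...
        let observed = downstream_error_rate(percent);
        if observed > error_budget {
            // The whole point: a bad downstream signal must stop the push
            // (and ideally trigger a rollback), not just page someone later.
            return Err(format!(
                "halting rollout at {percent}%: downstream error rate {observed} exceeds budget {error_budget}"
            ));
        }
    }
    Ok(())
}

fn main() {
    match incremental_rollout(&[1, 5, 25, 100], 0.01) {
        Ok(()) => println!("rollout completed"),
        Err(reason) => println!("rolled back: {reason}"),
    }
}
```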
One possible rule is, "If it looks like there's a global sitewide outage happening, look up all the currently in-progress and recently-completed rollouts. Send an alert to all of those teams. Everyone who's responsible for a recent rollout gets to ponder deep in their hearts whether there's any possibility that their rollout could be breaking the world, and decide whether to roll back."
Another possible rule is, "You can afford to have an agent (currently, probably a human) watching over each rollout; an agent who cares about whether the whole site is crashing hard, and who gets curious about whether their rollout might be responsible."
CloudFlare recently had an incident where some code expected that a list would never contain more than 20 items, and then it was presented with a list of more than 20 items. Internet commenters rushed to point out that the problem was that the code was written in Rust, or that the source code had the word unwrap[1] in it. A surprising number of people argued that they should have just "handled" this error.
I think this is wrong, and it completely misses how software is made robust.
To make software robust, you need the overall system to continue working despite the failure of any component, not a way to prevent any component from failing (since that's impossible).
You can't always fix errors
When CloudFlare's code reached the infamous unwrap, the damage had already been done. The array was pre-allocated and the config was longer than the array, so what could they do in an error handler? Log something and then still fail? Return an error instead of panicking so the failure is more aesthetically pleasing? Dynamically resize the array and cause an out-of-memory error somewhere else in the code?
There's really no path to fixing this error without the benefit of hindsight[2].
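To make those options concrete, here is a small illustrative sketch (invented types and numbers, not CloudFlare's code). Once the input is bigger than the pre-allocated buffer, every way of "handling" the error at this point is just a different flavor of failing:

```rust
// Illustrative sketch: by the time the length check fails, the choices are
// log-and-fail, return an error, or reallocate and move the problem elsewhere.

const CAPACITY: usize = 20; // hypothetical pre-allocated size

struct TooManyItems {
    got: usize,
    max: usize,
}

fn fill_preallocated(items: &[u32], buf: &mut [u32; CAPACITY]) -> Result<(), TooManyItems> {
    if items.len() > CAPACITY {
        return Err(TooManyItems { got: items.len(), max: CAPACITY });
    }
    buf[..items.len()].copy_from_slice(items);
    Ok(())
}

fn main() {
    let oversized: Vec<u32> = (0..21).collect();
    let mut buf = [0u32; CAPACITY];

    // Option A: propagate the error. Prettier than a panic, but the request
    // still can't be served correctly.
    if let Err(e) = fill_preallocated(&oversized, &mut buf) {
        eprintln!("config rejected: got {} items, max {}", e.got, e.max);
    }

    // Option B: what an unwrap amounts to -- turn the same condition into a panic.
    // fill_preallocated(&oversized, &mut buf).unwrap();
}
```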
How do you make robust software then?
If you can't guarantee that your code will succeed, then how can you make software robust? The only option is to make the whole system work even if failures occur.
Some options include:
In the case of the CloudFlare bug, this would look like:
The problem here wasn't that "Rust isn't safe lol"; it was that the overall system couldn't handle a mistake in the code.
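For illustration only, here is one shape that "the overall system handles it" could take, with invented names and a deliberately crude fallback: catch the failure at the config-reload boundary and keep serving with the last configuration that was known to work. This is a sketch of the general pattern, not a claim about what CloudFlare's system should specifically do.

```rust
// Illustrative only; names are invented and the real fix depends on the system.
// The idea: a bug in the config-loading code gets absorbed at the reload
// boundary, and the service keeps running on the last known-good configuration.

use std::panic;

#[derive(Clone)]
struct BotConfig {
    features: Vec<String>,
}

/// Stand-in for the buggy parser: panics on oversized input, as an unwrap would.
fn parse_new_config(raw: &[String]) -> BotConfig {
    assert!(raw.len() <= 20, "more features than the pre-allocated limit");
    BotConfig { features: raw.to_vec() }
}

/// System-level handling: if loading the new config fails for any reason,
/// keep serving with the last config that was known to work.
/// (The default panic hook will still print the panic message to stderr.)
fn reload_config(raw: &[String], last_known_good: &BotConfig) -> BotConfig {
    match panic::catch_unwind(|| parse_new_config(raw)) {
        Ok(new_config) => new_config,
        Err(_) => {
            eprintln!("new config rejected; continuing with last known-good config");
            last_known_good.clone()
        }
    }
}

fn main() {
    let good = BotConfig { features: vec!["baseline".to_string()] };
    let oversized: Vec<String> = (0..21).map(|i| format!("feature-{i}")).collect();

    let active = reload_config(&oversized, &good);
    println!("serving with {} feature(s)", active.features.len());
}
```

A fallback like this only helps if it is genuinely independent of the thing that is failing, which is the point the footnotes below make about redundant implementations and rewrites that inherit the old code's bugs.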
[1] unwrap is a type of assertion in Rust.
[2] Obviously one way to handle the "array is too small" error is to have a larger array, but assuming programmers will never make a mistake is not one of the ways to write robust software.
[3] Like how flight computers work. Importantly, the redundant software needs to be written by a different team and in different ways to reduce the risk that they'll have the same bugs.
[4] Again, these need to be independent implementations. Since this was a rewrite, using the old version as a backup was an option, but it sounds like the new version was based on the old version, so it had the same bug.