tl;dr: A novel framing of the trivial point that your models may not be accounting for all relevant factors. I find it useful for improving the quality of my thinking on the topic. Asking yourself "is the type signature of my design for achieving X actually 'a design for achieving X'?"[1] is a good prompt for checking which parts of your model you have uneasy doubts about, in your heart of hearts.
1. In acoustics, there's something called the RESWING effect. Suppose you want to decrease noise in some area, so you surround it with soundwalls. Your model is simple: separating contiguous chunks of air by hard barriers impedes interactions between those chunks, which should impede sound propagation.
One factor turns out to foil this plan: in the atmosphere, the sound speed isn't fixed. It varies with height, due to wind-speed gradients and temperature differences. This causes sound refraction: soundwaves bend towards regions of lower sound speed, i.e. downwards when the air is colder towards the ground and upwards when it's hotter. In the former case (generally at night), if a soundwave runs into a wall, the refractive effects become stronger due to the discontinuity the wall introduces. This leads to the wave sharply bending downwards, into the would-be shielded area. As a result, at a certain distance behind the wall, the noise is actually amplified.
Formally: You model sound waves as obeying the Helmholtz equation $\nabla^2 p + \frac{\omega^2}{c^2} p = 0$, with a constant sound speed $c$. Your walls place boundary conditions on this equation. However, the correct equation is $\nabla^2 p + \frac{\omega^2}{c(z)^2} p = 0$, where the sound speed $c(z)$ depends on the height $z$, and introducing the same boundary conditions leads to very different solutions.
Thus, something designed to be a noise wall turns out to actually be a converging acoustic lens.
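To make the refraction mechanism concrete, here's a minimal numeric sketch (my own toy, not from the acoustics literature): it uses the standard paraxial ray approximation $\frac{d^2 z}{dx^2} \approx -\frac{1}{c}\frac{dc}{dz}$ rather than the full Helmholtz equation, and the sound-speed gradient of 0.1 m/s per metre of height is made up. A ray that just clears the wall heading slightly upward gets curved back down and returns to ground level a few hundred metres behind it.

```python
import numpy as np

def sound_speed(z, c0=340.0, dcdz=0.1):
    """Toy night-time profile: sound speed rises 0.1 m/s per metre of height."""
    return c0 + dcdz * z

def trace_ray(z0=5.0, slope0=0.05, dcdz=0.1, dx=1.0, x_max=800.0):
    """March a nearly horizontal ray forward; a positive speed gradient
    curves it back down toward the ground (d2z/dx2 ~ -(dc/dz)/c)."""
    z, slope = z0, slope0
    heights = []
    for _ in range(int(x_max / dx)):
        slope -= (dcdz / sound_speed(z)) * dx   # downward curvature
        z = max(z + slope * dx, 0.0)            # clamp at ground level
        heights.append(z)
    return np.array(heights)

heights = trace_ray()
print(f"ray tops out at {heights.max():.1f} m, {heights.argmax()} m past the wall,")
print(f"and is back at ground level ~{int(np.argmax(heights == 0.0))} m behind it.")
```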
2. Suppose you design an algorithm for high-frequency trading. In theory, HFTs should increase market efficiency, speeding up price discovery and narrowing bid-ask spreads.
However, if an HFT is deployed into an environment with many other HFTs, this can cause a feedback loop where they start playing "hot potato" with each other. A system whose expected effect was that of a market stabilizer turns out to be a (conditional) market-volatility amplifier.
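The precise microstructure of that "hot potato" dynamic is messy, but the flavor of the failure shows up in even a crude toy (the update rule and all numbers below are invented for illustration, not a real market model): a rule that damps price deviations when one agent runs it becomes explosive when several agents run it at once, because no agent accounts for everyone else applying the same correction.

```python
import numpy as np

def simulate(n_agents, k=0.6, steps=200, seed=0):
    """Each agent leans against the current mispricing with strength k,
    ignoring the fact that the other agents are doing the same."""
    rng = np.random.default_rng(seed)
    deviation = 1.0                  # initial mispricing
    path = [deviation]
    for _ in range(steps):
        correction = n_agents * k * deviation    # everyone corrects at once
        deviation = deviation - correction + rng.normal(0, 0.01)
        path.append(deviation)
    return float(np.std(path))

for n in (1, 2, 4):
    print(f"{n} agent(s): deviation std = {simulate(n):.3g}")
# one agent damps the shock; four agents running the same rule blow it up
```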
3. From Why Agent Foundations?:
Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel. We design the system so that the leaked radio signal has zero correlation with whatever signals are passed around inside the system.
Some time later, a clever adversary is able to use the radio side-channel to glean information about those internal signals using fourth-order statistics. Zero correlation was an imperfect proxy for zero information leak, and the proxy broke down under the adversary’s optimization pressure.
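A quick numeric sketch of that proxy failure (the signal model below is invented for illustration, not taken from the quoted post): the leaked signal has essentially zero linear correlation with the internal one, yet a higher-order statistic, here the correlation of the squared internal signal with the leak, which amounts to a fourth-order moment, recovers the dependence almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
secret = rng.standard_normal(100_000)                         # internal signal
leak = secret**2 - 1 + 0.1 * rng.standard_normal(100_000)     # radio side-channel

# the designer's check: linear correlation is ~0, so "no leak"
print("linear correlation:   ", round(float(np.corrcoef(secret, leak)[0, 1]), 3))
# the adversary's check: a fourth-order statistic reads the secret's magnitude
print("corr(secret**2, leak):", round(float(np.corrcoef(secret**2, leak)[0, 1]), 3))
```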
4. The Cobra Effect: if you reward people for bringing you dead cobras, in an attempt to set up a cobra-elimination effort, what you actually create is a cobra farming industry.
5. Suppose you're training an AI model not to deceive you. However, your deception-detecting methodology is flawed: some lies slip through. What you're actually training, then, is not something honest, but a good liar.
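A toy selection sketch of why that happens (the setup and all numbers are made up; this isn't a claim about any real training pipeline): the only trait the flawed detector's verdict can distinguish is how often a lie slips past it, so optimizing against that verdict steadily ratchets up evasion skill rather than honesty.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(generations=300, population=500, mutation=0.03):
    """Truncation selection on 'fraction of lies that evade the detector'."""
    evasion = rng.uniform(0.0, 0.3, population)   # initial lie quality is low
    for _ in range(generations):
        score = rng.binomial(20, evasion) / 20            # 20 noisy "episodes"
        keep = np.argsort(score)[population // 2:]        # keep the top half
        children = np.clip(evasion[keep] + rng.normal(0, mutation, keep.size), 0, 1)
        evasion = np.concatenate([evasion[keep], children])
    return float(evasion.mean())

print(f"mean evasion rate after training: {train():.2f}")   # climbs toward 1
```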
6. If you try to infer the purpose of a (sociopolitical) system from what it does, the result will often fail to match the purpose its designers actually had in mind.
Consider designing a system for a specific purpose, then deploying it to the environment in which it's meant to fulfil that purpose. Abstractly, you can think about it as follows: there's a low-level concrete environment $L$, a high-level representation of that environment $H$, and an "abstraction operator" $\alpha: L \to H$, mapping from low-level states to high-level states.
You define your system's purpose/type signature $T$ in terms of the high-level environment, then invert the abstraction operator, and feed this type signature to it. In theory, the inverted operator then outputs a low-level system $s = \alpha^{-1}(T)$ such that, if you abstract back up from it, it would match the type signature of your target system: $\alpha(\alpha^{-1}(T)) = T$. In other words, you search for whatever low-level system would fill the high-level role you actually care about.
However, suppose that your model of the low-level environment is flawed. If so, your abstraction operator doesn't match the "ground-true" abstraction operator either: the correct operator has the type signature $\alpha: L \to H$, whereas what you're using is some $\tilde{\alpha}: \tilde{L} \to H$, defined over your flawed model $\tilde{L}$ of the low-level environment. What would happen, then, if you invert your intended type signature through your flawed $\tilde{\alpha}$?
At the next turn, Reality would apply the correct $\alpha$ to the output of $\tilde{\alpha}^{-1}$, and you'll most likely discover that $\alpha(\tilde{\alpha}^{-1}(T)) \neq T$: the high-level type signature of the deployed system would not match the designated type signature. In other words, it will turn out that the system you deployed was something entirely different from what you thought you were deploying.
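Here's a schematic sketch of that argument in code (the types, names, and thresholds are all invented toys; `alpha_true` plays the role of the ground-true $\alpha$, `alpha_model` the role of your flawed $\tilde{\alpha}$): you pick a low-level system by inverting your model of the abstraction, and Reality then grades the result with the true abstraction.

```python
from typing import Callable

Low = int      # toy low-level state, e.g. wall height in metres
High = str     # toy high-level role

def alpha_true(wall: Low) -> High:
    """Ground-true abstraction: what role the world actually assigns."""
    return "noise wall" if wall < 10 else "acoustic lens"   # refraction kicks in

def alpha_model(wall: Low) -> High:
    """The designer's flawed abstraction: refraction isn't in the model."""
    return "noise wall"                                      # any wall looks fine

def invert(model: Callable[[Low], High], target: High) -> Low:
    """Search for whatever low-level system fills the high-level role,
    preferring the biggest wall the model will sign off on."""
    return max(w for w in range(1, 50) if model(w) == target)

deployed = invert(alpha_model, "noise wall")
print(alpha_model(deployed))   # "noise wall"    -- matches the intention
print(alpha_true(deployed))    # "acoustic lens" -- what actually got deployed
```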
A way to think about it is that the system's nature is not solely defined by its own "shape", but also by whether this shape matches some negative space in the environment: whether it correctly slots into/fills some role. If you mismodeled the shape of the role you want to fill, your system's shape would slot into some other role.
So even if you completely and correctly understand how your system's internals function, its true nature may still surprise you.
When designing something, there's often a temptation to mentally "bend" your predictions regarding how it would behave, especially if you have some uncertainty over the low-level deployment environment $L$. To quietly assume that this uncertainty would resolve itself in such a way that your system would behave just as planned. To, essentially, unconsciously hope that the world would understand what you meant, and then just do that. Or, at least, that the pockets of uncertainty won't interact with your system in any non-random way: that it doesn't matter that you're uncertain about their values, because randomizing those values would not impact the result anyway!
In the moment, alternate possibilities may feel like conspiracy theories. Like expecting the world to unnaturally contort itself to foil your plans, rather than to merely refuse to cooperate.
This mindset is easier to slip into if you think about unintended consequences as the result of "mistakes". If you're worrying about "mistakes" in implementing your system, you're implicitly assuming that what you're building is basically the right thing, save for a relatively small number of potential errors you may let slip through. When staring at your project, you're always cross-referencing what you're seeing with your intended end result, trying to spot deviations. But if you always have that intended end result at the forefront of your mind, it's liable to bias your interpretation of what you see.[2]
The mindset I'm gesturing towards, on the other hand, doesn't think in terms of "mistakes". Instead, it invites you to momentarily forget your "intentions", forget the "purpose" of your project from which its implementation may "deviate", and look at what you're building afresh. What is the actual true nature of that thing? What does it ask the world to do?
You may discover that it has nothing to do with your intended goal at all; that it isn't even close to it in concept-space.
And that's the way the world would look at it. The world is indifferent and unintelligent. It doesn't care or understand what you meant, and it won't query your plans and intentions when deciding how to interpret what you fed it.
The main thing on my mind here is, of course, the AGI doom.
Goodharting is an obvious and important example of the principle I'm talking about, but the principle is broader. When your soundwalls turn out to function as acoustic lenses and blast someone with noise, or when your stable bridge behaves as a self-oscillation amplifier and collapses, it's not because something out there Goodharted on the sound-suppression/bridge-stabilization task.
Similarly, the AGI risk is not centrally about a powerful agent deliberately scheming to deceive you and kill you. It's a salient and important failure mode, but it's very far from the only one. The core issue is that, by default, you don't really know what you're building until you deploy it: what role the world will actually assign your system. Unless you've rigorously eliminated all uncertainty, all you have are hopeful fantasies about the world cooperating with your vision in all the places where that uncertainty exists.
And if trial-and-error is extremely costly, so lethally costly that you can't pay it at all... Well, that gives some idea of the mindset and the methodology you need to adopt if you want to first-try it.
[1] This is probably a sazen; i.e., it summarizes this post if you're already familiar with the relevant mental motions, but doesn't substitute for it otherwise.
[2] In the standard predictive-processing way. There's something you expect your system to be, so your brain would propagate that prediction top-down, biasing your attempt at a bottom-up understanding/error-spotting. You would see what you expect to see, and look past what you actually see.