Diagonalization: A (slightly) more rigorous model of paranoia

by habryka
16th Nov 2025
6 min read

In my post on Wednesday (Paranoia: A Beginner's Guide), I talked at a high level about the experience of paranoia, and gave two models (the lemons market model and the OODA loop model) that try to get us a bit closer to understanding its nature and purpose.

I then made a big claim that went largely unargued in the post, that there are three kinds of strategies that make sense to pursue in adversarial information environments: 

  • You blind yourself
  • You eliminate the sources of deception
  • You act unpredictably

Now, Unnamed brought up a very reasonable critique in the comments! Why would there be exactly three strategies that make sense? How can we have any confidence that there isn't a 4th kind of strategy that works? 

And, in reality, the space of strategies is huge! Many of the most effective strategies (like building networks of trust, hiring independent auditors, performing randomized experiments, and "getting better at figuring out the truth on your own") don't neatly fit into the categories in that post. Maybe they can somehow be forced into this ontology, but IMO they are not a great fit. 

But I argue that there is a semi-formal model in which this set of three strategies fully covers the space of possible actions, and that, as such, decomposing the space of strategies into these three categories is more natural than just "I pulled these three strategies out of my bag and randomly declared them the only ones". This semi-formal model also introduces the term "diagonalization", which I have found to be a useful handle.


I think "paranoia" centrally becomes adaptive when you are in conflict with a "more competent"[1] adversary. Now, we unfortunately do not have a generally accepted and well-formalized definition of "competence", especially in environments with multiple agents. However, I think we can at least talk about some extreme examples where an agent is "strictly more competent" than another agent.

One such possible definition of "strictly more competent" is when the more competent agent can cheaply[2] predict everything the other agent will do (even including how it will react to the bigger agent's attempts at doing so). In such cases the stronger agent in some sense "contains" the smaller agent.

When a larger agent contains a smaller agent this way, the smaller agent can simply be treated like any other part of the environment. If you want to achieve a goal, you simply figure out what action of yours produces the best outcome, including the reaction from the smaller agent.

You can solve this optimization problem with brute-force search if the input space is small and the agent and environment are deterministic, or with something like gradient descent if the input space is big and the agent is nondeterministic.

I often refer to this as the act of "diagonalizing" against the smaller agent.
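
To make the brute-force case concrete, here is a minimal sketch (my own toy example; the agents, actions, and payoffs are all invented for illustration) of what diagonalizing a fully predictable agent looks like: the smaller agent's reaction is folded into the outcome evaluation, and choosing an action is just a search over your own options.

```python
# Toy sketch of diagonalization under the "strictly more competent" assumption:
# the bigger agent has a perfect, cheap model of the smaller agent, so the
# smaller agent is just another piece of the environment to optimize over.

BIG_ACTIONS = ["feint_left", "feint_right", "hold"]

def small_agent_model(big_action: str) -> str:
    """Assumed perfect prediction of how the smaller agent reacts to what it observes."""
    return {"feint_left": "dodge_right",   # it dodges away from the feint
            "feint_right": "dodge_left",
            "hold": "stand"}[big_action]

def big_payoff(big_action: str, small_reaction: str) -> float:
    """The bigger agent's utility over joint outcomes (arbitrary toy numbers)."""
    return {("feint_left", "dodge_right"): 2.0,
            ("feint_right", "dodge_left"): 1.0,
            ("hold", "stand"): 0.0}[(big_action, small_reaction)]

# Brute-force search over the bigger agent's own action space, with the
# smaller agent's (perfectly predicted) reaction included in each evaluation.
best_action = max(BIG_ACTIONS, key=lambda a: big_payoff(a, small_agent_model(a)))
print(best_action)  # -> "feint_left"
```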

Sidebar on the origin of the term "diagonalization"

I've encountered the term "diagonalization" for this kind of operation in MIRI-adjacent circles. I am not even sure whether I am using the term the same way they are using it, but I have found the way I am using it to be a very useful handle (though with a terribly inaccessible name that IMO we really should change).

The origin of this term is unclear to me but the first mention that I can find for it is this 2012 @Vladimir_Nesov post. Applying the ideas in that post to the situation of having a "larger" and a "smaller" agent roughly looks as follows (please forgive my probably kind of botched explanation, and I invite anyone who was more involved with the etymology of the term to give a better one):

The problem with trying to predict what an adversary will do in response to your actions is of course that they will be trying to do the same to you.

Now, let's say the smaller agent was trying to predict what the bigger agent was doing and to adapt to that. Then the bigger agent could simply use their simulation of the smaller agent to identify which scenarios the smaller agent chose to predict and adapt to, and then choose an action outside of that set. This is similar to how Cantor's diagonal argument shows there are more reals than rationals: the rationals can be put into a list, but given any list of numbers you can construct a new number that differs from the n-th entry in its n-th digit, and so lies outside the list entirely.
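
For intuition, here is a tiny sketch of the diagonal construction the metaphor borrows from (my own illustrative snippet, not from Nesov's post): given any finite list of decimal expansions, build a number that differs from the n-th entry in its n-th digit.

```python
# Tiny sketch of the diagonal construction: escape any given list of numbers
# by differing from the n-th entry in its n-th decimal digit.

listed = ["0.500000", "0.333333", "0.142857", "0.101001"]  # any listing will do

def escape_the_list(listed_numbers):
    digits = []
    for n, number in enumerate(listed_numbers):
        nth_digit = number[n + 2]                        # n-th digit after "0."
        digits.append("5" if nth_digit != "5" else "4")  # pick anything different
    return "0." + "".join(digits)

print(escape_the_list(listed))  # -> "0.4555", which differs from every entry
```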

We can prove by contradiction that if one agent is capable of perfectly predicting another agent, the other agent cannot in turn do the same. If the smaller agent were also perfectly predicting the bigger agent, then the bigger agent couldn't be perfectly predicting the smaller agent, as doing so would trigger an infinite regress (and run into the halting problem). As such, there must be at least one scenario in which the smaller agent isn't able to predict what the bigger agent will do.
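
A toy illustration of that regress (my own sketch; the agents and their "zig"/"zag" choices are invented): if each agent's decision procedure calls a perfect simulation of the other, the computation never bottoms out.

```python
import sys

sys.setrecursionlimit(100)  # keep the demo short

def bigger_agent() -> str:
    # The bigger agent decides by first simulating the smaller agent...
    return "zig" if smaller_agent() == "zag" else "zag"

def smaller_agent() -> str:
    # ...but the smaller agent decides by first simulating the bigger agent.
    return "zag" if bigger_agent() == "zig" else "zig"

try:
    bigger_agent()
except RecursionError:
    print("Mutual perfect prediction never terminates.")
```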

Now, in the situation of facing an opponent who is "strictly more competent", as defined above, your choices are quite limited. You have been "diagonalized against": every move of yours has been predicted with perfect accuracy, and your opponent has prepared the best countermeasure for each. The best you can do is to operate on a minimax strategy, where you take actions assuming your opponent is playing strictly optimally against you, and maybe try to eke out a bit of utility along the way.
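
As a sketch of that fallback (again with made-up toy payoff numbers): if you assume every action of yours gets the opponent's best counter, you pick the action whose worst case is least bad.

```python
# Minimax fallback: assume the opponent has diagonalized you and will play the
# counter that hurts you most, then maximize over those worst cases.
# payoffs[my_action][their_counter] = my utility (arbitrary toy numbers)
payoffs = {
    "hide":  {"search": -1.0, "ambush": -1.0},
    "run":   {"search": -3.0, "ambush":  1.0},
    "bluff": {"search":  2.0, "ambush": -5.0},
}

def minimax_action(payoffs: dict) -> str:
    return max(payoffs, key=lambda action: min(payoffs[action].values()))

print(minimax_action(payoffs))  # -> "hide" (worst case -1.0 beats -3.0 and -5.0)
```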

However, the model above does suggest some natural weakenings of "strictly more competent" that create a bit more wiggle room.

In any realistic scenario, in order to do something akin to diagonalizing an opponent, you need to do the following:[3] 

  1. You need to get information about their internal workings to build a model of them
  2. You need to sample[4] that model to extract predictions about their behavior
  3. You need to identify parts of the model's input space that reliably produce the actions that you want, conditional on having observed your actions (a toy sketch of all three steps follows this list)
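
Here is a toy end-to-end sketch of those three steps (everything in it, from the "flattery"/"threat" inputs to the frequency-count model, is invented for illustration): gather observations, fit a crude model, sample it, and search the input space for an input that reliably produces the response you want.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# The target's true (hidden) policy: how it responds to each observable input.
def target_policy(observed_input: str) -> str:
    return {"flattery": "trust", "threat": "flee", "silence": "wait"}[observed_input]

# Step 1: gather information about the target and fit a crude model of it
# (here just conditional frequency counts over observed input/response pairs).
observations = [(inp, target_policy(inp))
                for inp in random.choices(["flattery", "threat", "silence"], k=100)]
model = defaultdict(Counter)
for inp, response in observations:
    model[inp][response] += 1

# Step 2: sample the model to predict the target's behavior for a given input.
def predict(inp: str) -> str:
    counts = model[inp]
    return random.choices(list(counts), weights=list(counts.values()))[0]

# Step 3: search the input space for an input that reliably produces the
# response the attacker wants ("trust"), judged by repeated sampling.
def reliability(inp: str, wanted: str, n: int = 50) -> float:
    return sum(predict(inp) == wanted for _ in range(n)) / n

best_input = max(model, key=lambda inp: reliability(inp, "trust"))
print(best_input)  # -> "flattery"
```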

And each of the three component strategies of paranoia I argued for in Paranoia: A Beginner's Guide addresses one of these:

1. By blinding yourself to information channels that are more easily controlled by an adversary, you force an opponent to search harder for inputs that produce the behaviors they want[5], making it harder to come up with reliably adversarial inputs (i.e. step 3)


2. By removing the adversarial agents from your environment, you make it harder for those adversaries to get information about you and to build a model of you in the first place (i.e. step 1)[6]


3. By making yourself erratic and unpredictable, you make yourself more costly to predict, usually requiring many more samples to get adequate bounds on your behavior (i.e. step 2; see the sketch below)
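
To see that third countermeasure at work in the earlier toy pipeline (same invented names; a rough sketch, not a precise claim about sample complexity): if the target mixes random behavior into its policy, the attacker's best-found input stops working reliably.

```python
import random

random.seed(1)

def erratic_target_policy(observed_input: str) -> str:
    # With probability 0.5 the target ignores the input and acts at random.
    if random.random() < 0.5:
        return random.choice(["trust", "flee", "wait"])
    return {"flattery": "trust", "threat": "flee", "silence": "wait"}[observed_input]

# Even with a perfect model of this *distribution*, the attacker's chosen input
# now only produces the wanted response about two-thirds of the time.
hits = sum(erratic_target_policy("flattery") == "trust" for _ in range(10_000))
print(hits / 10_000)  # ≈ 0.67 instead of 1.0
```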


Overall, the set of paranoid strategies in Paranoia: A Beginner's Guide was roughly the result of looking at each step in the process of simulating and diagonalizing against another agent and thinking about how to thwart it.

But how well does this toy model translate to reality? 

I think it's pretty messy. In particular, many strategies best suited to adversarial information environments rely on forming agreements and contracts with other agents that are not adversarial to you, e.g. to perform the role of auditors. The above model has no room for agents other than you and the bigger agent.

The strategies I list are also all focused on "how to make the enemy less good at hurting me" and not very focused on "how do I perform better after I have cut off the enemy (via the strategies of paranoia)". When thinking about strategies adaptive to adversarial environments, "learning how to think from first principles" is IMO basically the top one, but since the above model is framed in a zero-sum context, we can't speak much about upside outside of the conflict context.

But overall, I am still quite happy to get these models out. I have for years been warning others of "the risks of diagonalization" and been saying insane-sounding things like "I don't want to diagonalize against them too hard", and maybe now people will actually understand what I am saying without me having to start with a 20-minute lecture on set theory.


Postscript

Ok, but please, does anyone have a suggestion for a better term than "diagonalization"? 

Like, the key thing that all the alternatives I can think of lack is the flexibility of this word. It has all the different tenses and conjugations and flows nicely. "That's diagonalization", "He is diagonalizing you", "I am being diagonalized" are all valid constructions. Alternatives like "adversarial prediction" are both much more ambiguous and don't adapt to context that well. "That's adversarial prediction", "He is adversarially predicting you", "I am being adversarially predicted" sound awkward, especially the last one.

But IMO this is a really useful concept that I am hoping to build on more. I would like to be able to use it without needing to give a remedial set-theory class every time, so if anyone has a better name, I would greatly appreciate suggestions.

  1. ^

    Or an adversary with more time to spend on a conflict than you have. 

  2. ^

    "Cheaply" in the limit meaning "the stronger agent can do this for a weaker agent as many times as they like". This is of course quite extreme and runs into the limits of computability, but I at least for now don't know how to weaken it to make it more realistic.

  3. ^

    But "Habryka, stop!" you scream, as I justify one "list of three things that intuitively seem like the only options" with another "list of three things that intuitively seem like the only options", and you know, fair enough. But look man, our toy model in this situation really has many fewer moving parts, and I think the argument for why these are the only three things to do is more robust than the previous one. 

  4. ^

    You don't actually need to "sample" it, though it's of course the most natural thing to do. I can predict the outputs of programs without sampling from them, and similarly having formed a model of another agent, you can do things much more sophisticated than simply sampling trajectories. But for simplicity, let's talk about "sampling", and I think this shouldn't change any of the rest of the argument, though honestly I haven't checked that hard.

  5. ^

    Or, in practice, force them to take more costly actions to control a larger part of your input space. This however is outside the realm of the narrow semi-formal model I am proposing here, as we are not modeling actions as having costs. It probably wouldn't be too hard to properly add to the model, but I haven't tried it.

  6. ^

    As well as of course potentially eliminating the bigger agent altogether, which is not addressed in this model, as death is not part of our tiny little toy world.