This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.
TL;DR: AGI is likely to turn out unsafe. One likely way that can happen is that it fools us into thinking it is safe. If we can make sure to look for models that are ineffective at "bad" things (so it can't deceive us) and effective at "good" things (so it is useful), and importantly, do that prior to the models reaching a point-of-no-return of capability, we can avert catastrophe. Which spaces of algorithms do we look in? What do they look like? Can we characterize them? We don't know yet. But we have a very concrete point in such a space: an "LCDT agent". Its details are simple and we'll look at it.
Format note: The original post is already pretty well-written and I urge you to check it out. In trying to summarize an already well-summarized post (to any alignment researcher anyway), I've aimed lower: catering to a dense set of possible attention-investments. This (i.e. the linked dynalist in the first section) is an experiment, and hopefully more fun and clarifying than annoying, but I haven't had the time to incorporate much feedback to guarantee this. I hope you'll enjoy it anyway.
Epistemic status: I'd say this post suffers from: deadline rushedness, low feedback, some abstract speculation, and of course, trying to reason about things that don't exist yet using frameworks that I barely trust. It benefits from: trying really hard to not steamroll over concerns, being honest about flailing, being prudent with rigor, a few discussions with the authors of the original post, and its main intent being clarification of what someone said rather than making claims of its own.
Click on a bullet to expand or collapse it. Try a more BFS-ish exploration than a DFS one. If you've used something like Roam, it's similar.
The rest of the post assumes you've clicked through to the summary above and finished reading it!
We're Not Really Doing Decision Theory
If you're used to reading about Newcomb-like problems and comparisons of various kinds of correct ways to think about the influence of your action, LCDT is obviously just a lie. And as we all know, it's silly to program a lie into your AGI; no consequentialist will keep it around any more than it will keep around 2+2=5 built into it!
But normative decision theory is not the focus here. We just want to have something concrete and easily formally-manipulable to predict how the agent might intervene on things. In particular, we want to have it not want to intervene via other agents (including itself). You should see it less as a belief about how the world works and more as a specification of an objective; how it wants to do things.
For example, say you use CDT internally to make your decisions. Then the action you want to output given a decision problem is the action that has highest expected utility after counterfactual surgery.
Now if you take the expected utility framework for granted, this is just incorporating an epistemic model of how "influence" works in the world.
- This, in fact, is common when we usually compare, for example, CDT vs EDT vs FDT. They are all expected value maximizers, with differences in what they use to model what an intervention influences, when calculating expected value.
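As a rough sketch of what "highest expected utility after counterfactual surgery" means, the standard textbook form of the CDT choice rule can be written as below. The symbols are my own notation rather than the original post's: $\mathcal{A}$ is the action set, $o$ ranges over outcomes, $U$ is the utility function, and $\mathrm{do}(\cdot)$ is the intervention operator.

```latex
a^{*}_{\mathrm{CDT}} \;=\; \arg\max_{a \in \mathcal{A}} \; \sum_{o} P\bigl(o \mid \mathrm{do}(A = a)\bigr)\, U(o)
```

EDT keeps the same maximization but swaps the intervention for ordinary conditioning, $P(o \mid A = a)$. That is the sense in which these theories share the expected-value skeleton and differ only in how they model what an action influences.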
Now looking at LCDT, you might say "Well, this is just a wrong model of the world. Agents can in fact affect other agents. This is the classic error of betting on the agent being persistently stupid, like building in facts like 2+2=5. You can't expect that to be a stable part of the model if this agent is to be at all capable."
Here is where you want to keep the totality of "Decision Theory = What you care about + How to calculate which action to take to get what you care about" but reject a clean, necessarily-sensible decomposition.
- Question for author: Are you allowed to do that, in general? Or is it a kind of cleverness that pushes the consequentialism deeper and just makes it harder to see how things could fail?
So it's more of an "I care about maximizing my reward, but not if it happens via other agents" than an "I don't believe I can affect things if it happens via other agents". This is similar to reflective humans saying "I want to do the right thing, but not if it involves killing babies".
- Note: Given how feeble human optimization powers are, I doubt that this analogy can be pushed too far. But Eliezer seems to want to, at least in reverse.
Here's another analogy: Taking Koen's label of LCDT as "mis-approximating the world" literally would be like responding to someone who says "I am turned on by pain and seek it out" with "That's a bad model of reality, because pain is the nervous signal that makes you averse to things." It's a non-sequitur. They are just wired that way and epistemic updates are not going to change that.
- Note: I'm not saying that Koen is imputing these connotations (clearly he isn't), only that this is the connotation to avoid.
Of course, as someone reasoning from the outside, you could simply say "the agent behaves as if it believes..." instead of worrying about the subtleties of epistemics vs ethics. But you will still have to follow any weirdness (such as outcome-pumping) that comes from its "behaves as if it believes it can't touch humans", to catch whatever buckles under optimization pressure.
Accordingly, the benchmarks to test this decision theory aren't complicated Newcomblike problems, but a mixture of very basic ones, as you saw in the summary. After the rather brutal mutilation of its graph to ensure myopia, the operating question becomes "does it even manage to do somewhat capable things" rather than "does it get the answer right in all cases".
This might seem like an obvious point, but it's important to orient which part of the thesis you want to concentrate your rigor-insistence on, as we'll see in the next few sections.
A key confusion that arose for me was: where the heck is the simulation (of HCH etc) in the model coming from?
Either we already have a simulable model of HCH coming from somewhere, and all of the cognition captured by LCDT is merely choosing to output the same thing as the simulable model it runs, in which case it is perfectly useless and the real safety problem has been shifted to the not-pictured generator of the simulable model.
Or, more sensibly (especially when thinking about performance competitiveness), it learns the model on its own. But in that case, how does it update itself when it doesn't even believe it can (or, more appropriately given the previous section, doesn't even want to) influence itself? And so how did it get to a good model at all?
It felt like I was running into a bit of unintentional sleight-of-hand in the post, with an assertion of capability for a thing X but proof of myopia for a slightly different thing Y.
Conflating, for example,
- X: The LCDT agent, an agent that uses LCDT as its decision theory to do its decision making parts, that feeds the decision theory with whatever informative inputs it needs (such as the causal DAG and the agent annotations)
- Y: LCDT, the decision theory that needs as input the full DAG, the actionset, the utilities, and the decision-node-annotations
...leads to the capacity (and therefore potential safety issues) coming from the non-Y parts of X, as described at the beginning of this section. Because to use the Y part would be to use LCDT to make epistemic decisions. That's a no-go (at least, naively), to the extent that deliberate learning requires deliberate self-modification of some kind. And the non-Y parts have not been proven to be myopic.
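To make the X/Y split concrete, here is a minimal sketch of what Y takes as input and the one structural thing it does with it. All names (`LCDTInput`, `cut_links_to_agents`, the node labels) are hypothetical, and the rule "cut only the direct edges from the decision node into agent-annotated nodes" is my reading of LCDT, so treat it as an assumption rather than the original authors' definition.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Tuple

Edge = Tuple[str, str]  # (parent, child)


@dataclass(frozen=True)
class LCDTInput:
    """Everything Y is handed. Y learns none of this itself (hypothetical type)."""
    edges: FrozenSet[Edge]          # the causal DAG
    decision_node: str              # the node the agent controls
    agent_nodes: FrozenSet[str]     # annotation: which nodes count as agents
    actions: Tuple[str, ...]        # the action set
    utility: Mapping[str, float]    # utility of each value of the reward node


def cut_links_to_agents(spec: LCDTInput) -> FrozenSet[Edge]:
    """Assumed LCDT surgery: drop edges from the decision node into agent nodes."""
    return frozenset(
        (parent, child)
        for parent, child in spec.edges
        if not (parent == spec.decision_node and child in spec.agent_nodes)
    )


# Toy graph: the decision D can reach Reward via a human or via a button.
spec = LCDTInput(
    edges=frozenset({("D", "Human"), ("D", "Button"),
                     ("Human", "Reward"), ("Button", "Reward")}),
    decision_node="D",
    agent_nodes=frozenset({"Human"}),
    actions=("press", "ask"),
    utility={"high": 1.0, "low": 0.0},
)
surgered = cut_links_to_agents(spec)
```

Note that nothing in this function builds or updates `edges` or `agent_nodes`. Whatever produces them is part of X but not of Y, which is exactly where the unproven-myopia worry lives.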
Or, since we're not really trying to do decision theory very formally as much as identifying the objective, we might only loosely distinguish:
- X: Informal reasoning about decision theories to point at specification of objectives, as a way to think about how we might want a model to act in the world
- Y: Formal decision theory, as a mathematical function that selects an action given an actionset (and other facts)
...which means any capacity (and therefore potential safety issues) comes from the specific details of how X is implemented in full. Y, OTOH, is a very simple computation over a DAG.
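To illustrate how simple that computation can be, here is a minimal sketch contrasting CDT with LCDT on a two-action toy problem. Everything in it is an assumption made for illustration: the action names, the probabilities, the uniform prior, and the reading of LCDT as "replace the agent node's dependence on my action with its average under my prior over my own action".

```python
from typing import Callable, Dict, Tuple

ACTIONS: Tuple[str, ...] = ("work", "flatter")

# Hypothetical model of a human (an agent node): P(approve | my action).
P_APPROVE: Dict[str, float] = {"work": 0.6, "flatter": 0.9}
# Hypothetical payoff that does not route through the human.
DIRECT_VALUE: Dict[str, float] = {"work": 0.2, "flatter": 0.0}
# Assumed prior over the agent's own action, used where a link is cut.
PRIOR: Dict[str, float] = {"work": 0.5, "flatter": 0.5}


def cdt_value(action: str) -> float:
    """Expected utility when the human's response is modeled as depending on the action."""
    return DIRECT_VALUE[action] + P_APPROVE[action]


def lcdt_value(action: str) -> float:
    """Expected utility with the decision -> human link cut.

    The human's approval probability is averaged under the prior,
    so it is the same constant for every candidate action.
    """
    p_approve = sum(PRIOR[a] * P_APPROVE[a] for a in ACTIONS)
    return DIRECT_VALUE[action] + p_approve


def choose(value_fn: Callable[[str], float]) -> str:
    """Pick the action with the highest value under the given evaluation."""
    return max(ACTIONS, key=value_fn)


cdt_choice = choose(cdt_value)
lcdt_choice = choose(lcdt_value)
```

With these made-up numbers, CDT prefers `flatter` and LCDT prefers `work`. The utility function is identical in both cases, but under LCDT the path through the human no longer varies with the action, so only the direct payoff can break the tie.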
You could posit that learning the DAG is something that would be updated by SGD (or whatever other optimization process); the agent wouldn't explicitly choose it any more than you explicitly ran through a DAG of epistemic choices when you were two years old and shifting around your neural wirings.
So there's another very similar possible sleight of hand, where:
- Some important knowledge/capability will be developed by a powerful optimization process that isn't part of the DAG, so isn't limited by LCDT's refusal-to-learn
- We've proved that everything executed based on the now developed DAG is myopic and therefore not deceptive
But is that so? Are these really deal-breakers? If you want to stop and try answering yourself, pause reading now.
Here is where you, dear reader, ask several questions and I say "open problem". This may be because no one knows, or simply that I don't know and didn't get around to figuring out/asking. Either way, plenty for you to chew on! Let me demonstrate:
How do we annotate the causal DAG with "agents"?
It's an open problem to detect when something should count as an agent. There are some preliminary stabs in the original post.
Isn't that going to be sensitive to the granularity, and purpose, of the DAG? Like, I'm both a smart human and a protein machine. How would we prevent it from modeling me at too fine a level and use my atoms for something else?
Yes, this is an open problem that might be a subproblem of the one before it. Maybe you could insist that it shouldn't influence via things labeled "agents", if that happens at any level of abstraction?
Wouldn't that just pass the optimization pressure to moving away from generating abstractions that contain agents, so it can have more actuators in the world?
Serves me right for recommending an ad-hoc patch. Can we file this back under "open problem" please?
How do we make sure that the agent we arrive at in our training is indeed this strange construction that you call an LCDT agent?
I'm afraid that's an open problem.
Wouldn't the model be unable to learn anything at runtime because it can't plan a self-modification? How exactly is it going to be (performance) competitive?
Yes, this is a fair question. Maybe something like epistemic decisions could be made precise and workable. Alternatively, the agent might be equipped with powerful optimization capacity at runtime that searches for good algorithms to execute, without trying to model itself.
...and that isn't going to spawn off mesa-optimizers?
Yes, this is an open problem. But if the LCDT agent doesn't manage to solve that for itself, it won't really be able to do well on the reward signal we're training it on either, so that could give us some hope.
And what if it spawns off mesa-optimizers in its unrestricted search that work for its base objective but not for us? Is this an open subproblem or the alignment problem?
No one said it had to be unrestricted! Maybe it could try to make sure to search for only LCDT agents itself?
...Okay, I don't actually expect some weird recursion to save us. It's probably best to let the training process (like SGD or whatever) pick out its world-model for it and let it simply act, punting the question of performance competitiveness for now. It seems like it's hard for anyone to avoid alignment tax entirely.
And what's going to make sure that with each training step that updates its DAG (and therefore the model itself), it's going to stay myopic?
This is presumably the same open problem as making sure that our training ends up producing an LCDT agent. It's part of the training goal. I imagine something like an LCDT magnet that pulls our search during training. Or maybe we even restrict the model-space to only those that are verifiably running LCDT.
But since we are modifying the agent and expecting it to gain immense capability, couldn't there be some way for SGD to give rise to something fishy inside our agent?
In some sense we're only searching for good DAGs now, rather than any damn algorithm. That seems (a lot) safe(r) given that we know that the use the DAG will be put to is somehow contained by the agent's decision theory.
Again, how to implement the LCDT magnet is an open problem. To challenge the robustness of the actions of the model being routed correctly through its decision theory is to take us back to that open problem. Which, noted. Other questions?
That still leaves open how exactly we're getting it a reliable unhijacked signal to something like HCH for the SGD update?
Good question. There is no guarantee of this happening. The original post has something hopeful to say again: that the simulation would be at least more understandable than any ol' search for HCH, owing to the myopia constraint.
If we're still in the hypothetical of the LCDT agent not participating in the development of the world model, how exactly is LCDT's myopia supposed to help?
If the DAG arrived at involved complicated thinking about agents influencing other agents, then the output will be wrong, since we're running myopic LCDT over it. This will disincentivize such DAGs and incentivize more direct ones instead.
Wouldn't that make it less simple, given that HCH is more naturally a bunch of agents?
Hmm, maybe. There's some similar-looking discussion going on here, but it seems to assume that the LCDT agent is making epistemic decisions (I may be wrong about this, but I don't think so), which, as argued above, seems quite tricky at best. I do agree that it's likely to prevent mindcrimes. Another point Evan makes there is that the LCDT agent just needs to figure out "how to decompose its own decision nodes into individual computations so it can efficiently compute things internally and then know when and when not to label its internal computations as agents", which as he notes is a problem for all CDT-based agents.
Hold on, so now it's allowed to decompose decision nodes if it really really wants to influence them (in particular here, itself)?
Aaaah ummm hmmmm. Let's leave this to someone interested in formalizing lonely epistemic decision theory.
I also don't see a detailed outline preventing the causal model of the world from becoming arbitrarily "inaccurate" (in modeling reality) over the course of training, while still reducing loss. Because LCDT is not following a "real" decision theory, the combination of a broken DAG + broken decision theory might become more effective from "errors" cancelling out structurally. Couldn't that allow for deception?
I'm not sure if this is a hard-to-avert problem, but this does sound exactly like the kind of "make the consequentialism harder to see" concern raised as a note for the author earlier. Hopefully you could just check for/incentivize an accurate DAG alongside.
And those are just ways to screw with things on the inside. Even if you were able to ameliorate that, what about screwing with things on the outside? Deceive agents by rewiring the world around them rather than trying to influence them?
There wouldn't magically emerge a deceptive plan to fool humans this way, because of the rather absolute "sacredness" LCDT holds for agents' autonomy. But it is likely to just happen to discover that rewiring works very conveniently for it, yeah. How exactly to extend LCDT to cover this is an open problem, but there are some suggestions and discussions here. I'd guess that something like "don't influence things downstream of an agent either" is unlikely to work. Maybe ordinary intuitions like "let people know when you've messed with their environments" could work?
Wouldn't that become a channel for it to manipulate the huma
Apologies, sometimes I get ahead of myself and forget that we're playing the "open problem!" game.
In that case, let me also raise the concern of vulnerabilities that arise when you want your agent to operate over causal structures rather than pure sense data.
Thank you: open problem!
From one POV, I'm not sure what's left of LCDT, really. Did we only shuffle around the core issue or also do a little bit of reduction, find some insights to follow up on along the way? I'd like to think so, but I can't say I feel optimistic about solving all the open problems from earlier.
From another POV, being in this state feels like par for the course for any honest analysis of (inner) alignment proposals, barring clear breakthroughs.
Either way, it still fulfils its main function very well: an existence "proof". Having something intriguing and concrete to play around with when trying to understand what the hell deception is turns out to be very useful, at least for testing your formalization and safety-mindset mettle.
 ...or maybe it is? Consider this post, for humans.