(I wrote this in reply to a draft; apologies if the post has been load-bearingly updated since then.)
Consider a future AGI. The best argument I currently know of for why "general intelligence" is a thing at all (as opposed to all intelligence just boiling down to a bag of use-case-specific heuristics) is that search/optimization/world-modeling/planning/etc are naturally recursive; problems factor into subproblems, goals factor into subgoals. So, a natural general-purpose architecture for intelligence involves a "general intelligence module" which can take in a (sub)problem, and recursively pass (sub)subproblems to other instances of the general intelligence module.
Let's assume that general intelligence either does look vaguely like that, or at least can look vaguely like that. (This would be a potentially useful assumption to disagree with!)
Assuming that structure, consider how the different instances of the general intelligence module relate to each other. One module might be able to more efficiently achieve its own subgoal A by hacking/manipulating/overwriting another module's subgoal B; then there would be two modules working toward subgoal A. But a "general intelligence" made o...
I interpret you as saying:
“Let’s say my wool sweater is wet but I want to wear it outside when I leave in 10 minutes. So I have a goal (look nice tonight) and a subgoal (dry my sweater within 10 minutes). Some “module” in my brain is focused on the subgoal, and one thing that could happen is: it finds that there’s really only one way to accomplish the subgoal, and it’s to put the sweater in the dryer on the highest possible heat. It then notices that this will shrink the sweater to look terrible, which is a problem according to the “look nice tonight” goal, so the “module” “solves” that problem by simply deleting the “look nice tonight” goal! …This scenario doesn’t actually happen, so that calls for an explanation of why not, and that explanation (whatever it is) will be a solution to corrigibility.”
Assuming that’s what you meant: I think the reason the scenario doesn’t happen (in humans) is because “modules” is not as literal a thing as you make it out to be. A better analogy is, like, looking at a big painting. You can’t take in the whole thing at once (among other things, your vision is only sharp at the fovea), so you focus on one part, and then another part, etc. But whatev...
The sweater example is close but doesn't quite hit the nail on the head. It's not that the (dry sweater) planner would delete the (look nice) goal; that wouldn't help dry the sweater! Rather, the (dry sweater) planner would try to commandeer more mental resources, i.e. more planner-modules, steer more attention to drying the sweater. That additional attention potentially helps dry the sweater. But as an accidental side-effect, attention would be steered away from other goals, like e.g. (look nice). In short, the (dry sweater) planner-module can better dry the sweater by redirecting the (look nice) module to focus on drying the sweater instead.
... and that totally does happen in humans! Humans often get caught up in a specific subgoal, lose track of the broader goal which generated that subgoal in the first place, and end up optimizing for the subgoal in a way which doesn't help the original goal. It's the phenomenon of lost purposes, at an individual level.
(Likewise with the painting example: when looking at little patches of a large complex painting, people will totally lose track of context and overlook inconsistencies.)
It really doesn't seem like humans "keep their eye on the ball" all the time, even in the large majority of day-to-day cases where cognition basically works.
John's point applies even if you are doing subgoal solving serially. When trying to get food, I don't naturally consider plans like "shove my fingers down my throat to induce vomiting, so that I'm hungrier and thus buy more food" - even bulimics aren't inducing vomiting as a plot to get their future self to eat more, I don't think!
However, I will consider plans like "tie a string around my finger so I remember to buy food" or "set a reminder".
The difference of course is that I don't like vomiting and think it's unhealthy. When I turn my cognition towards solving the problem of "buy food", this time slice of me still remains 'aligned'. This is true even when I do complicated metareasoning about my memory, attention, and "external cognition" (by my phone) - I have no problem planning around my memory or attention or even thoughts like "I shouldn't open chats right now, because then I'll get sucked in and not buy food".
To be clear, this isn't about how good I am at planning around myself; it's about the fact I seriously consider such plans at all, but don't seriously consider vomiting. And as you say, in the moment you sometimes do have different values - but we still need to explain how you can be 'corrigible' to the rest of your problem solving process (I'm not tied to the frame of corrigibility).
A sentence I have found myself saying maybe 3-4 dozen times in the last year has been "I don't think corrigibility as an approach will scale to vastly superintelligent agents, we will have to figure out proper alignment/value loading at some point before then". Most recently someone asked me to explain why I believed this, but I noticed I did not have a great writeup available.
Now I do! And much more than that, when I read this post I found my own thinking on this topic get noticeably clearer and less confused, not just on the topic of corrigibility, but also on other topics of AI strategy, with little gems like:
This AGI would not face anything remotely analogous to the conundrum that humans don’t really know what they want for the long-term future, and are figuring it out, and when they say ‘manipulation is bad’ they are expressing some hard-to-pin-down preference about how this process of self-discovery plays out.
Which is a mini-paragraph that I feel like connects two things that I have been trying to connect for a while, namely "CEV as a necessary component of a good future", and "AI manipulation is a very fundamentally tricky problem".
Thanks a lot for writing this.
I think you're correctly identifying important issues and cracks in the standard ontology, but I think you're throwing out too much baby in an effort to get rid of bathwater.
For example, I do not think it's obvious that "Just like vitalistic force, 'wanting' is conceptualized as being acausal, i.e. an intrinsic property of an entity with no upstream cause." In control theory, we can say that a system controls for a thing based on a small collection of mathematical relationships -- pressuring an error signal towards zero. While the concept of wanting is overloaded and more complex, I think it makes sense to recognize that "X is controlling for Y" is a valid underpinning that has no vitalistic magic. We can ask what led X to control for Y, or how X controls for Y in terms that are closer to the underlying physics; there's nothing acausal or intrinsic (except for the definitions, I suppose).
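To make "X is controlling for Y" concrete, here is a minimal sketch of a proportional controller. The function name, gain, and numbers are all made up for illustration; the only point is that "pressuring an error signal towards zero" is an ordinary causal update rule, with nothing intrinsic or acausal in it.

```python
def run_controller(setpoint, initial, gain=0.5, steps=20):
    """Toy proportional controller (hypothetical names and values).

    Each step measures the error between the setpoint and the current
    state, then moves the state by a fixed fraction of that error.
    """
    state = initial
    errors = []
    for _ in range(steps):
        error = setpoint - state      # the error signal
        errors.append(abs(error))
        state += gain * error         # push the error towards zero
    return state, errors


final_state, errors = run_controller(setpoint=20.0, initial=5.0)
print(f"final state: {final_state:.4f}")
print(f"first error: {errors[0]:.4f}, last error: {errors[-1]:.6f}")
```

With a gain between 0 and 1 the error shrinks by a constant factor each step, so we can say the loop "controls for" the setpoint and still ask, in purely mechanical terms, what led it to do so.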
It’s possible for an intuitive human concept X to be a bundle of connotations that don’t add up to anything coherent, while ALSO there’s some similar concept Y that is mathematically well-defined and is capturing many (but not all) of those connotations. Then we can argue about whether X is “really” an imperfect pointer to Y, versus whether X and Y are different but related. But that’s a pointless argument with no answer.
(Fun example: Dan Dennett wrote a book advocating for free will compatibilism in 1984, and then in 2015 added a new preface saying: well actually on second thought, maybe I should have just said all along that we should abandon the term “free will” altogether.)
Anyway, I stand by my claim that acausality is an aspect of how most people intuitively think about wanting. Here’s an example … it’s possible that this tweet is bad-faith, but regardless, I think Marc wouldn’t have said it if it didn’t have some intuitive appeal:

So anyway, we can say that our intuitions around “wanting” have that incoherent aspect (per §2), and we can simultaneously ALSO say that there are well-defined notions of optimization (e.g. in §3.4 I cite Alex Flint’s) that overlap many aspects of th...
I think this problem is real, and it's not solvable in the limit. But it might be fairly easily solvable to a fairly satisfactory degree. Drawing the line between manipulation and help requires a judgment call. But that call can be made by humans. This should give decent results. We won't get the future we "most want" by whatever criteria, but we can get a future we like an awful lot, one we are pretty satisfied with both in anticipation, by our current criteria, and by our ultimate criteria.
I agree with the core argument and problem statement. To restate it briefly: there's no sharp line between manipulation and giving helpful information. This is a necessary result of humans not having well-defined goals, values, or preferences over the long term. All of those change based on circumstances, and decisions and learning along a particular path. We can't clearly distinguish manipulation, tricking me into doing what you want, from helpful information, getting me to do what I want, because what I want isn't defined.
I also agree that none of the approaches you mention provide a crisp solution or establish what human desires "really" are. They are contingent and path-dependent. One could pro...
I think this is overly committed to the hypothesis that the various concepts intuitively necessarily route through an imagined free will.
Take non-manipulation. I don't think I need to invoke non-causal free will, or even free will at all to describe that. In the deceptive case: H has some preferences. A has some preferences. A communicates falsehoods or partial truths which (predictably) cause H to have incorrect beliefs. H acts on those beliefs (in a way which furthers A's preferences).
Now, this also has the effect of making the activity in the world steer more for A's preferences than H's, at least locally. If you have an ontology where preferences arise from non-causal free will or whatever, then you might also say that A's free will subdued H's in that exchange, I guess.
Am I correctly engaging with what you're describing here?
Another example is @jacob_cannell in Empowerment is (almost) All We Need (2022). To his credit, he raises the issue that human desires are manipulable in the section “Potential Cartesian Objections”. But then he waves this issue away in a brief sentence: “These cartesian objections are future relevant, but ultimately they don't matter much for AI safety because powerful AI systems - even those of human-level intelligence - will likely need to overcome these problems regardless.”
Well .. no - looking back on that post, the 'potential cartesian objections' ar...
Path-dependence of values is defeated with aggregation over the possible paths that should have a say in what the values should be. Aggregation over many possibilities takes place in an updateless view from before those possibilities diverge. What kinds of possibilities should contribute to defining values is determined by values. And the possibilities should perhaps be shaped with the aid of aggregated values, to channel their counsel.
This sets up an analogy between CEV and updateless decision making, where the updateless core is working to define values,...
My understanding is that ARC Theory has thought some about this topic and has proposals for avoiding manipulation. I think their only public output on this is in an appendix of the ELK report 'Avoiding subtle manipulation'. I recommend reading '“Narrow” elicitation and why it might be sufficient' and 'Indirect normativity: defining a utility function' first for context.
I think @Lukas Finnveden and maybe @Eric Neyman thought about this more after this point. (My recollection is that at least @Lukas Finnveden updated toward thinking the problem was harder an...
To me the most promising solution is to get the AI to not optimize for influencing people's beliefs (3.4) except in certain permitted (often myopic) ways that depend on the situation. Some candidate guidelines:
On alternatives to brute consequentialism, I was intrigued by Grietzer's formulation of virtue as 'promote x x-ingly'. e.g. to be just is to promote justice justly, to be honest is to promote honesty honestly, ... See also. It's only a few pieces of a picture, but looked like a promising direction, to me.
Oh, hmm, good point, thanks. Let me try again:
When I think of humans who get difficult things done, or figure difficult things out, they tend to care about accomplishing those things, a lot, and in a direct and explicit way, not just e.g. as a facet of what kind of person they see themselves as. I mean, maybe “what kind of person I see myself as” has something to do with how they originally came to care about those things, but it’s not what they’re explicitly thinking about. They’re thinking directly about the object-level prize at the end of the journey, and how to get that prize.
E.g. plenty of climate change activists think of climate change activism as a good and virtuous thing to do, but I think the subset of climate change activists who are really moving the needle are the ones who are directly thinking about climate change being directly bad, and really want it to stop, and are focused directly on how to make that happen.
E.g. plenty of mathematicians think of math as a good and praiseworthy activity, but I think that the person who will solve the Riemann hypothesis will be a person who is (in addition to being smart etc.) really damn curious about why the Riemann hypothesis i...
...Yet another example is @abramdemski’s “Vingean agency” (2022) (following earlier work by Yudkowsky). He starts from a place very close to the human intuitions in §2 above, i.e. “agency” is when you can predict the outcome but not the actions leading to those outcomes. Then he hints at an intriguing idea that maybe we can just make that formal! I.e., maybe I was too quick to dismiss that kind of thing above using terms like “messed-up ontology”. (As Abram writes: “I also think it's possible that Vingean agency can be extended to be ‘the’ definition of agenc
...That suggests an approach where AIs simply learn to make a human-like gestalt assessment of what’s good vs bad (according to humans in general, and according to this particular human specifically), and then do the good things but not the bad things. Not manipulating people would, we imagine, naturally come along for the ride.
If we do something like that, I wouldn’t think of it as a solution to the True Name thing, but rather giving up on the True Name thing in favor of a different approach entirely, one relying more directly on (real or simulated) human ju
Max’s stopgap plan would involve comparing the human’s values to what they’d be if the AI did nothing. I think he understates how bad that stopgap plan is. Even providing straightforwardly-true factual information can change what a person wants, right?
Yes. That's right. And I am (among other things) worried about an AI that warps my values by telling me a series of facts.
But I want to clarify that I'm talking about terminal values, not strategic sub-goals. The stopgap plan is 100% able to tell me that the store is closed, thus changing my plan of going to ...
If engineering grade solutions to the problem are acceptable, a design for nanotech or biotech that vastly increases the amount of conventionally controllable matter/energy and doesn't come with a backdoor that instantiates another intelligence seems like straight empowerment.
The messy dilemma only comes up when elements of decision-making are outsourced to AI. Tooling that makes a specific baseline human comprehensible course of action possible in the first place only gives the AI control in that the AI can limit the performance of that tooling and so pred...
The virtue-ethics-y motivation just seems more squishy and slippery than the consequentialist desire, especially when it routes through manipulable human desires, such that I’m worried it will not be an adequate bulwark against ruthless consequentialism.
Preregistered my prior before evaluating it: I suspect that this is the crux of any disagreement I'll have with your article (from the point at which it was read; this comment is currently a midstage draft). (I was right this time!)
Most of the humans whom I've seen put forward as moral and ethical exemplars...
...But alas, I currently think the most important use-case for AGI is figuring out true important things about the world (esp. related to ASI alignment and strategy) and explaining those things to the human. For this process to be effective, we cannot have an AGI that’s unconcerned with what we wind up believing after the discussion—that’s a recipe for slop-and-doom, or just an AI that’s incomprehensible and unhelpful. Rather, I want the AI to be like a disagreeable nerd that wants us to have a good understanding, notices areas where we’re confused, and is br
...Our intuitive notion of being manipulated is related to a person (call him Ahmed) taking an action A with the property that, in our intuitive causal world-models, the chain-of-causation leading to A does not ultimately trace back to the acausal force of Ahmed’s free will, but rather to the free will of some third-party who manipulated Ahmed.
(For example, if Bob deceives me about what a button does, and then I press the button, then our intuitive conceptualization of the situation says that the button was pressed ultimately because of Bob’s acausal free wil
On the topic of distinguishing between Counsel and Manipulation, without going directly through whether the resulting beliefs are accurate or not (because that's adequately covered, and I think the observation that you can manipulate through selective truths somewhat defeats it: given someone who's already wrong or who has underspecified preferences, you can probably manipulate their actions while only making their beliefs more correct), I propose that one (and maybe the) distinction between Counsel and Manipulation is whether the outcome of the action depends on the person being spoken to.
That is: If the AI wants and manipulates for Pink Ponies, then it will tailor its arguments to make sure, no matter who it is speaking to, they will conclude the right answer is pink ponies. We do not need to know who the AI is talking to; maybe they hate horses. Maybe they're obsessed with manly colors. Regardless, we know the outcome will end up being pink ponies. The AI may need to simulate the target in detail to manipulate them, but the AI can predict where the conversation will go without doing the simulation.
If the AI is counselling, then it will not do this. It might tell the target true ...
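As a toy sketch of this test, assume each listener can be summarized as a dictionary of preference weights and each speaker as a function from a listener to the conclusion that listener ends up with. Both of those are simplifying assumptions invented for this example, as are all the names below.

```python
def manipulator(listener):
    # May tailor its arguments to the listener, but the endpoint
    # is fixed before the conversation starts.
    return "pink ponies"


def counselor(listener):
    # Supplies information; the conclusion is whatever best fits
    # the listener's own preferences.
    return max(listener, key=listener.get)


def outcome_depends_on_listener(speaker, listeners):
    """True if different listeners end up with different conclusions."""
    outcomes = {speaker(listener) for listener in listeners}
    return len(outcomes) > 1


# Hypothetical listeners with different tastes.
listeners = [
    {"pink ponies": 0.9, "grey trucks": 0.1},
    {"pink ponies": 0.2, "grey trucks": 0.8},
    {"pink ponies": 0.1, "grey trucks": 0.2, "black cats": 0.7},
]

print(outcome_depends_on_listener(manipulator, listeners))  # False
print(outcome_depends_on_listener(counselor, listeners))    # True
```

The check is only informative across listeners who actually differ: a counselor talking to a room of people who all love pink ponies would also produce a constant outcome.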
This was a fascinating read, which brought me back to my first year of Political Science and discussions on power and influence. This text resonates a lot with the 3 dimensions/faces of power by Steven Lukes (in Power: A Radical View):
This was excellently written and helpful. I am developing a series of posts on a framing problem which underlies current alignment research, and my point connects to what you are observing here. (First part here). I would be very interested in your input on those posts. That said, those posts are not intended to look directly at the "messed-up ontology" you describe concerning empowerment, agency, manipulation, corrigibility, and so on, and I want to comment on this, because I think you are making an important observation.
Free Will
The framework you are re...
Rather, I want the AI to be like a disagreeable nerd that wants us to have a good understanding, notices areas where we’re confused, and is brainstorming and strategizing on how to help set us straight by improving its clarity and pedagogy.
It seems like it is difficult to distinguish between the disagreeable nerd described above and the same exact type of person, but who's trying to get you to have a good understanding of a conspiracy theory or religion (ie. something that they believe to be true, but is not). The conspiracy theorist or religious true beli...
...More precisely: If there are deterministic upstream explanations of what the Active Self is doing and why, e.g. via algorithmic or other mechanisms happening under the hood, then that feels like a complete undermining of one’s free will and agency. And if there are probabilistic upstream explanations of what the Active Self is doing and why, e.g. “if my stomach is empty, then I’ll start wanting food”, then that correspondingly feels like a partial undermining of free will and agency, in proportion to how confident those predictions are. For example, I migh
I totally agree that everything flounders on the messed up idea of free will. My "solution" is to abandon alignment etc. altogether and instead focus on limiting the possible damage: I don't see how, with messed up ontology, we can ever guarantee that, given enough time, an ASI won't tile the universe with hedonium - so why not focus on restricting the time? I'm pretty sure the ASI can't do it in only a few years, so if we can set it a time limit of, say, 5 years, and have it not value anything beyond that window, then we "could" be safe. This fits in with the cur...
I think the idea of "preserving human agency" in a world with AGI is even less tractable than you're gesturing at, but in a structural way. Imagine having a trusted translator navigate a bus full of strangers on a trip in Italy. I don't think there's a "true" trip that captures all reflexively preferred destinations that can be "discovered" (considering the country is large, and your trip is very short). Now imagine the translator reading all of the passengers' minds. There's a lot of path dependence in what the translator prefers and thinks the group valu...
So why can’t “agency” be tied to imperfect modeling ability too?
If we apply an analogy with temperature, then how similar is free will to an LLM's temperature causing the LLM to output slightly different token sequences which, however, convey a similar meaning? How plausible is it that similar analogies improve our understanding of humans' agency, empowerment and other important concepts?
1.1 Tl;dr
Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it helps us brainstorm how we might put that distinction into AI.
…But (spoiler alert) it turns out not to really help, because I’ll argue that we humans think about it in a deeply incoherent way, intimately tied to our scientifically-inaccurate intuitions around free will.
I jump from there into a broader review of every approach that I can think of for writing a “True Name” for manipulation or things related to it (empowerment, agency, corrigibility, culpability, etc.), or indeed for any other method of robustly getting future AGIs to be able to talk to people without trying to manipulate those people’s desires. I argue that none of them provides much of a path forward on the particular technical alignment problem I’m working on. Indeed, my current guess is that none of these things have a “True Name” at all, or at least not one that’s useful for the technical alignment problem.
1.2 Bigger-picture context: why is this issue so important to me?
I’ve been investigating brain-like-AGI safety plans that would involve making AGI with a motivation system loosely inspired by the prosocial aspects of human motivation. To oversimplify a bit, this kind of motivation system would include an impersonal-consequentialist aspect (related to what I call “Sympathy Reward”) that leads to wanting humans (and perhaps animals etc.) to feel more pleasure and less displeasure. But by itself, this part would make a funny kind of ruthless sociopath ASI that bliss-maxxes by, say, strapping everyone to tables on heroin drips. Or maybe it would just kill us all and tile the universe with hedonium. Granted, bliss-maxxing is not the worst possible future, as these things go. But we should aim higher!
So then the second ingredient in the motivation system would be a kinda virtue-ethics-y thing, related to what I call “Approval Reward”, which has more relation to pride, self-image, respecting other people’s preferences, and proudly internalizing social norms.
Alas, my current thinking is a bit akin to the “Nearest Unblocked Strategy” problem. If we put both those things together—a consequentialist desire plus a suite of virtue-ethics-y motivations—I’m worried that the consequentialist desire will eventually “win”. For example, if the AGI wants to eventually get to hedonium, and the AGI also wants to follow societal norms, it might find its way to hedonium via a more gradual route, one that involves gradually and unintentionally, but inexorably, changing societal norms in the direction of hedonium.[1] The virtue-ethics-y motivation just seems more squishy and slippery than the consequentialist desire, especially when it routes through manipulable human desires, such that I’m worried it will not be an adequate bulwark against ruthless consequentialism.
…Or maybe it would be fine? I’m not sure. But I’m very much on the hunt for some different or complementary approach to AI motivation, one that I can reason about more easily and have more confidence in.
So in that context, it would be nice to pin down some notion of “manipulation”, “respect for preferences”, and related notions, in a well-defined way that’s robust to specification-gaming and especially ontological crises.
For related discussion, see @johnswentworth’s discussion of “True Names” at “Why Agent Foundations? An Overly Abstract Explanation” (2022), or my own “Perils of under- vs over-sculpting AGI desires” (2025), specifically §8.2.2: “The hope of pinning down non-fuzzy concepts for the AGI to desire”.
2. How do humans intuitively define empowerment, agency, manipulation, etc.?
2.1 Background: human “free will” intuitions
Here’s a modified excerpt from my Intuitive Self Models (ISM) series, summarizing a few key points from ISM Post 3: The Active Self:
More precisely: If there are deterministic upstream explanations of what the Active Self is doing and why, e.g. via algorithmic or other mechanisms happening under the hood, then that feels like a complete undermining of one’s free will and agency. And if there are probabilistic upstream explanations of what the Active Self is doing and why, e.g. “if my stomach is empty, then I’ll start wanting food”, then that correspondingly feels like a partial undermining of free will and agency, in proportion to how confident those predictions are. For example, I might see myself as being somewhat “puppeteered” by the ghrelin hormone that my empty stomach is pumping into my bloodstream.
…Needless to say, this whole intuitive ontology is pretty messed up, in the sense that nothing in it is a veridical, observer-independent accounting of what is happening in the real world (ISM §3.3.3). And indeed, it’s somewhat specific to mainstream western culture (ISM §3.2). Outside of “mainstream western culture”, we find that other intuitive ontologies also exist; I won’t discuss them in this post since I don’t understand them very well, but I’m currently pessimistic that they will help solve my AI-alignment-related problems.[2]
2.2 Our free-will-infused intuitive notions of empowerment, agency, manipulation, corrigibility, responsibility, etc.
I think our common-sense notions of empowerment, agency, manipulation, corrigibility, and so on are intimately tied with this free-will-related intuitive ontology. In particular, I claim:
Our intuitive notion of empowerment is related to someone's acausal free will being able to accomplish whatever it wants to accomplish. Our intuitive notion of agency (in the context of e.g. “AI will enhance human agency”) is pretty similar.
Our intuitive notion of being manipulated is related to a person (call him Ahmed) taking an action A with the property that, in our intuitive causal world-models, the chain-of-causation leading to A does not ultimately trace back to the acausal force of Ahmed’s free will, but rather to the free will of some third-party who manipulated Ahmed.
(For example, if Bob deceives me about what a button does, and then I press the button, then our intuitive conceptualization of the situation says that the button was pressed ultimately because of Bob’s acausal free will working towards Bob’s desires, not because of my acausal free will working towards my desires. I was an instrument to Bob.)
Our intuitive notions of corrigibility, helpfulness, and obedience each have their own nuances, but they all substantially overlap with the above ideas: they connote increasing a supervisor’s empowerment and agency, and decreasing the amount that the supervisor gets manipulated. In other words: they suggest that important things are happening more as a result of the supervisor’s free will doing what it wants, and less as a result of other people’s (or AIs’) free wills doing what they want through the supervisor’s own actions.
For example, if a human wants to shut down an AI, the AI could prevent that by disabling the shutdown button, or the AI could prevent that by using its silver tongue to convince the human to not want to shut it down. Both of these would be contrary to what people normally mean by “corrigibility”, and in the latter case we conceptualize that as an undermining of the supervisor’s free will.
Our intuitive notions of culpability and responsibility, as in “Joe is responsible for the failure”, involve tracing back the chain-of-causation to see whose acausal force of free will it ultimately traces back to. This is kinda the flip side of manipulation (above): if I trick someone into unknowingly robbing a bank, or brainwash them into wanting to rob the bank, I would be at least partly and maybe fully responsible for the bank-robbing, because the bank got robbed ultimately because of my acausal free will, which wanted the bank to be robbed.
2.3 Another dimension: “counsel” vs “manipulation” as an emotive conjugation
There’s another dimension to how we intuitively think about these concepts: the dimension of positive or negative vibes. For example, if some kind of interaction seems good,[3] then we’re more likely to call it “providing counsel”, and if it seems bad, then we’re more likely to call it “an attempt to manipulate me”. The vibe is important in itself, over and above any particular aspect of the interaction.
I don’t think this dimension is separate from the “free will” discussion above, but rather complementary and compatible, because in general, if I have a motivation I’m happy about, I’ll tend to conceptualize it as an ego-syntonic component of my free will, while if I have a motivation I’m unhappy about, I'll tend to conceptualize it as an ego-dystonic urge undermining my free will. See ISM §3.5.4 for details.
3. If the intuitive definitions of “manipulation” etc. reside in a messed-up ontology, has the alignment literature found any alternative, better way to define these concepts?
By analogy, I think intuitive physics is a messed-up ontology in certain (far more minor) ways, and yet many intuitive physics concepts can be (imperfectly) mapped to rigorously-definable concepts in real physics. Can we find something like that for “manipulation”, “empowerment”, and so on, and then build those concepts into AI motivations?
Alas, as far as I can tell, that’s an unsolved problem, and might not have a solution at all. Here’s a brief lit review:
3.1 Compare what the human wants to what the human would want under the null policy?
First, @Max Harms in “Formal Faux Corrigibility” (2024) acknowledges that he doesn’t know how to formally define a distinction between counsel (good) vs manipulation (bad), and suggests as a stopgap to simply penalize the AI for doing either. (“This seems like a bad choice in that it discourages the AI from taking actions which help us update in ways that we reflectively desire, even when those actions are as benign as talking about the history of philosophy. Alas, I don’t currently know of a better formalism. Additional work is surely needed in developing a good measure of the kind of value modification that we don’t like while still leaving room for the kind of growth and updating that we do like.”)
Max’s stopgap plan would involve comparing the human’s values to what they’d be if the AI did nothing. I think he understates how bad that stopgap plan is. Even providing straightforwardly-true factual information can change what a person wants, right? (Update: maybe that’s an overstatement, see Max’s reply in the comments.)
Alternatively, one could take as a baseline what the human would eventually figure out on their own, given infinite time and good circumstances under which to reflect. I.e., we could say that an AI is “manipulating” if they’re pushing the person away from the conclusions of their imagined idealized copy with infinite time, and the AI is “providing counsel” if they’re pushing the person towards that. I have some concerns,[4] but yeah sure, that seems worth considering. Alas, it doesn’t solve my problem, because I have no idea what reward function, training environment, etc., could directly lead to a brain-like AGI with that (rather abstract) motivation.
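To make the contrast between the two baselines in this section concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: a person's "values" are reduced to a point in a 2-D space, and the function names `null_policy_penalty` and `reflection_label` are hypothetical, not taken from Max's post or anywhere else. In particular, the sketch simply assumes that the idealized-reflection endpoint is known, which is exactly the part that nobody knows how to obtain.

```python
import math

def null_policy_penalty(values_after_ai, values_after_null):
    """Stopgap measure: penalize ANY shift relative to the do-nothing baseline."""
    return math.dist(values_after_ai, values_after_null)

def reflection_label(values_before, values_after_ai, values_idealized):
    """Alternative baseline: did the AI move the person toward or away from
    where idealized reflection would have taken them?"""
    before = math.dist(values_before, values_idealized)
    after = math.dist(values_after_ai, values_idealized)
    if after < before:
        return "counsel"
    if after > before:
        return "manipulation"
    return "neutral"

# Toy numbers, all assumed for illustration.
values_before = (0.0, 0.0)
values_after_null = (0.0, 0.0)   # assumed: no drift if the AI does nothing
values_idealized = (1.0, 0.0)    # assumed known, which is the hard part
after_true_fact = (0.5, 0.0)     # a true fact nudges the person toward the ideal
after_brainwash = (-0.5, 0.0)    # brainwashing pushes them away from it

penalty_true_fact = null_policy_penalty(after_true_fact, values_after_null)
penalty_brainwash = null_policy_penalty(after_brainwash, values_after_null)
label_true_fact = reflection_label(values_before, after_true_fact, values_idealized)
label_brainwash = reflection_label(values_before, after_brainwash, values_idealized)

print(penalty_true_fact, penalty_brainwash)  # identical penalties
print(label_true_fact, label_brainwash)      # different labels
```

In this toy, the null-policy penalty cannot tell the true fact apart from the brainwashing, because both shift the person by the same amount. The reflection baseline can tell them apart, but only because `values_idealized` was handed to it for free.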
3.2 The AI learns self-empowerment and generalizes to other-empowerment?
Another example is @jacob_cannell in Empowerment is (almost) All We Need (2022). To his credit, he raises the issue that human desires are manipulable in the section “Potential Cartesian Objections”. But then he waves this issue away in a brief sentence: “These cartesian objections are future relevant, but ultimately they don't matter much for AI safety because powerful AI systems - even those of human-level intelligence - will likely need to overcome these problems regardless.”
I think Jacob is suggesting that the AGI will autonomously develop a robust notion of self-empowerment, including “what it means for me (the AGI) to not get manipulated”, and then it can (somehow?) transfer that notion to humans.
If so, I’m skeptical. The main failure mode that I expect is “ruthless consequentialist AGI”, and this story really doesn’t apply there. If the AGI wants there to be paperclips, then it will instrumentally want to avoid getting ‘manipulated’, in the trivial sense that if it stops wanting paperclips then there will be fewer paperclips. This AGI would not face anything remotely analogous to the conundrum that humans don’t really know what they want for the long-term future, and are figuring it out, and when they say ‘manipulation is bad’ they are expressing some hard-to-pin-down preference about how this process of self-discovery plays out. Compare that to the AGI, which does not want self-discovery; it just wants paperclips. See also: §0.3 of my post “6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa”.
(Update: See Jacob’s reply in the comments.)
3.3 “Vingean agency”?
Yet another example is @abramdemski’s “Vingean agency” (2022) (following earlier work by Yudkowsky). He starts from a place very close to the human intuitions in §2 above, i.e. “agency” is when you can predict the outcomes but not the actions leading to them. Then he hints at an intriguing idea that maybe we can just make that formal! I.e., maybe I was too quick to dismiss that kind of thing above using terms like “messed-up ontology”. (As Abram writes: “I also think it's possible that Vingean agency can be extended to be ‘the’ definition of agency, if we think that agency is just Vingean agency from some perspective….”)
By analogy, to borrow an example from @johnswentworth, thermodynamics concepts like “temperature” are tied to imperfect modeling ability (since an omniscient observer would instead track the velocity of every particle). So why can’t “agency” be tied to imperfect modeling ability too?
But alas, even if we can rigorously define Vingean agency, I don’t think it would really help with the problem I want it to solve here, i.e. pinning down a distinction between good “counsel” vs bad “manipulation”. Vingean agency seems to solve the problem of identifying an agent trying to do something, by noticing easier-to-predict ends happening by harder-to-predict means. But the “manipulation” concept worries about the possibility of intervention upstream of a person’s ego-syntonic desires. If the AI can brainwash me into deeply wanting to maximize paperclips, and then I execute a clever plan to maximize paperclips, then I would still be a Vingean agent, as long as my clever plan was sufficiently clever (from some perspective). So the brainwashing would strip me of my intuitive agency, but not my Vingean agency.
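Here is a toy sketch of that last point. It is not Abram's formalism. It uses a crude proxy that I am making up for illustration (the hypothetical `vingean_gap`: entropy of the observed action sequences minus entropy of the observed outcomes), and a hypothetical grid-walking agent whose goal can be overwritten. The only thing it is meant to show is that an "easy-to-predict ends, hard-to-predict means" test looks the same before and after the goal is swapped out.

```python
import math
import random
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def plan(goal, rng):
    """Toy agent: reach goal (gx, gy) from (0, 0) using a random ordering
    of the required Right/Up moves. The route varies; the endpoint does not."""
    gx, gy = goal
    moves = ["R"] * gx + ["U"] * gy
    rng.shuffle(moves)
    return tuple(moves)

def endpoint(moves):
    """Final grid cell reached by a sequence of R/U moves."""
    return (moves.count("R"), moves.count("U"))

def vingean_gap(goal, rng, trials=500):
    """Hypothetical proxy: how much harder the actions are to predict
    than the outcome, from the observer's point of view."""
    plans = [plan(goal, rng) for _ in range(trials)]
    action_entropy = entropy(plans)
    outcome_entropy = entropy([endpoint(p) for p in plans])
    return action_entropy - outcome_entropy

rng = random.Random(0)
original_goal = (3, 3)       # what the person wanted before
brainwashed_goal = (2, 4)    # goal overwritten by someone else (assumed)

gap_original = vingean_gap(original_goal, rng)
gap_brainwashed = vingean_gap(brainwashed_goal, rng)
print(gap_original, gap_brainwashed)  # both clearly positive
```

Both gaps come out clearly positive, so the proxy calls the agent "Vingean" in both cases. Nothing in the measure looks upstream at where the goal came from, which is where the intuitive "manipulation" concept lives.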
3.4 The AI doesn’t care about (is not optimizing for) what the human winds up wanting?
Another potential approach would be to define optimization more broadly (e.g. “The Ground of Optimization”, @Alex Flint 2020), and ask whether there’s optimization in the AI towards what the human winds up deciding or wanting. The idea would be: we want the AI to provide us with relevant information, but to have no opinion either way about what we ultimately wind up wanting. We might wind up changing our desires as a result of the information, but (the story goes) it’s better that the information was not optimized to make us change our desires in a specific way.
This approach aligns pretty well with the human intuitions in §2.2 above, and more generally has a lot going for it! But alas, I currently think the most important use-case for AGI is figuring out true important things about the world (esp. related to ASI alignment and strategy) and explaining those things to the human. For this process to be effective, we cannot have an AGI that’s unconcerned with what we wind up believing after the discussion—that’s a recipe for slop-and-doom, or just an AI that’s incomprehensible and unhelpful. Rather, I want the AI to be like a disagreeable nerd that wants us to have a good understanding, notices areas where we’re confused, and is brainstorming and strategizing on how to help set us straight by improving its clarity and pedagogy. This strategizing is clearly a form of optimization, and the target of the optimization is related to the human’s eventual desires (well, it’s nominally about the human’s beliefs, but beliefs and desires are entangled), and I really think we need this kind of optimization to survive the transition to ASI.
In other words, I don’t think a brain-like AGI can successfully explain something novel and unintuitive to somebody, without caring whether the person winds up understanding it.
So this plan is out too.
3.5 Impact minimization?
Next idea: Perhaps we could rely on some notion of impact-minimization (1,2), on the grounds that changing a human’s goals has unusually large downstream impacts? For example, I would put “Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals” (2025) by @johnswentworth and @David Lorell into this general category.
But alas, that can’t distinguish good counsel from bad manipulation, since both affect the human’s goals. As mentioned in §3.1 above, even telling a person straightforward true facts can change what they’re trying to do, in a high-impact way.
3.6 Attainable utility preservation?
“Attainable Utility Preservation” and related ideas seem to all be rooted in the messed-up ontology where agents are free to choose what to do, instead of their decisions themselves having upstream causes. So it doesn’t seem to help me here.
4. Even more ideas (that don’t really solve my problem)
That’s all I can think of that’s directly in the alignment literature, but let’s keep brainstorming!
4.1 Game theory and incentive design?
At least some of the social intuitions under discussion can be justified in a framework of game-theoretic equilibria. For example, our concept “culpability” overlaps with “a system of punishment which will set up incentives such that the end-result is overall good”. Alas, game theory tends to take for granted that people have terminal goals, and doesn’t seem to offer a useful framework for thinking about people changing each other’s terminal goals in good ways versus bad.
4.2 The person’s judgments of what kinds of interactions are good vs bad?
In §2.3, I mentioned that a big part of how we think about “counsel vs manipulation” is simply a gestalt feeling that some interaction is good vs bad.
That suggests an approach where AIs simply learn to make a human-like gestalt assessment of what’s good vs bad (according to humans in general, and according to this particular human specifically), and then do the good things but not the bad things. Not manipulating people would, we imagine, naturally come along for the ride.
If we do something like that, I wouldn’t think of it as a solution to the True Name thing, but rather giving up on the True Name thing in favor of a different approach entirely, one relying more directly on (real or simulated) human judgment. See e.g. “Act-based approval-directed agents”, for IDA skeptics.
This kind of approach will probably sound like the very obvious solution for readers who work on LLMs. No comment on LLMs, but for the problem I’m working on (brain-like AGI), it just brings me right back to where I started in §1.2: if we’re learning what’s good by the gestalt of human judgment and culture, and if human judgment and culture can themselves be gradually shifted over time, then this might not be an adequate bulwark against the AGI’s consequentialist desires. (And I do think we need the AGI to have some consequentialist desires.)
4.3 “It’s a messed-up ontology, but who cares?”
I care! The problem I see is: we should generically expect AGI (and even more, ASI) to eventually wind up with true beliefs, and with concepts that closely track the world as it really is. And its desires will be connected to those concepts seeming good or bad.
Basically, the better you’re able to model someone, the less coherent is the idea that they are expressing their agency, that they’re empowered, that you are or aren’t manipulating them, etc. Why? Because their decisions, and even their deepest, truest desires, really are downstream of their manipulable environment, situation, biology, etc.
By analogy, when you’re writing traditional UI code or balancing a pile of rocks, there isn’t really any notion of “letting the system self-actualize” or whatever. You can choose not to think about what the consequences of your coding or rock-balancing activities will be, but that’s different (see §3.4 above). And I suspect that increasingly-competent AGIs will increasingly see humans in a similar manner: they, including their “free will”, are just another real-world system that gets pushed around by circumstance, and which will predictably respond to interventions like anything else.
5. …But doesn’t this analysis equally “disprove” the possibility of human helpfulness?
And yet! Humans can be robustly helpful, right? Can we be inspired by that?
Well, one hopeful proposal would be to say: we humans are still generally using the “messed-up ontology” containing free will intuitions! Even while some of us intellectually acknowledge that the ontology is messed-up … we keep using it anyway! And gee, look at all the stuff we humans have gotten done, in terms of science, technology, governance, philosophy, etc. Maybe a “baby AGI” will develop free will intuitions for the same reasons we do, and could likewise get quite far within the messed-up ontology, without having any issues. Maybe it could get far enough to end the “acute risk period”.
More broadly, it’s unclear to me how bad wacky intuitions really are. In ISM §1.3.2.1, I bring up the example of how the moon seems to follow me at night, which implies that my intuitive visual world-model has the moon situated at the wrong distance from Earth. But who cares? That “error” doesn’t prevent me from doing anything important. I could even go work at NASA, and optimize lunar trajectories by day while watching the moon seem to follow me by night.
How does that work? Well, in the moon case, if I were optimizing lunar trajectories, that activity would be almost completely divorced from my intuitive visual moon model; I would instead be relying on intuitions developed from physics education, from pen-and-paper diagrams, from other simulated trajectories I’ve seen, and so on.
However, if we map that “solution” onto the AGI situation, it seems to bode ill; it suggests that as the AGI’s sophistication in modeling humans increases, it will be more and more divorced from its faulty free-will-related intuitive models. But those latter models are where the “manipulation” concept lives. So in this scenario, I don’t think we should expect the “manipulation” concept to effectively constrain the AGI’s planning process.
Well, let’s go back to humans again. It’s possible for humans to develop good predictive models of how to impact other people’s ego-syntonic desires. Then what? Well, by and large, they take full advantage, while conceptualizing their actions as being on the good side (“counsel”, “inspiration”, “charismatic leadership”, etc.), rather than the bad side (“manipulation”), of the relevant emotive conjugation. Thus we see books with titles like How to Win Friends and Influence People, not How to Manipulate People into Liking You and Furthering Your Agenda.
If we again map that “solution” onto the AGI situation, it again bodes ill; it suggests that an AGI’s “desire not to manipulate people” will be no constraint at all. If an AGI has a desire to follow norms, and also a desire not to manipulate people, but also a consequentialist desire to maximize paperclips, then it would gradually manipulate people into shifting norms in the direction of paperclip maximization, while telling itself that it’s not “manipulating” but rather “providing helpful counsel”.
Another human-inspired approach would be to try to dodge the issue altogether, by making the AGI incompetent at manipulating even if it wants to—just as a human can be a crack engineer but socially clueless. I have ideas about how this might work, but making them work stably and robustly seems awfully hard. A competent ASI will figure things out. You can dam a river, but eventually it will find its way to the sea.
Finally, there’s a deeper and more philosophical issue: if the intuitive way that we think about avoiding-manipulation etc. is part of a messed-up ontology, then … why am I taking it for granted that this is a good thing for me to want (for humans and/or for AGIs) in the first place? Shouldn’t I, y’know, want sensible things, rather than wanting confused nonsense things??
I sometimes say, “Luckily, we humans are not sufficiently good at philosophy to go insane.” It’s kinda a joke, but it’s also kinda not a joke. The old @Wei Dai post “Ontological Crisis in Humans” (2012) discusses (but does not answer) this question. (And of course, some people do go insane!) I have some takes, but their upshot seems to be kinda “it all adds up to normality”, so I’ll push that off to a (hopefully) future post.
6. Conclusion
My current guess is that none of these alignment-relevant concepts—empowerment, agency, being manipulated, corrigibility, helpfulness, obedience, culpability, responsibility—have any “True Names”, or at least, not ones that will be useful in practice for AI alignment.
So I guess I need to keep exploring other approaches, including approaches that I currently find harder to reason about.
Thanks Seth Herd for critical comments on an earlier draft.
[1] You might be wondering: “Wouldn’t this argument apply to humans too? You just said the plan is inspired by human motivation systems. Yet humans don’t bliss-maxx.” …But actually, I’m thinking, maybe it’s not so crazy to bite that bullet?? See my brief earlier discussion under the heading “The arc of progress is long, but it bends towards reward hacking”.
[2] For example, advanced meditators lack an Active Self intuitive concept (ISM §6), but I find that their replacement intuitive ontology tends to be equally messed-up, just in different ways (ISM §6.2.1). As another example, in The WEIRDest People in the World (2020), Henrich argues that non-WEIRD (Western, Educated, Industrialized, Rich, Democratic) cultures tend to have rather different intuitions related to “free will”, “responsibility”, etc., compared to WEIRD people. I, being an especially WEIRD-psychology guy even by WEIRD-country standards, struggle to understand these non-WEIRD perspectives. But from what little I understand, they don’t seem to offer a path forward on the technical alignment problem that I’m working on. Please comment if you think I’m missing something here.
[3] The judgment of good or bad should ideally be a prospective judgment, not a judgment in hindsight. E.g. a brainwashed person would by definition be very happy (in hindsight) to have been brainwashed.
[4] Off the top of my head: Is the “result of reflection” well-defined? (See Joe Carlsmith on “idealized values”.) E.g. would the person go crazy given literal infinite time, and if so, what do we do instead? If it had a well-defined result, would we be happy about that result? E.g. for what fraction of the population would ideal reflection converge to true beliefs etc.? Wouldn’t such an AGI be non-corrigible right now, and if so, how big a problem is that? Should we think of this kind of approach as “a way to define these funny terms like ‘manipulation’ and ‘empowerment’”, or should we think of it as “an entirely different kind of alignment target, closer to ambitious value learning”? (These questions and more are not rhetorical; I didn’t think about it much.)