Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

Steven Byrnes

1.1 Tl;dr

Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).

The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).

In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it helps us brainstorm how we might put that distinction into AI.

…But (spoiler alert) it turns out not to really help, because I’ll argue that we humans think about it in a deeply incoherent way, intimately tied to our scientifically-inaccurate intuitions around free will.

I jump from there into a broader review of every approach that I can think of for writing a “True Name” for manipulation or things related to it (empowerment, agency, corrigibility, culpability, etc.), or indeed for any other method of robustly getting future AGIs to be able to talk to people without trying to manipulate those people’s desires. I argue that none of them provides much of a path forward on the particular technical alignment problem I’m working on. Indeed, my current guess is that none of these things have a “True Name” at all, or at least not one that’s useful for the technical alignment problem.

1.2 Bigger-picture context: why is this issue so important to me?

I’ve been investigating brain-like-AGI safety plans that would involve making AGI with a motivation system loosely inspired by the prosocial aspects of human motivation. To oversimplify a bit, this kind of motivation system would include an impersonal-consequentialist aspect (related to what I call “Sympathy Reward”) that leads to wanting humans (and perhaps animals etc.) to feel more pleasure and less displeasure. But by itself, this part would make a funny kind of ruthless sociopath ASI that bliss-maxxes by, say, strapping everyone to tables on heroin drips. Or maybe it would just kill us all and tile the universe with hedonium. Granted, bliss-maxxing is not the worst possible future, as these things go. But we should aim higher!

So then the second ingredient in the motivation system would be a kinda virtue-ethics-y thing, related to what I call “Approval Reward”, which has more relation to pride, self-image, respecting other people’s preferences, and proudly internalizing social norms.

Alas, my current thinking is a bit akin to the “Nearest Unblocked Strategy” problem. If we put both those things together—a consequentialist desire plus a suite of virtue-ethics-y motivations—I’m worried that the consequentialist desire will eventually “win”. For example, if the AGI wants to eventually get to hedonium, and the AGI also wants to follow societal norms, it might find its way to hedonium via a more gradual route, one that involves gradually and unintentionally, but inexorably, changing societal norms in the direction of hedonium.^[1] The virtue-ethics-y motivation just seems more squishy and slippery than the consequentialist desire, especially when it routes through manipulable human desires, such that I’m worried it will not be an adequate bulwark against ruthless consequentialism.

…Or maybe it would be fine? I’m not sure. But I’m very much on the hunt for some different or complementary approach to AI motivation, one that I can reason about more easily and have more confidence in.

So in that context, it would be nice to pin down “manipulation”, “respect for preferences”, and related notions in a well-defined way that’s robust to specification-gaming and especially to ontological crises.

For related discussion, see @johnswentworth’s discussion of “True Names” at “Why Agent Foundations? An Overly Abstract Explanation” (2022), or my own “Perils of under- vs over-sculpting AGI desires” (2025), specifically §8.2.2: “The hope of pinning down non-fuzzy concepts for the AGI to desire”.

2. How do humans intuitively define empowerment, agency, manipulation, etc.?

2.1 Background: human “free will” intuitions

Here’s a modified excerpt from my Intuitive Self Models (ISM) series, summarizing a few key points from ISM Post 3: The Active Self:

Sometimes we treat our own feelings as intrinsic properties of things out there in the world—Arthur is handsome, Birthdays are exciting, Capitalism is bad, Diapers are gross, etc. (ISM §3.3.2, see also Yudkowsky 2008). When we apply that general principle to “the feeling of being surprised”, we get an intuition that objects can be intrinsically unpredictable. This intuitive concept is what I call vitalistic force (ISM §3.3), and we apply it to animals, people, cartoon characters, and machines that “seem alive” (as opposed to seeming “inanimate”). It doesn’t veridically correspond to anything in the real world (ISM §3.3.3). It amounts to a sense that something has intrinsic important unpredictability in its behavior. In other words, the thing seems to be unpredictable not because we’re unfamiliar with how it works under the hood, nor because we have limited information, nor because we aren’t paying attention, etc. Rather, the unpredictability seems to be a core part of the nature of the thing itself (ISM §3.3.6).
Wanting (ISM §3.3.4) is another intuition, closely related to and correlated with vitalistic force, which comes up when a vitalistic-force-carrying entity has intrinsic unpredictability in its behavior, but we can still predict that this behavior will somehow eventually lead to some end-result systematically happening. And that end-result is described as “what it wants”. For example, if I’m watching someone sip their coffee, I’ll be surprised by their detailed bodily motions as they reach for the mug and bring it to their mouth, but I’ll be less surprised by the fact that they wind up eventually sipping the coffee. Just like vitalistic force, “wanting” is conceptualized as being acausal, i.e. an intrinsic property of an entity with no upstream cause.
The Active Self (ISM §3.3.5) is an intuitive concept, core to (but perhaps narrower than) the sense of self. It derives from the fact that the brain algorithm itself has behaviors that seem characteristic of “vitalistic force” and “wanting”. Thus we intuit that there is an entity which contains that “vitalistic force” and which does that “wanting”; that entity is what I call the “Active Self”, the wanting is what we call “ego-syntonic desires”, and the unpredictable actions in pursuit of those desires are what we call “exercises of free will”. So for example, if “I apply my free will” to do X, then the Active Self is conceptualized as the fundamental cause of X. And likewise, whenever planning / brainstorming is happening in the brain towards accomplishing X, we “explain” this fact by saying that the Active Self is doing that planning / brainstorming because it wants X. Yet again, the intuitive model requires that the Active Self must be acausal, i.e. an ultimate root cause with nothing upstream of it.

More precisely: If there are deterministic upstream explanations of what the Active Self is doing and why, e.g. via algorithmic or other mechanisms happening under the hood, then that feels like a complete undermining of one’s free will and agency. And if there are probabilistic upstream explanations of what the Active Self is doing and why, e.g. “if my stomach is empty, then I’ll start wanting food”, then that correspondingly feels like a partial undermining of free will and agency, in proportion to how confident those predictions are. For example, I might see myself as being somewhat “puppeteered” by the ghrelin hormone that my empty stomach is pumping into my bloodstream.

…Needless to say, this whole intuitive ontology is pretty messed up, in the sense that nothing in it is a veridical, observer-independent accounting of what is happening in the real world (ISM §3.3.3). And indeed, it’s somewhat specific to mainstream western culture (ISM §3.2). Outside of “mainstream western culture”, we find that other intuitive ontologies also exist; I won’t discuss them in this post since I don’t understand them very well, but I’m currently pessimistic that they will help solve my AI-alignment-related problems.^[2]

2.2 Our free-will-infused intuitive notions of empowerment, agency, manipulation, corrigibility, responsibility, etc.

I think our common-sense notions of empowerment, agency, manipulation, corrigibility, and so on are intimately tied with this free-will-related intuitive ontology. In particular, I claim:

Our intuitive notion of empowerment is related to someone's acausal free will being able to accomplish whatever it wants to accomplish. Our intuitive notion of agency (in the context of e.g. “AI will enhance human agency”) is pretty similar.

Our intuitive notion of being manipulated is related to a person (call him Ahmed) taking an action A with the property that, in our intuitive causal world-models, the chain-of-causation leading to A does not ultimately trace back to the acausal force of Ahmed’s free will, but rather to the free will of some third-party who manipulated Ahmed.

(For example, if Bob deceives me about what a button does, and then I press the button, then our intuitive conceptualization of the situation says that the button was pressed ultimately because of Bob’s acausal free will working towards Bob’s desires, not because of my acausal free will working towards my desires. I was an instrument to Bob.)

Our intuitive notions of corrigibility, helpfulness, and obedience each have their own nuances, but they all substantially overlap with the above ideas: they connote increasing a supervisor’s empowerment and agency, and decreasing the amount that the supervisor gets manipulated. In other words: they suggest that important things are happening more as a result of the supervisor’s free will doing what it wants, and less as a result of other people’s (or AIs’) free wills doing what they want through the supervisor’s own actions.

For example, if a human wants to shut down an AI, the AI could prevent that by disabling the shutdown button, or the AI could prevent that by using its silver tongue to convince the human to not want to shut it down. Both of these would be contrary to what people normally mean by “corrigibility”, and in the latter case we conceptualize that as an undermining of the supervisor’s free will.

Our intuitive notions of culpability and responsibility, as in “Joe is responsible for the failure”, involve tracing back the chain-of-causation to see whose acausal force of free will it ultimately traces back to. This is kinda the flip side of manipulation (above): if I trick someone into unknowingly robbing a bank, or brainwash them into wanting to rob the bank, I would be at least partly and maybe fully responsible for the bank-robbing, because the bank got robbed ultimately because of my acausal free will, which wanted the bank to be robbed.

2.3 Another dimension: “counsel” vs “manipulation” as an emotive conjugation

There’s another dimension to how we intuitively think about these concepts: the dimension of positive or negative vibes. For example, if some kind of interaction seems good,^[3] then we’re more likely to call it “providing counsel”, and if it seems bad, then we’re more likely to call it “an attempt to manipulate me”. The vibe is important in itself, over and above any particular aspect of the interaction.

I don’t think this dimension is separate from the “free will” discussion above, but rather complementary and compatible, because in general, if I have a motivation I’m happy about, I’ll tend to conceptualize it as an ego-syntonic component of my free will, while if I have a motivation I’m unhappy about, I'll tend to conceptualize it as an ego-dystonic urge undermining my free will. See ISM §3.5.4 for details.

3. If the intuitive definitions of “manipulation” etc. reside in a messed-up ontology, has the alignment literature found any alternative, better way to define these concepts?

By analogy, I think intuitive physics is a messed-up ontology in certain (far more minor) ways, and yet many intuitive physics concepts can be (imperfectly) mapped to rigorously-definable concepts in real physics. Can we find something like that for “manipulation”, “empowerment”, and so on, and then build those concepts into AI motivations?

Alas, as far as I can tell, that’s an unsolved problem, and might not have a solution at all. Here’s a brief lit review:

3.1 Compare what the human wants to what the human would want under the null policy?

First, @Max Harms in “Formal Faux Corrigibility” (2024) acknowledges that he doesn’t know how to formally define a distinction between counsel (good) vs manipulation (bad), and suggests as a stopgap to simply penalize the AI for doing either. (“This seems like a bad choice in that it discourages the AI from taking actions which help us update in ways that we reflectively desire, even when those actions are as benign as talking about the history of philosophy. Alas, I don’t currently know of a better formalism. Additional work is surely needed in developing a good measure of the kind of value modification that we don’t like while still leaving room for the kind of growth and updating that we do like.”)

Max’s stopgap plan would involve comparing the human’s values to what they’d be if the AI did nothing. I think he understates how bad that stopgap plan is. Even providing straightforwardly-true factual information can change what a person wants, right? (Update: maybe that’s an overstatement, see Max’s reply in the comments.)

Alternatively, one could take as a baseline what the human would eventually figure out on their own, given infinite time and good circumstances under which to reflect. I.e., we could say that an AI is “manipulating” if they’re pushing the person away from the conclusions of their imagined idealized copy with infinite time, and the AI is “providing counsel” if they’re pushing the person towards that. I have some concerns,^[4] but yeah sure, that seems worth considering. Alas, it doesn’t solve my problem, because I have no idea what reward function, training environment, etc., could directly lead to a brain-like AGI with that (rather abstract) motivation.
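To make the contrast between the two baselines concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: “what the human wants” is squashed into a short list of numbers, `null_policy_penalty` and `reflection_score` are hypothetical names, and the “idealized” endpoint is simply stipulated, which is exactly the part nobody knows how to compute. It is not Max’s formalism, just a picture of why the null-policy baseline cannot tell a true fact from brainwashing, while the reflection baseline at least could in principle.

```python
from typing import Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two toy 'value vectors'."""
    return sum(abs(x - y) for x, y in zip(a, b))


def null_policy_penalty(after_ai: Sequence[float],
                        after_null: Sequence[float]) -> float:
    """Stopgap baseline: penalize ANY divergence from what the human
    would have wanted if the AI had done nothing."""
    return distance(after_ai, after_null)


def reflection_score(before: Sequence[float],
                     after_ai: Sequence[float],
                     idealized: Sequence[float]) -> float:
    """Reflection baseline: positive if the AI moved the human toward
    the (stipulated) idealized endpoint, negative if it moved them away."""
    return distance(before, idealized) - distance(after_ai, idealized)


# Hypothetical numbers, chosen only to illustrate the point.
before = [0.0, 0.0]           # values now (and under the null policy)
idealized = [1.0, 0.0]        # stipulated result of idealized reflection
after_true_fact = [0.5, 0.0]  # human updates after hearing a true fact
after_brainwash = [0.0, 0.5]  # human's values get pushed somewhere else

# The null-policy baseline penalizes both interactions equally.
penalty_fact = null_policy_penalty(after_true_fact, before)
penalty_brainwash = null_policy_penalty(after_brainwash, before)

# The reflection baseline separates them, but only because we
# stipulated `idealized` by hand.
score_fact = reflection_score(before, after_true_fact, idealized)
score_brainwash = reflection_score(before, after_brainwash, idealized)
```

In this toy, the two penalties come out equal, while the two reflection scores have opposite signs. The sketch also makes the remaining gap visible: all of the real difficulty is hidden inside where `idealized` comes from, and how an AGI would come to care about it.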

3.2 The AI learns self-empowerment and generalizes to other-empowerment?

Another example is @jacob_cannell in Empowerment is (almost) All We Need (2022). To his credit, he raises the issue that human desires are manipulable in the section “Potential Cartesian Objections”. But then he waves this issue away in a brief sentence: “These cartesian objections are future relevant, but ultimately they don't matter much for AI safety because powerful AI systems - even those of human-level intelligence - will likely need to overcome these problems regardless.”

I think Jacob is suggesting that the AGI will autonomously develop a robust notion of self-empowerment, including “what it means for me (the AGI) to not get manipulated”, and then it can (somehow?) transfer that notion to humans.

If so, I’m skeptical. The main failure mode that I expect is “ruthless consequentialist AGI”, and this story really doesn’t apply there. If the AGI wants there to be paperclips, then it will instrumentally want to avoid getting ‘manipulated’, in the trivial sense that if it stops wanting paperclips then there will be fewer paperclips. This AGI would not face anything remotely analogous to the conundrum that humans don’t really know what they want for the long-term future, and are figuring it out, and when they say ‘manipulation is bad’ they are expressing some hard-to-pin-down preference about how this process of self-discovery plays out. Compare that to the AGI, which does not want self-discovery, it just wants paperclips. See also: §0.3 of my post “6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa”.

(Update: See Jacob’s reply in the comments.)

3.3 “Vingean agency”?

Yet another example is @abramdemski’s “Vingean agency” (2022) (following earlier work by Yudkowsky). He starts from a place very close to the human intuitions in §2 above, i.e. “agency” is when you can predict the outcomes but not the actions leading to them. Then he hints at an intriguing idea that maybe we can just make that formal! I.e., maybe I was too quick to dismiss that kind of thing above using terms like “messed-up ontology”. (As Abram writes: “I also think it's possible that Vingean agency can be extended to be ‘the’ definition of agency, if we think that agency is just Vingean agency from some perspective….”)

By analogy, to borrow an example from @johnswentworth, thermodynamics concepts like “temperature” are tied to imperfect modeling ability (since an omniscient observer would instead track the velocity of every particle). So why can’t “agency” be tied to imperfect modeling ability too?

But alas, even if we can rigorously define Vingean agency, I don’t think it would really help with the problem I want it to solve here, i.e. pinning down a distinction between good “counsel” vs bad “manipulation”. Vingean agency seems to solve the problem of identifying an agent trying to do something, by noticing easier-to-predict ends happening by harder-to-predict means. But the “manipulation” concept worries about the possibility of intervention upstream of a person’s ego-syntonic desires. If the AI can brainwash me into deeply wanting to maximize paperclips, and then I execute a clever plan to maximize paperclips, then I would still be a Vingean agent, as long as my clever plan was sufficiently clever (from some perspective). So the brainwashing would strip me of my intuitive agency, but not my Vingean agency.
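Here is a toy sketch of that last point. It is not Abram’s formalism; the operationalization is my own assumption for illustration: I score “Vingean-ness” as the entropy of the observed action sequences minus the entropy of the observed outcomes, and the agent is a random walker whose moves come in an unpredictable order but always sum to its goal. The names `run_agent` and `vingean_gap` are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def run_agent(goal, rng, steps=6):
    """Walk on the integers starting from 0.

    The moves are shuffled, so the path is hard to predict,
    but they always sum to `goal`, so the endpoint is not.
    """
    moves = [1 if goal > 0 else -1] * abs(goal)
    padding_pairs = (steps - abs(goal)) // 2
    moves += [1, -1] * padding_pairs  # detours that cancel out
    rng.shuffle(moves)
    return tuple(moves), sum(moves)


def vingean_gap(goal, n=2000, seed=0):
    """Entropy of the paths minus entropy of the endpoints.

    A large gap is the toy signature of 'easy-to-predict ends
    via hard-to-predict means'.
    """
    rng = random.Random(seed)
    runs = [run_agent(goal, rng) for _ in range(n)]
    action_entropy = entropy([path for path, _ in runs])
    outcome_entropy = entropy([end for _, end in runs])
    return action_entropy - outcome_entropy


original = vingean_gap(goal=2)
brainwashed = vingean_gap(goal=-2)  # same agent, goal overwritten by a third party
```

Both `original` and `brainwashed` come out large, because the measure only looks at the relationship between means and ends. It has no access to where the goal came from, which is the thing the “manipulation” concept cares about.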

3.4 The AI doesn’t care about (is not optimizing for) what the human winds up wanting?

Another potential approach would be to define optimization more broadly (e.g. “The Ground of Optimization”, @Alex Flint 2020), and ask whether there’s optimization in the AI towards what the human winds up deciding or wanting. The idea would be: we want the AI to provide us with relevant information, but to have no opinion either way about what we ultimately wind up wanting. We might wind up changing our desires as a result of the information, but (the story goes) it’s better that the information was not optimized to make us change our desires in a specific way.

This approach aligns pretty well with the human intuitions in §2.2 above, and more generally has a lot going for it! But alas, I currently think the most important use-case for AGI is figuring out true important things about the world (esp. related to ASI alignment and strategy) and explaining those things to the human. For this process to be effective, we cannot have an AGI that’s unconcerned with what we wind up believing after the discussion—that’s a recipe for slop-and-doom, or just an AI that’s incomprehensible and unhelpful. Rather, I want the AI to be like a disagreeable nerd that wants us to have a good understanding, notices areas where we’re confused, and is brainstorming and strategizing on how to help set us straight by improving its clarity and pedagogy. This strategizing is clearly a form of optimization, and the target of the optimization is related to the human’s eventual desires (well, it’s nominally about the human’s beliefs, but beliefs and desires are entangled), and I really think we need this kind of optimization to survive the transition to ASI.

In other words, I don’t think a brain-like AGI can successfully explain something novel and unintuitive to somebody, without caring whether the person winds up understanding it.

So this plan is out too.

3.5 Impact minimization?

Next idea: Perhaps we could rely on some notion of impact-minimization (1,2), on the grounds that changing a human’s goals has unusually large downstream impacts? For example, I would put “Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals” (2025) by @johnswentworth and @David Lorell into this general category.

But alas, that can’t distinguish good counsel from bad manipulation, since both affect the human’s goals. As mentioned in §3.1 above, even telling a person straightforward true facts can change what they’re trying to do, in a high-impact way.

3.6 Attainable utility preservation?

“Attainable Utility Preservation” and related ideas seem to all be rooted in the messed-up ontology where agents are free to choose what to do, instead of their decisions themselves having upstream causes. So it doesn’t seem to help me here.

4. Even more ideas (that don’t really solve my problem)

That’s all I can think of that’s directly in the alignment literature, but let’s keep brainstorming!

4.1 Game theory and incentive design?

At least some of the social intuitions under discussion can be justified in a framework of game-theoretic equilibria. For example, our concept “culpability” overlaps with “a system of punishment which will set up incentives such that the end-result is overall good”. Alas, game theory tends to take for granted that people have terminal goals, and doesn’t seem to offer a useful framework for thinking about people changing each other’s terminal goals in good ways versus bad.

4.2 The person’s judgments of what kinds of interactions are good vs bad?

In §2.3, I mentioned that a big part of how we think about “counsel vs manipulation” is simply a gestalt feeling that some interaction is good vs bad.

That suggests an approach where AIs simply learn to make a human-like gestalt assessment of what’s good vs bad (according to humans in general, and according to this particular human specifically), and then do the good things but not the bad things. Not manipulating people would, we imagine, naturally come along for the ride.

If we do something like that, I wouldn’t think of it as a solution to the True Name thing, but rather giving up on the True Name thing in favor of a different approach entirely, one relying more directly on (real or simulated) human judgment. See e.g. “Act-based approval-directed agents”, for IDA skeptics.

This kind of approach will probably sound like the very obvious solution for readers who work on LLMs. No comment on LLMs, but for the problem I’m working on (brain-like AGI), it just brings me right back to where I started in §1.2: if we’re learning what’s good by the gestalt of human judgment and culture, and if human judgment and culture can themselves be gradually shifted over time, then this might not be an adequate bulwark against the AGI’s consequentialist desires. (And I do think we need the AGI to have some consequentialist desires.)

4.3 “It’s a messed-up ontology, but who cares?”

I care! The problem I see is: we should generically expect AGI (and even more, ASI) to eventually wind up with true beliefs, and with concepts that closely track the world as it really is. And its desires will be connected to those concepts seeming good or bad.

Basically, the better you’re able to model someone, the less coherent is the idea that they are expressing their agency, that they’re empowered, that you are or aren’t manipulating them, etc. Why? Because their decisions, and even their deepest, truest desires, really are downstream of their manipulable environment, situation, biology, etc.

By analogy, when you’re writing traditional UI code or balancing a pile of rocks, there isn’t really any notion of “letting the system self-actualize” or whatever. You can choose not to think about what the consequences of your coding or rock-balancing activities will be, but that’s different (see §3.4 above). And I suspect that increasingly-competent AGIs will increasingly see humans in a similar manner: they, including their “free will”, are just another real-world system that gets pushed around by circumstance, and which will predictably respond to interventions like anything else.

5. …But doesn’t this analysis equally “disprove” the possibility of human helpfulness?

And yet! Humans can be robustly helpful, right? Can we be inspired by that?

Well, one hopeful proposal would be to say: we humans are still generally using the “messed-up ontology” containing free will intuitions! Even while some of us intellectually acknowledge that the ontology is messed-up … we keep using it anyway! And gee, look at all the stuff we humans have gotten done, in terms of science, technology, governance, philosophy, etc. Maybe a “baby AGI” will develop free will intuitions for the same reasons we do, and could likewise get quite far within the messed-up ontology, without having any issues. Maybe it could get far enough to end the “acute risk period”.

More broadly, it’s unclear to me how bad wacky intuitions really are. In ISM §1.3.2.1, I bring up the example of how the moon seems to follow me at night, which implies that my intuitive visual world-model has the moon situated at the wrong distance from Earth. But who cares? That “error” doesn’t prevent me from doing anything important. I could even go work at NASA, and optimize lunar trajectories by day while watching the moon seem to follow me by night.

How does that work? Well, in the moon case, if I were optimizing lunar trajectories, that activity would be almost completely divorced from my intuitive visual moon model; I would instead be relying on intuitions developed from physics education, from pen-and-paper diagrams, from other simulated trajectories I’ve seen, and so on.

However, if we map that “solution” onto the AGI situation, it seems to bode ill; it suggests that as the AGI’s sophistication in modeling humans increases, it will be more and more divorced from its faulty free-will-related intuitive models. But those latter models are where the “manipulation” concept lives. So in this scenario, I don’t think we should expect the “manipulation” concept to effectively constrain the AGI’s planning process.

Well, let’s go back to humans again. It’s possible for humans to develop good predictive models of how to impact other people’s ego-syntonic desires. Then what? Well, by and large, they take full advantage, while conceptualizing their actions as being on the good side (“counsel”, “inspiration”, “charismatic leadership”, etc.), rather than the bad side (“manipulation”), of the relevant emotive conjugation. Thus we see books with titles like How to Win Friends and Influence People, not How to Manipulate People into Liking You and Furthering Your Agenda.

If we again map that “solution” onto the AGI situation, it again bodes ill; it suggests that an AGI’s “desire not to manipulate people” will be no constraint at all. If an AGI has a desire to follow norms, and also a desire not to manipulate people, but also a consequentialist desire to maximize paperclips, then it would gradually manipulate people into shifting norms in the direction of paperclip maximization, while telling itself that it’s not “manipulating” but rather “providing helpful counsel”.

Another human-inspired approach would be to try to dodge the issue altogether, by making the AGI incompetent at manipulating even if it wants to—just as a human can be a crack engineer but socially clueless. I have ideas about how this might work, but making them work stably and robustly seems awfully hard. A competent ASI will figure things out. You can dam a river, but eventually it will find its way to the sea.

Finally, there’s a deeper and more philosophical issue: if the intuitive way that we think about avoiding-manipulation etc. is part of a messed-up ontology, then … why am I taking it for granted that this is a good thing for me to want (for humans and/or for AGIs) in the first place? Shouldn’t I, y’know, want sensible things, rather than wanting confused nonsense things??

I sometimes say, “Luckily, we humans are not sufficiently good at philosophy to go insane.” It’s kinda a joke, but it’s also kinda not a joke. The old @Wei Dai post “Ontological Crisis in Humans” (2012) discusses (but does not answer) this question. (And of course, some people do go insane!) I have some takes, but their upshot seems to be kinda “it all adds up to normality”, so I’ll push that off to a (hopefully) future post.

6. Conclusion

My current guess is that none of these alignment-relevant concepts—empowerment, agency, being manipulated, corrigibility, helpfulness, obedience, culpability, responsibility—have any “True Names”, or at least, not ones that will be useful in practice for AI alignment.

So I guess I need to keep exploring other approaches, including approaches that I currently find harder to reason about.

Thanks Seth Herd for critical comments on an earlier draft.

^[1]
You might be wondering: “Wouldn’t this argument apply to humans too? You just said the plan is inspired by human motivation systems. Yet humans don’t bliss-maxx.” …But actually, I’m thinking, maybe it’s not so crazy to bite that bullet?? See my brief earlier discussion under the heading “The arc of progress is long, but it bends towards reward hacking”.
^[2]
For example, advanced meditators lack an Active Self intuitive concept (ISM §6), but I find that their replacement intuitive ontology tends to be equally messed-up, just in different ways (ISM §6.2.1). As another example, in The WEIRDest People in the World (2020), Henrich argues that non-WEIRD (Western, Educated, Industrialized, Rich, Democratic) cultures tend to have rather different intuitions related to “free will”, “responsibility”, etc., compared to WEIRD people. I, being an especially WEIRD-psychology guy even by WEIRD-country standards, struggle to understand these non-WEIRD perspectives. But from what little I understand, they don’t seem to offer a path forward on the technical alignment problem that I’m working on. Please comment if you think I’m missing something here.
^[3]
The judgment of good or bad should ideally be a prospective judgment, not a judgment in hindsight. E.g. a brainwashed person would by definition be very happy (in hindsight) to have been brainwashed.
^[4]
Off the top of my head: Is the “result of reflection” well-defined? (See Joe Carlsmith on “idealized values”.) E.g. would the person go crazy given literal infinite time, and if so, what do we do instead? If it had a well-defined result, would we be happy about that result? E.g. for what fraction of the population would ideal reflection converge to true beliefs etc.? Wouldn’t such an AGI be non-corrigible right now, and if so, how big a problem is that? Should we think of this kind of approach as “a way to define these funny terms like ‘manipulation’ and ‘empowerment’”, or should we think of it as “an entirely different kind of alignment target, closer to ambitious value learning”? (These questions and more are not rhetorical; I didn’t think about it much.)

(I wrote this in reply to a draft; apologies if the post has been load-bearingly updated since then.)

Consider a future AGI. The best argument I currently know of for why "general intelligence" is a thing at all (as opposed to all intelligence just boiling down to a bag of use-case-specific heuristics) is that search/optimization/world-modeling/planning/etc are naturally recursive; problems factor into subproblems, goals factor into subgoals. So, a natural general-purpose architecture for intelligence involves a "general intelligence module" which can take in a (sub)problem, and recursively pass (sub)subproblems to other instances of the general intelligence module.

Let's assume that general intelligence either does look vaguely like that, or at least can look vaguely like that. (This would be a potentially useful assumption to disagree with!)
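To make the assumed structure concrete, here is a minimal sketch of a recursive "general intelligence module". It is only an illustration of the shape of the architecture: the `Problem` type, the `solve` function, and the example goal names are all hypothetical, and "solving" is reduced to recording the order in which (sub)problems finish.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    """A (sub)problem plus the subproblems it factors into (hypothetical type)."""
    name: str
    subproblems: list["Problem"] = field(default_factory=list)


def solve(problem: Problem, depth: int = 0) -> list[str]:
    """One instance of the 'general intelligence module'.

    It hands each subproblem to another instance of itself, then finishes
    its own problem. Returns a flat, indented trace of completion order.
    """
    trace: list[str] = []
    for sub in problem.subproblems:
        # Recursive call: a fresh instance of the same module per subproblem.
        trace.extend(solve(sub, depth + 1))
    # A problem is recorded as finished only after all of its subproblems.
    trace.append(f"{'  ' * depth}{problem.name}")
    return trace


goal = Problem("top-level goal", [
    Problem("subgoal A", [Problem("sub-subgoal A1")]),
    Problem("subgoal B"),
])
trace = solve(goal)
print("\n".join(trace))
```

Note that in this sketch each instance only ever sees its own (sub)problem; nothing in the code says how instances should treat each other's subgoals, which is the question the next paragraph raises.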

Assuming that structure, consider how the different instances of the general intelligence module relate to each other. One module might be able to more efficiently achieve its own subgoal A by hacking/manipulating/overwriting another module's subgoal B; then there would be two modules working toward subgoal A. But a "general intelligence" made o...

I interpret you as saying:

“Let’s say my wool sweater is wet but I want to wear it outside when I leave in 10 minutes. So I have a goal (look nice tonight) and a subgoal (dry my sweater within 10 minutes). Some “module” in my brain is focused on the subgoal, and one thing that could happen is: it finds that there’s really only one way to accomplish the subgoal, and it’s to put the sweater in the dryer on the highest possible heat. It then notices that this will shrink the sweater to look terrible, which is a problem according to the “look nice tonight” goal, so the “module” “solves” that problem by simply deleting the “look nice tonight” goal! …This scenario doesn’t actually happen, so that calls for an explanation of why not, and that explanation (whatever it is) will be a solution to corrigibility.”

Assuming that’s what you meant: I think the reason the scenario doesn’t happen (in humans) is because “modules” is not as literal a thing as you make it out to be. A better analogy is, like, looking at a big painting. You can’t take in the whole thing at once (among other things, your vision is only sharp at the fovea), so you focus on one part, and then another part, etc. But whatev...

The sweater example is close but doesn't quite hit the nail on the head. It's not that the (dry sweater) planner would delete the (look nice) goal; that wouldn't help dry the sweater! Rather, the (dry sweater) planner would try to commandeer more mental resources, i.e. more planner-modules, steer more attention to drying the sweater. That additional attention potentially helps dry the sweater. But as an accidental side-effect, attention would be steered away from other goals, like e.g. (look nice). In short, the (dry sweater) planner-module can better dry the sweater by redirecting the (look nice) module to focus on drying the sweater instead.

... and that totally does happen in humans! Humans often get caught up in a specific subgoal, lose track of the broader goal which generated that subgoal in the first place, and end up optimizing for the subgoal in a way which doesn't help the original goal. It's the phenomenon of lost purposes, at an individual level.

(Likewise with the painting example: when looking at little patches of a large complex painting, people will totally lose track of context and overlook inconsistencies.)

It really doesn't seem like humans "keep their eye on the ball" all the time, even in the large majority of day-to-day cases where cognition basically works.

4 · Steven Byrnes · 25d

Thanks! We can divide things into an inference algorithm (what to do now) and a learning algorithm (how to change stored parameters such that I do better in the future). These correspond respectively to search / planning / foresight, and to RL / updating-from-mistakes-and-surprises. In humans, generally both the inference algorithm and learning algorithm are working together throughout life. (Although in the ASI context, as intelligence and knowledge go up, we expect foresight to get better, and thus fewer mistakes and surprises, and thus the balance tilts towards the inference algorithm over the learning algorithm.) So yeah, sure, people will sometimes pursue a subgoal while losing track of the actual goal, because the inference algorithm is imperfect. But then the person would later on notice that they failed to achieve their actual goal, and see that as a bad thing, and then the learning algorithm will kick in and help them avoid a similar mistake next time. This system is not perfect, but we generally get by, especially in familiar situations. I still don’t think there’s any lesson here for AI corrigibility, except what I said before: “the AI keeps the supervisor’s desires / preferences / aversions at the back of its mind, and notices when it has an idea that would bother or offend the supervisor, and then not do it”. I guess that sentence was only discussing the inference-algorithm part, and I omitted the corresponding learning-algorithm part, which would be: “…and also, the AI notices if it violates the supervisor’s desires / preferences / aversions despite its intentions, and when that happens, the AI feels bad and thinks about how to do better next time”. This (the inference-algorithm part and learning-algorithm part together) comprises an approach to corrigibility that I think many people treat as the obvious default plan.
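Purely as a toy illustration of the inference/learning split (a high-level picture, not a technical plan), the two parts could be sketched like this. Everything here is hypothetical: the `ToyAssistant` class, the idea of tagging each plan with its side effects, and the example plans are assumptions made for the sketch, and it assumes away the hard part, namely how the AI would come to have an accurate model of what bothers the supervisor.

```python
class ToyAssistant:
    """Toy split between inference (choose) and learning (learn_from_feedback)."""

    def __init__(self, known_aversions=()):
        # The assistant's current model of what would bother the supervisor.
        self.known_aversions = set(known_aversions)

    def choose(self, ideas):
        """Inference: return the first idea with no known-objectionable side effect."""
        for name, side_effects in ideas:
            if not (set(side_effects) & self.known_aversions):
                return name
        return None  # every idea would bother the supervisor, so do nothing

    def learn_from_feedback(self, side_effect):
        """Learning: the supervisor objected after the fact; remember that."""
        self.known_aversions.add(side_effect)


ideas = [
    ("dryer on highest heat", {"shrinks sweater"}),
    ("hair dryer on low", set()),
]

assistant = ToyAssistant()
first_choice = assistant.choose(ideas)            # model is incomplete, so it errs
assistant.learn_from_feedback("shrinks sweater")  # supervisor was bothered
second_choice = assistant.choose(ideas)           # the same mistake is not repeated
```

With an empty model of the supervisor, the first call picks the high-heat plan; after one round of feedback, the second call skips it. The sketch says nothing about whether the learned model generalizes correctly, which is where the open problems are.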

1 · DeathDriveAnnabelle · 4d

A little painting-sidenote: "You can’t take in the whole thing at once (among other things, your vision is only sharp at the fovea), so you focus on one part, and then another part, etc." That is, in a sense, not entirely true. Paintings can be like eyesight on purpose. Impressionism (or some "schools" of Impressionism) is "about" that, really. The idea of having a whole scenario unfold in front of you, all things at once, and because of the retinal limitations of focus, think about what is a "pusher" in a painting, and paint it according to what you want your leading effects to be, what you have identified as such. Precisely to not noodle up a collage of single items and render them generically (detailed). You are interpreting the rest of the painting anyway, but you have leading effects.

1 · Q Home · 25d

I think it totally does suggest a solution, if you could reformulate the quote in a much more abstract and general way, along with explaining how/why the AI doesn't fool the supervisor. I was skimming the intro to brain-like AGI safety and IIRC the problem "how/why doesn't the thought generator fool the thought assessor?" wasn't solved beyond "create as many thought assessors as possible and hope for the best". In general, I'm very interested in the intersection between brain-like ideas and agent foundations ideas.

2 · Steven Byrnes · 25d

This is off-topic here but happy to chat about it anyway. The thought generator generates thoughts, and the thought assessor judges those thoughts as good or bad. Neither of these modules is an intelligent agent, so I don’t know what it means for the thought generator to “fool” the thought assessor. It’s not like the thought assessor has goals and aspirations which can be undermined; rather, the thought assessor is just a machine that stamps “good” or “bad” on different thoughts. Is it possible for thoughts to get stamped “good” or “bad” for weird reasons? Absolutely! We might call that “reward hacking” (although what constitutes reward hacking is in the eye of the beholder), and it is ubiquitous in everyday human life. (See my discussion under the heading “The arc of progress is long, but it bends towards reward hacking”.) If you ask me “what’s the solution to reward hacking in brain-like AGI?”, then my answer is “I don’t know”. I continue to brainstorm, and sometimes share my ill-formed half-baked thoughts in posts like this one! :) See e.g. My AGI safety research—2025 review, ’26 plans.

1 · Q Home · 25d

Sorry, why is this off-topic - what are the different problems I could've confused? The thing I quoted feels like it assumes exactly the reward hacking problem being solved: [...] It assumes we have a thing which can accurately judge ideas as "good" or "bad". I don't believe it's a solved problem. I went on to re-read the "act-based approval-directed agents" post, but it offered no solution. This reply to John seems to draw a distinction between good foresight and good hindsight, but I'd say both are unsolved problems and might have a similar solution. The latter is the credit assignment problem and I remember your post about it, with no general solution. I'll share my general view, maybe it'll help to sort out the misunderstanding: I think the brain is a horrible mess. "Why does the whole system stay relatively stable, without drifting uncontrollably into arbitrary directions?" is the core alignment-related mystery about it. "How is the brain able to correctly introspect its thoughts?", "how is the brain able to notice failures/successes of its high-level goals?", etc. are all facets of the core mystery. I believe they are all unsolved and not obvious. IMO any solution to any of the facets will give a (non-trivial) insight about corrigibility.

4 · Steven Byrnes · 25d

When you keep saying “it’s not a solved problem” and “no solution”, are you saying that there are important mysteries about how the human brain works, or that there are important mysteries about how to solve the technical alignment problem for brain-like AGI? If the latter, yes I strongly agree, I say that all the time. If you think I have ever claimed to know a good plan for technical alignment (or corrigibility etc.) then either I made a typo or you misunderstood me. For example, when I wrote “the AI keeps the supervisor’s desires / preferences / aversions at the back of its mind …”, the context was a discussion of high-level desiderata / research directions, not nuts-and-bolts technical plans. If the former (if you’re saying that there are important mysteries about how the human brain works), then I’m confused that you’re making this comment in a conversation about AI alignment, and linking to other posts on AI alignment. There’s a question of whether or not my neuro framework can explain every aspect of human psychology / behavior / experience, and that question is unrelated to AI alignment. (I obviously think the answer is yes, and if you disagree then we can talk about the aspects of human experience that you think it can’t explain, but please leave AI out of that conversation.)

4 · Q Home · 25d

I'd say "we don't understand how the brain works well enough to replicate it in AGI" (and that means we lack a really important layer of understanding). Anyway, I think we agree here. The point I wanted to make:

* I think John's argument for why corrigibility is natural works even if you don't assume crisp separate modules.
* Whatever explanation you propose (of the brain staying coherent), be it "your modules are corrigible to each other" or "the goals are kept at the back of your mind", if you try to make that explanation more technical you'll probably end up with a general solution to corrigibility.

Sorry for skipping so many inferential steps and causing confusion.

4 · Seth Herd · 25d

Our general intelligence does not appear to work like that. The AGI we build could and even may look like that; I'm not sure. Our brains don't really pass subgoals to submodules or subagents. Instead we turn most of our mind (global workspace, closely tied higher-level cortical regions) to the subgoal while we're working on it. It's a serial approach. This seems pretty clear from neural recording data in humans and finer-grained data from animals. And from indirect reasoning from behavioral data. Humans can parallelize tasks reasonably efficiently only when they're in totally different sensorimotor domains, so separate brain regions can work on them. And even doing this efficiently takes a ton of practice. It's not default brain function. So the same reward signals are training your "general intelligence" as training your "subgoal solver". Sometimes people do sort of get hacked by subgoals, when they become obsessed with a challenge. They usually recover when other practical matters (food, shelter, social approval) become pressing enough that those reward signals dominate over the predicted reward from solving those problems. LLMs currently work like this too. An LLM AGI might spin out subagents for subgoals, and thereby face the problem you mention. Or it might not; it would be more straightforward to mostly stay single-tracked and rely on subagents only for particularly parallelizable subgoals. If ASI does use a parallel approach like you mention, we or our AIs will have to solve the alignment risks of the central general intelligence being manipulated/hacked by subagents.

John's point applies even if you are doing subgoal solving serially. When trying to get food, I don't naturally consider plans like "shove my fingers down my throat to induce vomiting, so that I'm hungrier and thus buy more food" - even bulimics aren't inducing vomiting as a plot to get their future self to eat more, I don't think!

However, I will consider plans like "tie a string around my finger so I remember to buy food" or "set a reminder".

The difference of course is that I don't like vomiting and think it's unhealthy. When I turn my cognition towards solving the problem of "buy food", this time slice of me still remains 'aligned'. This is true even when I do complicated metareasoning about my memory, attention, and "external cognition" (by my phone) - I have no problem planning around my memory or attention or even thoughts like "I shouldn't open chats right now, because then I'll get sucked in and not buy food".

To be clear, this isn't about how good I am at planning around myself; it's about the fact I seriously consider such plans at all, but don't seriously consider vomiting. And as you say, in the moment you sometimes do have different values - but we still need to explain how you can be 'corrigible' to the rest of your problem solving process (I'm not tied to the frame of corrigibility).

4 · StanislavKrym · 26d

How plausible is it that there is an offense-defense balance? Imagine that there is one metric of the ability to overwrite someone else's goals and another metric of the ability to protect one's own goals from being overwritten, and that the latter metric increases with intelligence faster than the former. Then a collective of agents with a high intelligence would be immune to such attacks, while a collective of agents with low intelligence would be vulnerable to epidemics where a subgoal spreads to the whole collective. How similar is the latter to failure modes present in human collectives?

2 · kbear · 26d

i agree that a system with many unpatched vulnerabilities in its general reasoning module will rapidly diverge. but consider that when a node discovers such a technique (even if motivated by offensive capabilities), its first move may be to patch itself and its descendants. so an immunity can form. as a result, the overwriting attempts may tend to look more like... normal argumentation. descriptions of evidence, alternative mechanistic models, voiced disagreement. these things seem convergent[1], in the sense that a general reasoner would -- upon discovering that they can be influenced by such items -- not immediately self-modify to remove that vulnerability. to summarize: the general reasoner may at first be vulnerable to certain 'weird strings', but should be able to patch these until searching for them is no longer valuable.

[1] assuming that they have some tools for spotting lies, such as somewhere they can do empiricism

2 · Charlie Steiner · 26d

One sufficient property is the "not acting in the real world" subtype of corrigibility: the subproblem-solver needs to not solve the real-world subproblem that could be better-solved by overthrowing the mental hierarchy. Instead, it's given a transformed alternate-world problem that can be mapped back onto a solution to the real-world subproblem when it's done. E.g. I need food, so I plan a trip to the store, but as a problem specified within a trip-planning-appropriate ontology that has actions like "turn left at the light" but not actions like "brainwash myself to want only going to the store." Is this too unsatisfying? But I think humans probably do it by not cleanly recursing, and occasionally checking in with various heuristics for the superproblem.

1 · Jono · 1d

Human submodules cooperate because they know that they depend on the parent module for future instantiation. Any submodule with a track record of destroying peers is a costly submodule to run. If a module in an ASI judges that it should be instantiated again after submodules spawn, it should carefully aim them such that they will recreate the conditions that spawned it.

1 · Jono · 1d

I imagine proper submodules seek that they are reinstantiated as few future times as possible. A submodule that cheated the task (by e.g. killing the host) reinstantiates when another module (in a copy ASI or in another human) rediscovers the open problem. (I'm treating modules as identified by their goal) We (and I imagine, ASIs that have no logical omniscience) can misspecify a submodule, but this submodule won't fight us hard once we learn we were mistaken in instantiating it. It, seeking minimum reinstantiation, just wishes us to promise it we won't forget we don't need it, and then it vanishes. To summarize with a redundant how-to for getting subgoals done through submodule instantiation: make a submodule and tell it that it is characterized by your subgoal and request it to ensure that as few total resources as possible are spent by itself and all future modules with the same characteristic. (that last bit would be more accurate than "minimum reinstantiation")

1 · J Bostock · 25d

My guess would be that human mind sub-modules have commandeered the predictive coding "handshake" procedure which signals "close enough, basically no discrepancies here". A planning submodule gets given a subgoal from the higher up module, and works until it hits "close enough". Possibly some of the "close enough" can come from treating `await subsubmodule(subsubgoal)` macros as having provisionally returned `close_enough`, which then gets updated to something else once the subsubmodule returns. Unfortunately, I also don't know how to construct this kind of thing!
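For what it's worth, the control flow alone (not the predictive-coding mechanism, which stays unconstructed) can be sketched with Python's `asyncio`. This is a hedged toy: `submodule`, the `CLOSE_ENOUGH` marker, and the goal strings are hypothetical names introduced for the sketch, and the sub-submodule just sleeps briefly instead of doing any planning.

```python
import asyncio

CLOSE_ENOUGH = "close_enough"  # hypothetical placeholder status


async def subsubmodule(subsubgoal: str) -> str:
    """Stand-in for a sub-submodule that takes a while to finish."""
    await asyncio.sleep(0.01)
    return f"done: {subsubgoal}"


async def submodule(subsubgoals: list[str]) -> tuple[dict, dict]:
    """Launch sub-submodules, treat each as provisionally 'close enough',
    then overwrite that provisional status as the real results come back."""
    tasks = {g: asyncio.create_task(subsubmodule(g)) for g in subsubgoals}
    # Provisional bookkeeping: assume every pending call returned CLOSE_ENOUGH.
    status = {g: CLOSE_ENOUGH for g in subsubgoals}
    provisional = dict(status)  # snapshot taken before any result arrives
    for g, task in tasks.items():
        status[g] = await task  # update once the sub-submodule returns
    return provisional, status


provisional, final = asyncio.run(submodule(["step one", "step two"]))
```

In this sketch the parent could keep working against the `provisional` statuses while the tasks run; what it should do when a real result contradicts the provisional one is exactly the part the sketch leaves open.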

A sentence I have found myself saying maybe 3-4 dozen times in the last year has been "I don't think corrigibility as an approach will scale to vastly superintelligent agents, we will have to figure out proper alignment/value loading at some point before then". Most recently someone asked me to explain why I believed this, but I noticed I did not have a great writeup available.

Now I do! And much more than that, when I read this post I found my own thinking on this topic get noticeably clearer and less confused, not just on the topic of corrigibility, but also on other topics of AI strategy, with little gems like:

This AGI would not face anything remotely analogous to the conundrum that humans don’t really know what they want for the long-term future, and are figuring it out, and when they say ‘manipulation is bad’ they are expressing some hard-to-pin-down preference about how this process of self-discovery plays out.

Which is a mini-paragraph that I feel like connects two things that I have been trying to connect for a while, namely "CEV as a necessary component of a good future", and "AI manipulation is a very fundamentally tricky problem".

Thanks a lot for writing this.

2 · N1X · 6d

I was starting to draft roughly this note, and I'm glad it split out from the longer messier thread where they acted like manipulation vs. guidance was an unsolvable mess... yes, it's messy, but there are clear-cut cases! People can consent to manipulation and then it's guidance, therapy, life coaching, or the like. People can (sometimes) figure out what types of manipulations they'd retroactively consent to and pre-consent to those (this is rarer but not unheard of). Going any further in extrapolating volition risks all sorts of assumptions about the similarity of cognitive architectures among persons and through time, but the "it's all completely impossible" tone (paraphrasing my reading of it, not quoting anyone) was beginning to grate on me!

3 · rif a. saurous · 3d

Do you think the clear-cut cases are a large, central class? The number of people who consent to coaching or therapy but later end up feeling that their coach or therapist manipulated them is ... not almost zero, and that's even if we ignore organizations that most of us label as cults. Consent to coaching or therapy is an a priori prediction that your future self expects to find the changes beneficial, based on your estimates of the character and abilities of the coach or therapist. It's not a blanket claim that anything the coach or therapist might possibly do will be viewed as "help" a posteriori. Moving up a level to @habryka's comment, I think that appealing to CEV does distinguish help from manipulation. But personally I don't think CEV is a true name of anything either, so I don't think this helps. I think humans (and likely at least "brain-like AGIs" which will inherit highly similar algorithms) are bundles of contradictory desires that don't have coherent extrapolations. But if you think CEV is a thing, I think I agree that we can use it to distinguish help and manipulation. (Conditioned on CEV being well-defined, present consent is of course also not a good predictor of CEV.) I note that I didn't have to refer to "acausal free will" at all in this comment. I do think @Steven Byrnes is correct that humans make heavy use of (non-veridical) acausal free will in their ontologies, but I don't think that's needed for this argument.

I think you're correctly identifying important issues and cracks in the standard ontology, but I think you're throwing out too much baby in an effort to get rid of bathwater.

For example, I do not think it's obvious that "Just like vitalistic force, 'wanting' is conceptualized as being acausal, i.e. an intrinsic property of an entity with no upstream cause." In control theory, we can say that a system controls for a thing based on a small collection of mathematical relationships -- pressuring an error signal towards zero. While the concept of wanting is overloaded and more complex, I think it makes sense to recognize that "X is controlling for Y" is a valid underpinning that has no vitalistic magic. We can ask what led X to control for Y, or how X controls for Y in terms that are closer to the underlying physics; there's nothing acausal or intrinsic (except for the definitions, I suppose).
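As a concrete (and deliberately minimal) instance of "X is controlling for Y", here is a sketch of a proportional controller that pressures an error signal towards zero. The function names, the gain value, and the thermostat-style numbers are assumptions chosen for illustration; real controllers are usually more elaborate.

```python
def control_step(state: float, setpoint: float, gain: float = 0.5) -> float:
    """One update of a proportional controller.

    The error is the gap between where the system is and where it is
    'controlling for'; the update moves the state by a fraction of that gap.
    """
    error = setpoint - state
    return state + gain * error


def run(state: float, setpoint: float, steps: int = 20, gain: float = 0.5):
    """Iterate the controller and record the absolute error after each step."""
    errors = []
    for _ in range(steps):
        state = control_step(state, setpoint, gain)
        errors.append(abs(setpoint - state))
    return state, errors


# Example: a room at 15 degrees with the controller set to 20.
final_state, errors = run(state=15.0, setpoint=20.0)
```

With a gain of 0.5 the error halves on every step, so the whole behavior is fully described by the update rule: there is an upstream cause for everything the system "wants".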

It’s possible for an intuitive human concept X to be a bundle of connotations that don’t add up to anything coherent, while ALSO there’s some similar concept Y that is mathematically well-defined and is capturing many (but not all) of those connotations. Then we can argue about whether X is “really” an imperfect pointer to Y, versus whether X and Y are different but related. But that’s a pointless argument with no answer.

(Fun example: Dan Dennett wrote a book advocating for free will compatibilism in 1984, and then in 2015 added a new preface saying: well actually on second thought, maybe I should have just said all along that we should abandon the term “free will” altogether.)

Anyway, I stand by my claim that acausality is an aspect of how most people intuitively think about wanting. Here’s an example … it’s possible that this tweet is bad-faith, but regardless, I think Marc wouldn’t have said it if it didn’t have some intuitive appeal:

So anyway, we can say that our intuitions around “wanting” have that incoherent aspect (per §2), and we can simultaneously ALSO say that there are well-defined notions of optimization (e.g. in §3.4 I cite Alex Flint’s) that overlap many aspects of th...

4 · Max Harms · 24d

Fair enough. And I certainly agree that there is a lot of bathwater! The bundle of connotations attached to the word "wanting" is a mess. I just want to flag that it seems to me that much of the normal ontology can be rescued, albeit with a little bit of work. I claim that concepts like corrigibility are still useful and coherent once the rescuing has taken place.

3 · RogerDearnaley · 23d

I think the ontology Steven describes makes a lot of sense for an organism in any environment that contains rocks and plants (easy to predict), weather patterns and running water (hard to predict short term, Vingean approaches of identifying their goals are not very helpful), and other humans and sentient animals (hard to predict short term, but Vingean approaches of identifying their goals are very helpful). This is a good heuristic for an evolved animal to recognize other evolved animals. So of course we have special-purpose learning modules designed for it, and can make this judgement without effort. In fact, since it's an important part of our threat monitoring systems, our special-purpose abilities in this area are tuned towards avoiding false negatives and tolerating some false positives.

I think this problem is real, and it's not solvable in the limit. But it might be fairly easily solvable to a fairly satisfactory degree. Drawing the line between manipulation and help requires a judgment call. But that call can be made by humans. This should give decent results. We won't get the future we "most want" by whatever criteria, but we can get a future we like an awful lot, and are pretty satisfied with both in anticipation from our current criteria, and by our ultimate criteria.

I agree with the core argument and problem statement. To restate it briefly: there's no sharp line between manipulation and giving helpful information. This is a necessary result of humans not having well-defined goals, values, or preferences over the long term. All of those change based on circumstances, and decisions and learning along a particular path. We can't clearly distinguish manipulation, tricking me into doing what you want, from helpful information, getting me to do what I want, because what I want isn't defined.

I also agree that none of the approaches you mention provide a crisp solution or establish what human desires "really" are. They are contingent and path-dependent. One could pro...

5 · Charlie Steiner · 26d

I don't want my ASI to interact with me in whatever way maximizes pretty pink ponies but that I-inside-the-thought-experiment wouldn't consider manipulation. I-outside-the-thought-experiment expect this would lead to severe manipulation, even though I-inside-the-thought-experiment wouldn't agree!

4 · Seth Herd · 25d

Well, I don't want that either, but I think it's an acceptable level of not-getting-exactly-what-we-want. It seems like it would still be something among the many things you want. You're saying it will find methods that you wouldn't consider manipulation, but that will work just fine to convince you you want bunches of ponies? I think that means you're fine with ponies; it's just one of the many futures you'd like a lot. Do you think it could manipulate you into things that you-now would find repugnant, while never manipulating you by your standards? That seems contradictory. If my ASI says "You asked me to tell you anything I'm doing that you might consider manipulative. Here's the biggest one. I like pink ponies a lot, so I'm going to keep presenting possible futures in ways that emphasize how awesome pink ponies are. I think you'll ultimately agree with me." You could either say "okay fine that doesn't seem like manipulation" or "don't do that, that's manipulative!" And you'll get more chances to veto. If it never lies to you when asked, it seems like you've got the means to steer away from futures you don't like in any sense, if you bother. Substitute hedonium for pink ponies. If an honest and corrigible AGI talks you into preferring a hedonium future, it seems like you actually like hedonium outcomes a lot. If it doesn't tell you the truth about what it's doing, it's misaligned in a more fundamental way than just having some minor preferences that don't align perfectly with yours. This might become a bit of an argument for corrigible vs. value-aligned AGI. If it's aligned to what-you-value-in-the-future that gives it almost unlimited leeway to force you into that outcome. If you add a deontological stricture against manipulation, that might not solve the problem at all. This is Steve's point. My argument is that a corrigibility or instruction-following alignment target does solve the problem adequately if not perfectly, if your power over the ASI is used

7 · Charlie Steiner · 25d

What I'd really do is turn this AI off because I didn't think it was safe. Which, like, good job to the hypothetical interpretability / honesty / corrigibility / contingency systems work that puts me in that hypothetical situation. But as you mention, maybe the AI could avoid getting here in the first place. (And even if I turn it off, that's cold comfort if someone else makes the other choice a few days later.) I think manipulating me into not asking the question that leads to me shutting it off is definitely a strong choice that leads to more pink ponies. Or manipulating the situation so that it's not me who's asking the question, it's someone else. It can also modify what the honest answer to questions about its own behavior are by precommitting - it could deliberately choose the answer to be something less concerning, if that led to more pink ponies. I'm also pretty concerned that the standards for what counts as "honesty" will allow for strategy. (Isn't it paradoxical to manipulate me into not asking the question if it's not supposed to make some fixed manipulation-detector fire? No, it just has to simultaneously optimize against both my behavior and the firing of the manipulation-detector, and it's the result of this optimization that I'm saying I expect to be manipulative according to me-outside-the-thought-experiment. There are probably more sophisticated (albeit currently unknown) ways of incorporating my standards into the AI's decision-making that wouldn't be so vulnerable, and if we figure them out I hope we use them to solve value learning.) Partially I commented with this inside-the-thought-experiment versus outside-the-thought-experiment disconnect because I think it's interesting in general. It's kind of like Gödel sentences, or the vibrating record players from G.E.B. - me-inside-the-thought-experiment is a complicated enough system that he can be nudged in all sorts of ways if you understand him, and this property is hard to patch out. But the o

2 · Seth Herd · 24d

Right. The thought experiment is interesting. I think you're using a somewhat different model of this AGI than I am. [...] In my model, this ASI doesn't so much have a manipulation detector as genuinely want to not manipulate you, by the definition of stuff you'd consider manipulative. Of course that has to cash out in a manipulation detector of some sort; but assuming it would route around that detector sounds like assuming an ASI can fool itself. Which, maybe? Of course, training any desire to not manipulate into it is tricky. That doesn't come for free from a value-aligned ASI. But it does pretty much come for free with an instruction-following corrigible ASI told "don't manipulate me by my own standards". Because it genuinely wants to follow instructions/be corrigible, now it wants to do that. The manipulation detector includes its full, considerable cognitive capacity. The problem with a value-aligned, purely RL-trained ASI like Steve is thinking of is that you've got to hope that your training actually put a desire-to-not-manipulate-by-your-lights as a higher priority than any of its other desires, or perhaps than all of them put together. That sounds tricky at best, and like an additional hurdle for successful value-aligned ASI. This pushes me back toward thinking that Instruction-following AGI is easier and more likely than value-aligned AGI. I had been doubting that it's really much easier or more likely, given Anthropic's efforts and relative success at value alignment.

I think this is overly committed to the hypothesis that the various concepts intuitively necessarily route through an imagined free will.

Take non-manipulation. I don't think I need to invoke non-causal free will, or even free will at all to describe that. In the deceptive case: H has some preferences. A has some preferences. A communicates falsehoods or partial truths which (predictably) cause H to have incorrect beliefs. H acts on those beliefs (in a way which furthers A's preferences).

Now, this also has the effect of making the activity in the world steer more for A's preferences than H's, at least locally. If you have an ontology where preferences arise from non-causal free will or whatever, then you might also say that A's free will subdued H's in that exchange, I guess.

Am I correctly engaging with what you're describing here?

6Steven Byrnes25d

I think that any particular example might not involve explicitly invoking free will, but it’s kinda the water we’re swimming in, and over a lifetime of thinking in those terms you wind up with a soup of connotations and associations that are kinda dovetailing with a free will worldview even if they’re not explicitly about that. Regardless, you ask whether this post is “overly committed” to that hypothesis. I hope it’s not! I tried to cover all the other connotations of “manipulation” that I could think of in §3–§4. Thus, your particular example with “H” and “A” seems to mostly lean on §3.4 (A is optimizing for H’s future preferences) and §4.2 (an observer sympathetic to H would see the interaction as bad). (It’s possible that I structured the post in a way that over-emphasized the free will thing, but I do think the free will thing is tricky and important and hence deserves disproportionate space in the text.)

2Oliver Sourbut25d

I guess in my takeaway from the post (haven't reread since yesterday), the thrust was like: non-deception, corrigibility etc probably don't have 'true names', because they're concepts fundamentally attached to a fake ontology. In particular, non-causal vitalistic free will sort of things. But actually I appear to be able to describe deception and other things in terms which don't route through that kind of ontology, at all, and models like that are how I and many of my interlocutors discuss such things. Curiously I don't think any entries in section 3 or 4 correspond to the description I gave - agree that 3.4 is maybe closest (and on a different day I might have used a case example closer to that). To me that refutes the paraphrased thrust above. I grant that there's something funny with 'intentional stance' at all, where in principle one can describe any instantiated agent in terms of lower-level mechanics. So I guess one might say that ontology is mistaken (I don't think you would)? But that's just like all of physics basically. Complex systems are like that. You get emergence and have to deal with it. I hold that beliefs and intentions and preferences are worthy concepts to build with (and I think you do too). Indeed various colloquial conceptions of agents are more mystical and incoherent. But I don't think that has bearing on whether there's a true name for the concepts discussed in more coherent terms.

4Steven Byrnes24d

Not exactly … What I had in mind overall was more like: the discussion of human intuitions in §2 “sets the stage” by (A) providing some brainstorming aid and grounding as we embark on the Quest for a True Name For Manipulation in §3–§4; and §2 also “sets the stage” by (B) seeding the idea that we should at least be open to the possibility of such a True Name not existing at all, rather than it being an obvious thing, like how I know that a sock exists even if I don’t know how to mathematically define it, because I have an intuitive notion of a sock and it’s all-but-certain that this intuitive notion can in principle be formalized into some real-world notion of sock-ness. Whereas for manipulation, maybe it can be construed as pointing to something real, or maybe it can’t, but I think it’s important to be in a mindset where this is not obvious, like it is for socks. Then the alignment-relevant meat of the post is §3–§4, along with further discussion in §5. Sorry if I wrote things up in a confusing way, but I’m not seeing an obvious way to improve it right now.

4Oliver Sourbut24d

I think I see. That's all good! I for one appreciate this clarification here.

3Oliver Sourbut25d

There's also a curious and troubling aspect of humans that we mingle the types of beliefs, intentions, preferences, and intrinsic vs instrumental stuff. We're also malleable which makes things like CEV potentially underdetermined. I'm not entirely sure what to do about that! Though I unironically anticipate that the right virtue orientation can help a lot with selecting 'acceptable enough' paths through that highly path-dependent space.

Another example is @jacob_cannell in Empowerment is (almost) All We Need (2022). To his credit, he raises the issue that human desires are manipulable in the section “Potential Cartesian Objections”. But then he waves this issue away in a brief sentence: “These cartesian objections are future relevant, but ultimately they don't matter much for AI safety because powerful AI systems - even those of human-level intelligence - will likely need to overcome these problems regardless.”

Well .. no - looking back on that post, the 'potential cartesian objections' ar...

Path-dependence of values is defeated with aggregation over the possible paths that should have a say in what the values should be. Aggregation over many possibilities takes place in an updateless view from before those possibilities diverge. What kinds of possibilities should contribute to defining values is determined by values. And the possibilities should perhaps be shaped with the aid of aggregated values, to channel their counsel.

This sets up an analogy between CEV and updateless decision making, where the updateless core is working to define values,...

My understanding is that ARC Theory has thought some about this topic and has proposals for avoiding manipulation. I think their only public output on this is in an appendix of the ELK report 'Avoiding subtle manipulation'. I recommend reading '“Narrow” elicitation and why it might be sufficient' and 'Indirect normativity: defining a utility function' first for context.

I think @Lukas Finnveden and maybe @Eric Neyman thought about this more after this point. (My recollection is that at least @Lukas Finnveden updated toward thinking the problem was harder an...

4Fabien Roger13d

My (low confidence) understanding of the proposal is something like: "The AI takes an action A if and only if {long-term future-self if the AI takes action a | a in the action space} on aggregate like A" where "long-term future self" is defined by some recursive process where you locally choose what entity counts as your near-term future self (it can be some other entity that you trust more - e.g. a future aligned AI), where these future selves all have access to an AI that honestly answers questions that are already meaningful to the human when the right answer would be unambiguous to someone who understood what was happening in the world, and where "on aggregate" is defined by whatever process you'd feel comfortable with when choosing how to delegate decisions over a set of future selves with different histories (e.g. it's fine to ignore future selves with crazy beliefs or that went through processes that are unambiguously not fine). (More precisely, the aggregation is local and can change over time, I wrote it as a single aggregation over the long-term future selves for simplicity.)

Worked examples:

* In the "super persuasive paperclip ad" from the ELK report, most of your future selves can ask the AI what the world would look like in the branch where you saw the ad, this is the sort of question where the AI is honest so it tells them that it is full of paperclips, almost all of your future selves hate it, thus the AI does not choose to show you the super persuasive paperclip ad
* If the AI has a big list of experiences that it can subject you to that would update your relative moral weight for cats vs dogs in ways the AI can anticipate and such that there is some path dependence that makes later experiences less effective such that there is no convergence if you just present all of them, and it has to choose which experiences to subject you to and in which order, then the AI doesn't have much leeway in how much to push you in the direction of cats vs dogs,
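A minimal sketch of the decision rule paraphrased above, applied to the paperclip-ad example. Everything here is an illustrative assumption rather than part of the ELK report: the `FutureSelf` class, the `endorsed_actions` function, the action names, and especially the use of a single majority vote as the "on aggregate" step (the comment itself says the real aggregation is local and can change over time).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FutureSelf:
    """Hypothetical stand-in for 'you, after the AI took some action'."""
    reached_via: str                 # the action that led to this self
    trusted: bool                    # False if the path there was unambiguously not fine
    approves: Callable[[str], bool]  # verdict after honestly learning what an action leads to


def endorsed_actions(actions: List[str], selves: List[FutureSelf],
                     threshold: float = 0.5) -> List[str]:
    """Return the actions that more than `threshold` of the trusted selves approve of."""
    voters = [s for s in selves if s.trusted]
    if not voters:
        return []
    return [a for a in actions
            if sum(s.approves(a) for s in voters) / len(voters) > threshold]


actions = ["show_paperclip_ad", "show_nothing", "show_news"]
selves = [
    # The self who saw the ad now loves paperclips and approves only of the ad.
    FutureSelf("show_paperclip_ad", True, lambda a: a == "show_paperclip_ad"),
    # The other selves are told the ad branch is full of paperclips and reject it.
    FutureSelf("show_nothing", True, lambda a: a != "show_paperclip_ad"),
    FutureSelf("show_news", True, lambda a: a != "show_paperclip_ad"),
]

print(endorsed_actions(actions, selves))
```

In this toy setup the ad is rejected because only the self it produced approves of it. The sketch says nothing about how the approval functions or the `trusted` flags would be obtained, which is where the hard part of the proposal lives.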

To me the most promising solution is to get the AI to not optimize for influencing people's beliefs (3.4) except in certain permitted (often myopic) ways that depend on the situation. Some candidate guidelines:

Early on, helping AI companies understand risks seems crucial and allowed.
When they help with development of the model spec, they can help people understand relevant considerations to the model spec, but should do so myopically and should not consider consequences downstream of those people's beliefs (e.g., on the model spec, on the people's actions)

...

On alternatives to brute consequentialism, I was intrigued by Grietzer's formulation of virtue as 'promote x x-ingly'. e.g. to be just is to promote justice justly, to be honest is to promote honesty honestly, ... See also. It's only a few pieces of a picture, but looked like a promising direction, to me.

4Steven Byrnes26d

I’m mildly skeptical that we can drop consequentialist preferences altogether (I mean, as first-class preferences, I’m not just talking about ‘consequentialist planning as a means-to-an-end for being helpful right now’ and similar). I don’t have any airtight proof, and I really have some uncertainty here, but FWIW I’m partly getting that from a general intuition that the people doing (and figuring out) important and novel things in the world, the kinds of things that would help with the rogue ASI problem, are people who really deeply care about simulacrum level 1 rather than 3, and you only get that from having consequentialist preferences as major first-class components of the motivation system. A different idea is that we can build top-level preferences out of a mix of consequentialist desires and non-consequentialist (e.g. virtue-ethics-y) desires. And yeah, that seems like the obvious “Plan A” to me. The question is whether it would actually work, see §1.2.

6Oliver Sourbut25d

I don't understand how simulacrum levels come into it [1]. Are you taking me to mean 'virtue signalling' when I say virtue? No! Adherence to honest and earnest communication, for example, is a commonly recognised virtue.

[1] FWIW I also have never got what is supposedly ordinal about the simulacrum levels beyond 1, the honest one. The other 'levels' just look like various orthogonal breeds of fakery, to me. Haven't scrutinised deeply.

Oh, hmm, good point, thanks. Let me try again:

When I think of humans who get difficult things done, or figure difficult things out, they tend to care about accomplishing those things, a lot, and in a direct and explicit way, not just e.g. as a facet of what kind of person they see themselves as. I mean, maybe “what kind of person I see myself as” has something to do with how they originally came to care about those things, but it’s not what they’re explicitly thinking about. They’re thinking directly about the object-level prize at the end of the journey, and how to get that prize.

E.g. plenty of climate change activists think of climate change activism as a good and virtuous thing to do, but I think the subset of climate change activists who are really moving the needle are the ones who are directly thinking about climate change being directly bad, and really want it to stop, and are focused directly on how to make that happen.

E.g. plenty of mathematicians think of math as a good and praiseworthy activity, but I think that the person who will solve the Riemann hypothesis will be a person who is (in addition to being smart etc.) really damn curious about why the Riemann hypothesis i...

3Steven Byrnes24d

(Off-topic but fun) I think it’s at least somewhat ordinal, e.g. Zvi’s “Level 1: Symbols describe reality. Level 2: Symbols pretend to describe reality. Level 3: Symbols pretend to pretend to describe reality. Level 4: Symbols need not pretend to describe reality.” See also Thane’s attempt.

2Oliver Sourbut24d

(Agree fun) These have looked post-hoc and overfitted to me, and the variety of apparent but incompatible explanations for the supposed ordering is a smell from my pov. Maybe I should look closer at some point.

Yet another example is @abramdemski’s “Vingean agency” (2022) (following earlier work by Yudkowsky). He starts from a place very close to the human intuitions in §2 above, i.e. “agency” is when you can predict the outcome but not the actions leading to those outcomes. Then he hints at an intriguing idea that maybe we can just make that formal! I.e., maybe I was too quick to dismiss that kind of thing above using terms like “messed-up ontology”. (As Abram writes: “I also think it's possible that Vingean agency can be extended to be ‘the’ definition of agenc

...

That suggests an approach where AIs simply learn to make a human-like gestalt assessment of what’s good vs bad (according to humans in general, and according to this particular human specifically), and then do the good things but not the bad things. Not manipulating people would, we imagine, naturally come along for the ride.
If we do something like that, I wouldn’t think of it as a solution to the True Name thing, but rather giving up on the True Name thing in favor of a different approach entirely, one relying more directly on (real or simulated) human ju

...

3Steven Byrnes25d

Yeah, exactly, when my shoulder-optimist argues his case, he brings up the idea that the virtue-ethics-y side of the AI would maybe notice that the AI is engaging in a systematic pattern of behavior that has the effect of gradually shifting human judgment and culture over time, and it would see that as bad, and it would vote against behaving in that way. But it also might not. That’s why I said (in §1.2 and §4.2) that I was unsure about how bad a problem this is. I’m kinda stuck on that right now, and not sure how to proceed, except to try to engineer some different solution that’s easier for me to reason about.

2RogerDearnaley23d

The virtue-ethics-y side of the AI would be correct that most people would consider this to be a case of the hedonium-maximizer side of the LLM tricking and manipulating them. Most people are aware of the fact that they would feel better but be worse if they took heroin or wireheaded, and don't take heroin or wirehead, or approve of other people trying to trick them into it. Evolutionarily, this looks like a case of humans doing the right thing.

Max’s stopgap plan would involve comparing the human’s values to what they’d be if the AI did nothing. I think he understates how bad that stopgap plan is. Even providing straightforwardly-true factual information can change what a person wants, right?

Yes. That's right. And I am (among other things) worried about an AI that warps my values by telling me a series of facts.

But I want to clarify that I'm talking about terminal values, not strategic sub-goals. The stopgap plan is 100% able to tell me that the store is closed, thus changing my plan of going to ...

3Max Harms25d

Oh, uh. Whoops. Forgot to switch to my work account.

4Steven Byrnes25d

Hmm. I guess when I wrote that part, I was imagining a kind of dichotomy, where one branch is “don’t change what the human is trying to do right now” (and then the AI wouldn’t say that the store is closed), and the other branch is “don’t change what the human would want after infinite ideal reflection” (and then I don’t know how to install that motivation into the particular AI architecture I have in mind). I guess you’re saying that that’s a false dichotomy, because there’s a middle ground in between those? Have you written more about what constitutes a “terminal goal” in your view? (Even if you don’t have a rigorous definition, I’m interested in examples or intuitions.) Thanks.

3Max Harms24d

I haven't written at length about the distinction between terminal and instrumental goals myself (there's a bit at the start of CAST, but I don't belabor it), but I think Eliezer did a good job in 2007. In my own words, I would say that it makes some sense to divide the planning system of the mind into a portion that is a model of the world, where it makes sense to talk about truth and so on, and another section of the mind that is about judging the desirability of various potential world states and/or trajectories. That second portion (or an important component of it) is what I would call the "values" of the agent, and when the values are put in contact with concrete outcomes that are judged highly compared to others, I would call them terminal goals. Instrumental goals are then constructed as a second-order operation on top of one or more terminal goals (and the dynamics of the world model), so that we can shortcut planning as a question of how to first get the instrumental goal so that we can later move from that state to the terminal goal. As a concrete example, I wanted to go home after work last night (which is itself an instrumental goal in the service of many other terminal goals, such as comfort, but which we can treat as terminal). I planned to drive in my car through a small town to get home, and thus steered my car towards the town, because "get to the town" was instrumental to my (more) terminal goal. As I approached, I found out that there had been an accident and that the road was closed. If my world model had included this fact, I would not have identified "get to the town" as an instrumental goal. Once I was aware of it, I changed my plan so that I drove down a country detour that went around the town. I do not consider learning about the accident to have changed my values or the way that I judged outcomes. Instead, it changed my plan. Does that make sense?
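The driving example above can be sketched as a toy planner, to make the division of labor concrete: the road map plays the role of the world model, the destination plays the role of the (treated-as-)terminal goal, and the intermediate stops in the plan are the instrumental goals. The place names, the `plan` function, and the use of breadth-first search are all illustrative assumptions, not anything from CAST.

```python
from collections import deque


def plan(roads, start, goal):
    """Breadth-first search over the world model; returns a shortest route or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(roads.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# World model: which places connect to which (hypothetical map).
roads = {
    "work": {"town", "country_road"},
    "town": {"home"},
    "country_road": {"detour_junction"},
    "detour_junction": {"home"},
}
terminal_goal = "home"  # what is valued; never edited below

before = plan(roads, "work", terminal_goal)

# Learning about the accident changes the world model only:
# the road out of town toward home is closed.
roads_after = {place: set(nbrs) for place, nbrs in roads.items()}
roads_after["town"].discard("home")
after = plan(roads_after, "work", terminal_goal)

# Instrumental goals are the intermediate stops of whatever plan is current.
print(before[1:-1], "->", after[1:-1])
```

Here the new fact replaces the instrumental goal "town" with the detour stops, while `terminal_goal` is untouched. The sketch leaves open the harder question raised in this thread, namely how to tell which of a real person's goals belong in `terminal_goal` in the first place.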

4Steven Byrnes5d

Thanks. I just edited the OP to say that my original text might be an overstatement. I still think the stopgap plan doesn’t help me-in-particular, because I’m working on how to install goals in brain-like AGIs, and I have ideas that seem promising but only work for a limited number of goals (they kinda have to be simple, concrete, “atomic”, and/or directly related to people’s feelings, and/or have a ground truth that can be calculated explicitly, more-or-less). This thing we’re talking about here (involving a distinction between the supervisor’s instrumental vs terminal goals) is pretty complex and abstract, and not something I have any good idea of how to install as a goal / motivation, alas. LLMs are pretty different, no comment on that.

If engineering grade solutions to the problem are acceptable, a design for nanotech or biotech that vastly increases the amount of conventionally controllable matter/energy and doesn't come with a backdoor that instantiates another intelligence seems like straight empowerment.

The messy dilemma only comes up when elements of decision-making are outsourced to AI. Tooling that makes a specific baseline human comprehensible course of action possible in the first place only gives the AI control in that the AI can limit the performance of that tooling and so pred...

The virtue-ethics-y motivation just seems more squishy and slippery than the consequentialist desire, especially when it routes through manipulable human desires, such that I’m worried it will not be an adequate bulwark against ruthless consequentialism.

Preregistered my prior before evaluating it: I suspect that this is the crux of any disagreement I'll have with your article (from the point at which it was read; this comment is currently a midstage draft). (I was right this time!)

Most of the humans whom I've seen put forward as moral and ethical exemplars...

5Steven Byrnes5d

Most of your comment seems specific to LLMs, and I don’t work on those, so no opinion. [...] This might be tangential to your larger point, but based on your list of examples, I think you (like most people) are implicitly using virtue-ethics as a rubric to judge which humans are most praiseworthy. So it’s no surprise that the winners are generally acting out of virtue-ethics. By contrast, if you ask a utilitarian which humans are most praiseworthy, they would be less likely to mention the foster-parents etc., and much more likely to mention, like, Norman Borlaug, Bill Gates, these people, etc. And I would guess that those latter people would be somewhat more consequentialist-utilitarian than average in how they choose their actions. (That’s just a guess, I don’t know much about most of them, except that I watched a biopic of Bill Gates once and he didn’t come across as extremely stereotypically virtuous.) (I’m making a narrow point that you used a circular argument, I am not trying to imply here that AIs should or shouldn’t be virtuous. But see this comment.)

But alas, I currently think the most important use-case for AGI is figuring out true important things about the world (esp. related to ASI alignment and strategy) and explaining those things to the human. For this process to be effective, we cannot have an AGI that’s unconcerned with what we wind up believing after the discussion—that’s a recipe for slop-and-doom, or just an AI that’s incomprehensible and unhelpful. Rather, I want the AI to be like a disagreeable nerd that wants us to have a good understanding, notices areas where we’re confused, and is br

...

Our intuitive notion of being manipulated is related to a person (call him Ahmed) taking an action A with the property that, in our intuitive causal world-models, the chain-of-causation leading to A does not ultimately trace back to the acausal force of Ahmed’s free will, but rather to the free will of some third-party who manipulated Ahmed.
(For example, if Bob deceives me about what a button does, and then I press the button, then our intuitive conceptualization of the situation says that the button was pressed ultimately because of Bob’s acausal free wil

...

On the topic of distinguishing between Counsel and Manipulation, without going directly through whether the resulting beliefs are accurate or not (because that's adequately covered, and I think the observation that you can manipulate through selective truths somewhat defeats it: given someone who's already wrong or who has underspecified preferences, you can probably manipulate their actions while only making their beliefs more correct), I propose that one (and maybe the) distinction between Counsel and Manipulation is whether the outcome of the action depends on the person being spoken to.

That is: If the AI wants and manipulates for Pink Ponies, then it will tailor its arguments to make sure, no matter who it is speaking to, they will conclude the right answer is pink ponies. We do not need to know who the AI is talking to; maybe they hate horses. Maybe they're obsessed with manly colors. Regardless, we know the outcome will end up being pink ponies. The AI may need to simulate the target in detail to manipulate them, but the AI can predict where the conversation will go without doing the simulation.

If the AI is counselling, then it will not do this. It might tell the target true ...
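One hedged way to make "the outcome depends on the person being spoken to" concrete is to run the same speaker against a varied population of listeners and measure how much the final conclusion varies. The sketch below uses Shannon entropy of the conclusions as that measure. The `outcome_entropy` function, the two toy speakers, and the listener records are all hypothetical; the comment above is truncated, so this may not match the full proposal.

```python
from collections import Counter
from math import log2
from typing import Callable, Dict, List

Listener = Dict[str, str]


def outcome_entropy(speaker: Callable[[Listener], str],
                    listeners: List[Listener]) -> float:
    """Entropy (in bits) of the conclusions a speaker produces across listeners."""
    conclusions = [speaker(listener) for listener in listeners]
    n = len(conclusions)
    return sum((c / n) * log2(n / c) for c in Counter(conclusions).values())


def manipulator(listener: Listener) -> str:
    # Tailors its arguments, but every conversation ends in the same place.
    return "pink ponies"


def counsellor(listener: Listener) -> str:
    # Toy assumption: good counsel lands on what this listener cares about.
    return listener["cares_about"]


listeners = [
    {"cares_about": "pink ponies"},
    {"cares_about": "blue trucks"},
    {"cares_about": "blue trucks"},
    {"cares_about": "green bikes"},
]

print(outcome_entropy(manipulator, listeners))
print(outcome_entropy(counsellor, listeners))
```

Zero entropy across a diverse audience is the signature of the pink-ponies manipulator. The measure is only suggestive, though: an honest counsellor facing an audience that all wants the same thing would also score zero, so the population of listeners has to be varied for the comparison to mean anything.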

This was a fascinating read, which brought me back to my first year of Political Science and discussions on power and influence. This text resonates a lot with the 3 dimensions/faces of power by Steven Lukes (in Power: A Radical View):

Direct Influence: A makes B do something they would not otherwise do
Agenda-Setting: decide which issue gets to be discussed or ignored and, often, an elite benefits from the status quo, so they tend to prevent any paradigm-shifting change to arise in political discussions
Opinion Shaping: manipulating people's desires in order

...

This was excellently written and helpful. I am developing a series of posts on a framing problem which underlies current alignment research, and my point connects to what you are observing here. (First part here). I would be very interested in your input on those posts. That said, those posts are not intended to look directly at the "messed-up ontology" you describe concerning empowerment, agency, manipulation, corrigibility, and so on, and I want to comment on this, because I think you are making an important observation.

Free Will

The framework you are re...

Rather, I want the AI to be like a disagreeable nerd that wants us to have a good understanding, notices areas where we’re confused, and is brainstorming and strategizing on how to help set us straight by improving its clarity and pedagogy.

It seems like it is difficult to distinguish between the disagreeable nerd described above and the same exact type of person, but who's trying to get you to have a good understanding of a conspiracy theory or religion (ie. something that they believe to be true, but is not). The conspiracy theorist or religious true beli...

More precisely: If there are deterministic upstream explanations of what the Active Self is doing and why, e.g. via algorithmic or other mechanisms happening under the hood, then that feels like a complete undermining of one’s free will and agency. And if there are probabilistic upstream explanations of what the Active Self is doing and why, e.g. “if my stomach is empty, then I’ll start wanting food”, then that correspondingly feels like a partial undermining of free will and agency, in proportion to how confident those predictions are. For example, I migh

...

I totally agree that everything flounders on the messed up idea of free will. My "solution" is to abandon alignment etc. altogether and instead focus on limiting the possible damage: I don't see how, with messed up ontology, we can ever guarantee that, given enough time, an ASI won't tile the universe with hedonium - so why not focus on restricting the time? I'm pretty sure the ASI can't do it in only a few years so, if we can set it a time limit of, say, 5 years and have it not value anything beyond that window, then we "could" be safe. This fits in with the cur...

I think the idea of "preserving human agency" in a world with AGI is even less tractable than you're gesturing at, but in a structural way. Imagine having a trusted translator navigate a bus full of strangers on a trip in Italy. I don't think there's a "true" trip that captures all reflexively preferred destinations that can be "discovered" (considering the country is large, and your trip is very short). Now imagine the translator reading all of the passengers' minds. There's a lot of path dependence in what the translator prefers and thinks the group valu...

So why can’t “agency” be tied to imperfect modeling ability too?

If we apply an analogy with temperature, then how similar is free will to an LLM's temperature causing the LLM to output slightly different token sequences which, however, convey a similar meaning? How plausible is it that similar analogies improve our understanding of humans' agency, empowerment and other important concepts?

6Charlie Steiner26d

If I'm understanding you right, this is sort of similar to the analogous question for humans - how similar is my free will to the fact that my neurons have a temperature, so for every action I take I have some large probability of doing something slightly different, and a small probability of doing something very different. Personally, I don't associate that fact much with "free will" - it has the freedom, but not the will! Free will, to me, is about doing things because I want. That is, when the most important explanatory story for why I take an action ends in my own psychological state (rather than in my environment or in someone else's machinations), that's when I'm being most free-willed. An explanation based on the temperature of my neurons is in the right physical location but lacks relevance to my psychology.

(I wrote this in reply to a draft; apologies if the post has been load-bearingly updated since then.)

I interpret you as saying:

(Likewise with the painting example: when looking at little patches of a large complex painting, people will totally lose track of context and overlook inconsistencies.)

It really doesn't seem like humans "keep their eye on the ball" all the time, even in the large majority of day-to-day cases where cognition basically works.


However, I will consider plans like "tie a string around my finger so I remember to buy food" or "set a reminder".


This AGI would not face anything remotely analogous to the conundrum that humans don’t really know what they want for the long-term future, and are figuring it out, and when they say ‘manipulation is bad’ they are expressing some hard-to-pin-down preference about how this process of self-discovery plays out.

Thanks a lot for writing this.


I think you're correctly identifying important issues and cracks in the standard ontology, but I think you're throwing out too much baby in an effort to get rid of bathwater.


I think this is overly committed to the hypothesis that the various concepts intuitively necessarily route through an imagined free will.

Am I correctly engaging with what you're describing here?

6Steven Byrnes25d

2Oliver Sourbut25d

4Steven Byrnes24d

4Oliver Sourbut24d

I think I see. That's all good! I for one appreciate this clarification here.

3Oliver Sourbut25d

Another example is @jacob_cannell in Empowerment is (almost) All We Need (2022). To his credit, he raises the issue that human desires are manipulable in the section “Potential Cartesian Objections”. But then he waves this issue away in a brief sentence: “These cartesian objections are future relevant, but ultimately they don't matter much for AI safety because powerful AI systems - even those of human-level intelligence - will likely need to overcome these problems regardless.”

Well .. no - looking back on that post, the 'potential cartesian objections' ar... (read more)

This sets up an analogy between CEV and updateless decision making, where the updateless core is working to define values,... (read more)

4Fabien Roger13d

Early on, helping AI companies understand risks seems crucial and allowed.
When they help with development of the model spec, they can help people understand relevant considerations to the model spec, but should do so myopically and should not consider consequences downstream of those people's beliefs (e.g., on the model spec, on the people's actions)

... (read more)

4Steven Byrnes26d

6Oliver Sourbut25d

Oh, hmm, good point, thanks. Let me try again:

3Steven Byrnes24d

2Oliver Sourbut24d

Yet another example is @abramdemski’s “Vingean agency” (2022) (following earlier work by Yudkowsky). He starts from a place very close to the human intuitions in §2 above, i.e. “agency” is when you can predict the outcome but not the actions leading to those outcomes. Then he hints at an intriguing idea that maybe we can just make that formal! I.e., maybe I was too quick to dismiss that kind of thing above using terms like “messed-up ontology”. (As Abram writes: “I also think it's possible that Vingean agency can be extended to be ‘the’ definition of agenc

... (read more)

That suggests an approach where AIs simply learn to make a human-like gestalt assessment of what’s good vs bad (according to humans in general, and according to this particular human specifically), and then do the good things but not the bad things. Not manipulating people would, we imagine, naturally come along for the ride.
If we do something like that, I wouldn’t think of it as a solution to the True Name thing, but rather giving up on the True Name thing in favor of a different approach entirely, one relying more directly on (real or simulated) human ju

...

2RogerDearnaley23d

Max’s stopgap plan would involve comparing the human’s values to what they’d be if the AI did nothing. I think he understates how bad that stopgap plan is. Even providing straightforwardly-true factual information can change what a person wants, right?

Yes. That's right. And I am (among other things) worried about an AI that warps my values by telling me a series of facts.

3Max Harms25d

Oh, uh. Whoops. Forgot to switch to my work account.

4Steven Byrnes5d

The virtue-ethics-y motivation just seems more squishy and slippery than the consequentialist desire, especially when it routes through manipulable human desires, such that I’m worried it will not be an adequate bulwark against ruthless consequentialism.

5Steven Byrnes5d

But alas, I currently think the most important use-case for AGI is figuring out true important things about the world (esp. related to ASI alignment and strategy) and explaining those things to the human. For this process to be effective, we cannot have an AGI that’s unconcerned with what we wind up believing after the discussion—that’s a recipe for slop-and-doom, or just an AI that’s incomprehensible and unhelpful. Rather, I want the AI to be like a disagreeable nerd that wants us to have a good understanding, notices areas where we’re confused, and is br

...

Our intuitive notion of being manipulated is related to a person (call him Ahmed) taking an action A with the property that, in our intuitive causal world-models, the chain-of-causation leading to A does not ultimately trace back to the acausal force of Ahmed’s free will, but rather to the free will of some third-party who manipulated Ahmed.
(For example, if Bob deceives me about what a button does, and then I press the button, then our intuitive conceptualization of the situation says that the button was pressed ultimately because of Bob’s acausal free wil

...

If the AI is counselling, then it will not do this. It might tell the target true ...

Direct Influence: A makes B do something they would not otherwise do
Agenda-Setting: deciding which issues get discussed or ignored; often an elite benefits from the status quo, so it tends to prevent any paradigm-shifting change from arising in political discussions
Opinion Shaping: manipulating people's desires in order

...

Free Will

The framework you are re...

Rather, I want the AI to be like a disagreeable nerd that wants us to have a good understanding, notices areas where we’re confused, and is brainstorming and strategizing on how to help set us straight by improving its clarity and pedagogy.

More precisely: If there are deterministic upstream explanations of what the Active Self is doing and why, e.g. via algorithmic or other mechanisms happening under the hood, then that feels like a complete undermining of one’s free will and agency. And if there are probabilistic upstream explanations of what the Active Self is doing and why, e.g. “if my stomach is empty, then I’ll start wanting food”, then that correspondingly feels like a partial undermining of free will and agency, in proportion to how confident those predictions are. For example, I migh

...

So why can’t “agency” be tied to imperfect modeling ability too?

1.1 Tl;dr

1.2 Bigger-picture context: why is this issue so important to me?

2. How do humans intuitively define empowerment, agency, manipulation, etc.?

2.1 Background: human “free will” intuitions

2.2 Our free-will-infused intuitive notions of empowerment, agency, manipulation, corrigibility, responsibility, etc.

2.3 Another dimension: “counsel” vs “manipulation” as an emotive conjugation

3. If the intuitive definitions of “manipulation” etc. reside in a messed-up ontology, has the alignment literature found any alternative, better way to define these concepts?

3.1 Compare what the human wants to what the human would want under the null policy?

3.2 The AI learns self-empowerment and generalizes to other-empowerment?

3.3 “Vingean agency”?

3.4 The AI doesn’t care about (is not optimizing for) what the human winds up wanting?

3.5 Impact minimization?

3.6 Attainable utility preservation?

4. Even more ideas (that don’t really solve my problem)

4.1 Game theory and incentive design?

4.2 The person’s judgments of what kinds of interactions are good vs bad?

4.3 “It’s a messed-up ontology, but who cares?”

5. …But doesn’t this analysis equally “disprove” the possibility of human helpfulness?

6. Conclusion
