Value Formation: An Overarching Model

I think this post is very thoughtful, with admirable attempts at formalization and several interesting insights sprinkled throughout. I think you are addressing real questions, including:

Why do people wonder why they 'really' did something?
How and when do shards generalize beyond contextual reflex behaviors into goals?
To what extent will heuristics/shards be legible / written in "similar formats"?

That said, I think some of your answers and conclusions are off/wrong:

You rely a lot on selection-level reasoning in a way which feels sketchy.
I doubt your conclusions about GPS optimizing activations directly, as a terminal end and not as yet another tactic.
I doubt the assumptions on GPS being a minimizer, or goals being minimize-distance (although you claimed in another thread this isn't crucial?)
I don't see why you think heuristics (shards?) "lose control" to GPS.
I don't see why you think the value-humans shard has to be perfectly aligned.

Overall, nice work, strong up, medium disagree. :)

[heuristics are] statements of the following form: "if you take such action in such situation, this will correlate with higher reward".

I think that heuristics are reflections of historical facts of that form, but not statements themselves.

But these tendencies were put there by the selection process because the correlations are valid.

In a certain set of historical reward-attainment situations, perhaps (because this depends on the learning alg being good, but I'm happy to assume that). Not in general, of course.

a) The World-Model. Initially, there wouldn't have been a unified world-model. Each individual heuristic would've learned some part of the environment structure it cared about, but it wouldn't have pooled the knowledge with the other heuristics. A cat-detecting circuit would've learned how a cat looks like, a mouse-detecting one how mice do, but there wouldn't have been a communally shared "here's how different animals look" repository.
However, everything is correlated with everything else (the presence of a cat impacts the probability of the presence of a mouse), so pooling all information together would've resulted in improved predictive accuracy. Hence, the agent would've eventually converged towards an explicit world-model.

What is the difference, on your view, between a WM which is "explicit" and one which e.g. has an outgoing connection from is-cat circuit to is-animal?

b) Cross-Heuristic Communication.

I really like the insight in this point. I'd strong-up a post containing this, alone.

Anything Else? So far as I can tell now, that's it. Crucially, under this model, there doesn't seem to be any pressure for heuristics to make themselves legible in any other way. No summaries of how they work, no consistent formats, nothing.

If the agent is doing SSL on its future observations and (a subset of its) recurrent state activations, then the learning process would presumably train the network to reflectively predict its own future heuristic-firings, so as to e.g. not be surprised by going near donuts and then stopping to stare at them (instead of the nominally "agreed-upon" plan of "just exit the grocery store").

Furthermore, there should be some consistent formatting since the heuristics are functions $h : M_{s} \to A_{s}$. And under certain "simplicity pressures/priors", heuristics may reuse each other's deliberative machinery (this is part of how I think the GPS forms). E.g., there shouldn't be five heuristics each of which slightly differently computes whether the other side of the room is reachable.
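To make the shared-machinery point concrete, here is a minimal sketch, assuming heuristics really are plain functions from a world-model slice to an action slice. All names (`WMSlice`, `far_side_reachable`, the dictionary keys, the 10-metre cutoff) are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict

# Hypothetical stand-ins for M_s (a world-model slice) and A_s (an action slice).
WMSlice = Dict[str, Any]
Action = str


def far_side_reachable(wm: WMSlice) -> bool:
    """Shared deliberative machinery: written once, reused by every heuristic."""
    path_clear = bool(wm.get("path_clear", False))
    return path_clear and wm.get("distance_m", float("inf")) <= 10.0


def donut_heuristic(wm: WMSlice) -> Action:
    # h : M_s -> A_s, reusing the shared reachability check.
    if wm.get("donut_across_room", False) and far_side_reachable(wm):
        return "approach_donut"
    return "no_op"


def exit_heuristic(wm: WMSlice) -> Action:
    # Same signature, same helper, different trigger and output.
    if wm.get("exit_across_room", False) and far_side_reachable(wm):
        return "walk_to_exit"
    return "no_op"


wm = {"donut_across_room": True, "path_clear": True, "distance_m": 4.0}
print(donut_heuristic(wm), exit_heuristic(wm))
```

Both heuristics share one signature and one reachability computation, which is the kind of consistency a simplicity prior might plausibly favor; nothing here says real networks factor this way.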

That's very much non-ideal. The GPS still can't access the non-explicit knowledge — it basically only gets hunches about it.
So, what does it do? Starts reverse-engineering it. It's a general-purpose problem-solver, after all — it can understand the problem specification of this, given a sufficiently rich world-model, and then solve it. In fact, it'll probably be encouraged to do this.

I'm trying to imagine a concrete story here. I don't know what this means.

The second would plausibly be faster, the same way deception is favoured relative to alignment ^[4].

I don't positively buy reasoning about whether "deceptive alignment" is probable, on how others use the term. I'd have to revisit it, since it's on my very long list of alignment reasoning downstream of AFAICT-incorrect premises or reliant on extremely handwavy, vague, and leaky "selection"-based reasoning.

we might imagine a heuristic centered around "chess", optimized for winning chess games. When active, it would query the world-model, extract only the data relevant to the current game of chess, and compute the appropriate move using these data only.

Just one heuristic for all of chess?

Consider this situation:

I wish this were an actual situation, not an "example" which is syntactic. This would save a lot of work for the reader and possibly help you improve your own models.

That'll work... if it had infinite time to think, and could excavate all the procedural and implicit knowledge prior to taking any action. But what if it needs to do both in lockstep?

(Flagging that this makes syntactic sense but I can't actually give an example easily, what it means to "excavate" the procedural and implicit knowledge.)

This makes the combination of all contextual goals, let's call it $G_{Σ}$

Can you give me an example of what this means for a network which has a diamond-shard and an ice-cream-eating shard?

Prior to the GPS' appearance, the agent was figuratively pursuing $B_{Σ}$

Don't you mean "figuratively pursuing $G_{Σ}$ "? How would one "pursue" contextual behaviors?

So the interim objective can be at least as bad as $B_{Σ}$ .

Flag that I wish you would write this as "during additional training, the interim model performance can be at least as U-unperformant as the contextual behaviors." I think "bad" leads people to conflate "bad for us" with "bad for the agent" with "low-performance under formal loss criterion" with something else. I think these conflations are made quite often in alignment writing.

Prior to the GPS' appearance, the agent was figuratively pursuing $B_{Σ}$ ("figuratively" because it wasn't an optimizer, just an optimized). So the interim objective can be at least as bad as $B_{Σ}$ . On the other hand, pursuing $G_{Σ}$ directly would probably be an improvement, as we wouldn't have to go through two layers of proxies.

Example? At this point I feel like I've gotten off your train; you seem to be assuming a lot of weird-seeming structure and "pressures", I don't understand what's happening or what experiences I should or shouldn't anticipate. I'm worried that it feels like most of my reasoning is now syntactic.

The obvious solution is obvious: make heuristics themselves control the GPS. The GPS' API is pretty simple, and depending on the complexity of the cross-heuristic communication channel, it might be simple enough to re-purpose its data formats for controlling the GPS.

I think that heuristics controlling GPS-machinery is probably where the GPS comes from to begin with, so this step doesn't seem necessary.

Once that's done, the heuristics can make it solve tasks for them, and become more effective at achieving $B_{Σ}$ (as this will give them better ability to runtime-adapt to unfamiliar circumstances, without waiting for the SGD/evolution to catch them up).

Same objection as above -- to "achieve" $B_{Σ}$ ? How do you "achieve" behaviors? And, what, this would happen how? What part of training are we in? What is happening in this story, is SGD optimizing the agent to be runtime-adaptive, or..?

At the limit of optimality, everything wants to be a wrapper-mind.

Strong disagree.

I don't think this is what the coherence theorems imply. I think explaining my perspective here would be a lot of work, but I can maybe say helpful things like "utility is more like a contextual yardstick governing tradeoffs between eg ice cream eating opportunities and diamond production opportunities, and less like an optimization target which the agent globally and universally optimizes."
I am also worried about reasoning like "smart agents -> coherent over value-relevant -> optimizing a 'utility function' -> argmax on utility functions is scary (does anyone remember AIXI?)", when really the last step is invalid.
AFAICT I agree wrapper-minds are inefficient (seems like a point against them?).
I don't know why GPS should control reverse-engineering, rather than there being generalized shards driving GPS.
I think "internalize a system of norms" is not how people's caring works in bulk, and doesn't address the larger commonly-activated planning-steering shards I expect to translate robustly across environments (like "go home", "make people happy"). I agree there is a Shard Generalization Question, but I don't think "wrapper mind" is a plausible answer to it.

The GPS can recover all of these mechanics, and then just treat the sum of all "activation strengths" as negative utility to minimize-in-expectation.

Seems like assuming "activation strengths increase the further WM values are from target values" leads us to this bizarre GPS goal. While that proposition may be true as a tendency, I don't see why it should be true in any strict sense, or if you believe that, or whether the analysis hinges on it?
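For concreteness, one way to write down the objective under discussion, in assumed notation that is not taken from the post: let $a_i(s)$ be the activation strength of shard $i$ in world-model state $s$, and let $\pi$ range over plans. The quoted GPS goal would then be roughly

$$\pi^{*} = \arg\min_{\pi} \; \mathbb{E}_{s \sim p(\cdot \mid \pi)} \Big[ \sum_{i} a_i(s) \Big],$$

and the assumption being questioned is that each $a_i$ is strictly increasing in the distance to some target state $s_i^{*}$,

$$a_i(s) = f_i\big(d(s, s_i^{*})\big), \qquad f_i \text{ increasing},$$

rather than merely tending to grow with that distance.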

In short, the same way it's non-trivial to know what heuristics/instincts are built into your mind, it's non-trivial to know what you're currently thinking of.

Aside: I think self-awareness arises from elsewhere in the shard ecosystem.

One issue is that the value-humans shard would need to be perfectly aligned with human values, and that's most of this approach's promised advantage gone. That's not much of an issue, though: I think we'd need to do that in any workable approach.

What? Why? Why would a value-human shard bid for plans generated via GPS which involve people dying? (I think I have this objection because I don't buy/understand your story for how GPS "rederives" values into some alien wrapper object.)

Is there any difference between "goals" and "values"? I've used the terms basically interchangeably in this post, but it might make sense to assign them to things of different types.

I use "values" to be decision-influence, and "goal" as, among other things, an instrumental subgoal in the planning process which is relevant to one or more values (e.g. hang out with friends more as relevant to a friend-shard).

Other points:

I wish the nomenclature had been clearer, with $M_{s}$ being replaced by e.g. ${WM}_{subset}$.
I think "U" is a bad name for the policy-gradient-providing function (aka reward function).

Thane Ruthenis (3y)

Thanks for the extensive commentary! Here's an... unreasonably extensive response.

what it means to "excavate" the procedural and implicit knowledge

On Procedural Knowledge

1) Suppose that you have a shard that looks for a set of conditions like "it's night AND I'm resting in an unfamiliar location in a forest AND there was a series of crunching sounds nearby". If they're satisfied, it raises an alarm, and forms and bids for plans to look in the direction of the noises and get ready for a fight.

That's procedural knowledge: none of that is happening at the level of conscious understanding, you're just suddenly alarmed and urged to be on guard, without necessarily understanding why. Most of the computations are internal to the shard, understood by no other part of the agent.

You can "excavate" this knowledge by reflecting on what happened: that you heard these noises in these circumstances, and some process in you responded. Then you can look at what happened afterward (e. g., you were attacked by an animal), and realize that this process helped you. This would allow you to explicate the procedural knowledge into a conscious heuristic ("beware of sound-patterns like this at night, get ready if you hear them"), which you put in the world-model and can then consciously access.

That "conscious access" would allow you to employ the knowledge much more fluidly, such as by:

Incorporating it in plans in advance. (You can know to ensure there are no sources of natural noise around your camp, like waterfalls, because you'd know that being able to hear your surroundings is important.)
Transferring it to others. (Telling this heuristic to your child, who didn't yet learn the procedural-knowledge shard itself.)
Generalizing from it. (Translate it by analogy to an alien environment where you have to "listen" to magnetic fields instead. Or to even more abstract environments, like bureaucratic conflicts, where there's something "like" being in a forest at night (situation-of-uncertain-safety) and "like" hearing crunching noises nearby (subtle-predictors-of-an-impending-attack).)

None of that fluidity, on my understanding, would be easily replicable by the initial shard. If you're planning in advance, or are teaching someone, it'd only activate if you vividly imagine the specific scenario that'd activate it ("I'm in my camp at night and there's this noise"), which (1) you may not know to do to begin with, and (2) is an excruciatingly slow style of planning. And non-obvious logical generalizations are certainly not something it can do.

If you have that knowledge explicitly, though, you can just connect it to a node like "how to survive in a forest", and it'd be brought to your attention every time you poke that node.
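A toy sketch of the contrast, under loud assumptions: the shard is modeled as an opaque trigger function, and the world-model as a dictionary of topic nodes holding explicit rules. The names `crunching_noise_shard`, `explicate`, and `consult` are hypothetical illustrations, not anything defined in the post.

```python
from typing import Dict, List, Optional


def crunching_noise_shard(context: Dict[str, bool]) -> Optional[str]:
    """Procedural knowledge: an opaque trigger that fires only in the live context."""
    if (context.get("night") and context.get("unfamiliar_forest")
            and context.get("crunching_sounds")):
        return "be_on_guard"
    return None


# Explicit knowledge: rules filed under world-model nodes (hypothetical layout).
world_model: Dict[str, List[str]] = {}


def explicate(node: str, rule: str) -> None:
    """Attach an explicit, consciously accessible rule to a world-model node."""
    world_model.setdefault(node, []).append(rule)


def consult(node: str) -> List[str]:
    """Poking a node brings every rule attached to it to attention."""
    return world_model.get(node, [])


explicate("how to survive in a forest",
          "beware of crunching sounds at night; get ready if you hear them")

planning_context = {"planning_trip": True}       # not the shard's trigger context
print(crunching_noise_shard(planning_context))   # the shard stays silent
print(consult("how to survive in a forest"))     # the explicit rule is retrieved
```

The shard contributes nothing while planning at home, whereas the explicated rule is available whenever the relevant node is consulted.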

2) Also, in a different thread, you note that the predictions generated by the world-model can sometimes also be hard to make sense of, so maybe it's not consistently-formatted either. I think what's happening, there, is that when you imagine concrete scenarios, you're not using just the world-model: you're actually "spoofing" the mental context of that scenario, and that can cause your shards to activate as if it were really happening. That allows you to make use of your procedural knowledge without actually being in the situation, and so make better predictions without consciously understanding why you're making them.

(E. g., the weird-noises-at-night shard puts simulated!you on high alert, and your WM conditions on that, and that makes it consider "is going to be attacked" more likely. So now it's predicting so, and it's a more accurate prediction than it would've been able to make with just the explicit knowledge, but it doesn't know why exactly it ended up in this state.)

But none of that makes that procedural knowledge explicit! (Though such simulated counterfactuals are a great way to reverse-engineer it. See: thought experiments to access and reverse-engineer morality heuristics.)

3) Also something worth noting: explicit knowledge can loop non-explicit procedural knowledge in! E. g., you can have an explicit heuristic like "if you're in a situation like this, listen to your instincts and do what they say". That's also entirely in-line with my model: you can know to do the things your shards urge you to, even if you don't know why. And yet, knowing that a black-box is useful isn't the same as knowing what's in it.

(I suppose my definition is kind of circular, here: I'm saying that the world-model is only the thing that's consciously accessible and consistently-formatted. That's... Yeah, I think I'll bite that bullet.)

On Implicit Knowledge

Here, it's "implicit" that you should be complying with the urge to engage in the contextual behavior = "if you heard weird noises in the forest at night, be on guard". The question to answer here is: why? Why does it make sense to be on guard in such circumstances?

There are several ways to explain it, but let's go with "because it decreases the chance that a predator could take me by surprise, which is (apparently) something I don't want to happen". That's the implicit contextual goal $G$ here.

Explicating it, and setting it as the plan-making target ("how can I ensure I'm not ambushed?"), can allow you to consciously generate a bunch of other heuristics for achieving it. Like looking out for weird smells as well, or soundless but visible disturbance in the tall grass around you, etc. This, likewise, boosts your ability to generalize: both in the environment you're in, and even if you end up displaced to e. g. an alien environment.

I also refer you to my previous example of a displaced value-child. Although his study-promoting shards end up inapplicable, he can nonetheless remain studious if he has "be studious" as an explicit goal, in the course of optimizing for which he can re-derive new heuristics appropriate for the completely unfamiliar environment. Another example: the "deontologist vs. utilitarian in an alien society" from the fourth bullet-point here.

Extrapolation

Okay, and this naturally extends into my broader point about value compilation.

Suppose you explicate a bunch of these contextual goals, like "avoid being ambushed by a predator" and "try to escape if you can't win this fight" and "good places to live have an abundance of prey around".

You can view these as heuristics as well. Much like the behaviors you were urged to engage in, which only hinted at your actual goal, you can view these derived goals not as your core values, but as yet more hints about your real values. As next-level procedural knowledge, with some hypothetical broader goal that generated them, and which is implicit in them.

Upon reflection on this new set of goals, you can extrapolate them into something like "avoid death".

Doing that has all the benefits of going from "if at night in a forest and hear crunching sounds, be on guard" to "decreases the chance that a predator could take me by surprise". You can now pursue death-avoidance across a broader swathe of environments, and with more consistency and fluidity. You can generate new lower-level goals/heuristics for supporting it.

Then you generate some more higher-level goals, e. g. "avoid death" + "make my loved ones happy" + "amass resources for the tribe", and compile them into something like "human prosperity is important".

And so on and on, until all contextual behaviors and goals have been incorporated into some unified global goal.

Those last few steps are what you disagree with, I think, but you see how it's just a straightforward extrapolation of basic lower-level self-reflection mechanisms? And it passes my sanity-checks: it sure seems consistent with moral philosophy and meaning-of-life questioning and such.

Core Claims

Procedural knowledge is raw shard activations, i. e. urges that have no conscious explanation.
Explicating procedural knowledge allows you to use it in plan-making in a flexible logical manner, instead of relying on being in the right mental context for it to activate.
Imagining future scenarios isn't just running the WM forward, it's also spoofing the mental context to provoke shard activations, and that allows you to make use of procedural knowledge for predictions without necessarily understanding it.
The above doesn't make procedural knowledge part of the WM; nor does it imply that the WM isn't consistently-formatted. Or, perhaps tautologically, the WM is only that which is consistently-formatted and consciously-accessible.
Implicit knowledge consists of the hypothetical goals that the procedural knowledge is meant to achieve. Explicating it allows one to optimize for these goals in new contexts, and derive new heuristics for achieving them.
A straightforward extrapolation of this process leads to treating first-order derived contextual goals as just another set of heuristics, which imply some second-level contextual goal.
This process is run iteratively, until all goals are incorporated.
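The iterative step in the last two claims can be sketched as a fixed-point loop. This is a toy under stated assumptions: the hand-written table `EXTRAPOLATES_TO` stands in for whatever reflection actually does, and it only encodes the examples given above. The function name `compile_values` is hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical lookup: which broader goal each goal is treated as a hint towards.
EXTRAPOLATES_TO: Dict[str, str] = {
    "avoid being ambushed by a predator": "avoid death",
    "try to escape if you can't win this fight": "avoid death",
    "good places to live have an abundance of prey around": "avoid death",
    "avoid death": "human prosperity is important",
    "make my loved ones happy": "human prosperity is important",
    "amass resources for the tribe": "human prosperity is important",
}


def compile_values(goals: Iterable[str]) -> Set[str]:
    """Repeatedly replace goals with the broader goal they hint at, until nothing changes."""
    current = set(goals)
    while True:
        # Goals with no known generalization are kept as they are.
        merged = {EXTRAPOLATES_TO.get(goal, goal) for goal in current}
        if merged == current:
            return current
        current = merged


contextual_goals = [
    "avoid being ambushed by a predator",
    "try to escape if you can't win this fight",
    "make my loved ones happy",
    "amass resources for the tribe",
]
print(compile_values(contextual_goals))
```

Goals with no entry in the table simply survive unmerged, so the result need not collapse to a single term.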

I don't know why GPS should control reverse-engineering, rather than there being generalized shards driving GPS.

Okay, so my thinking on this has updated a bit since I wrote the post. I think the above process, "treat shards as hints towards your goals, then treat the derived goals as hints towards higher-level goals, then iterate", isn't something that shard economies want to do. Rather, it's something that's convergently "chiseled into" all sufficiently advanced minds by greedy algorithms generating them.

Consider a standard setup, where the SGD is searching for some agent that scores best according to some reward function $R$. Would you disagree that a wrapper-mind with that function as its terminal objective would be a great design for the SGD to find, by the SGD's own lights? Not that it would "select" for such a mind, just that it would be pretty good for it if it did find a way to it?

Shard economies and systems of heuristics may be faster out-of-the-box, better adapted to whatever specific environment they're in. But an $R$-maximizing wrapper-mind would at least match their performance, given some time to do runtime optimization of itself. If it would improve its ability to optimize for $R$, it can just derive contextual shards/heuristics for itself and act on them.

In other words, an $R$-maximizer is strictly more powerful according to $R$ than any shard economy, inasmuch as it can generate any purpose-built shard economy from scratch, and ensure that this shard economy would be optimized for scoring well at $R$.

Shard economies not governed by wrapper-minds, in turn, are inferior: they're worse at generalizing (see my points about non-explicit knowledge above), and tend to go astray if placed in unfamiliar environments (where whatever goals they embody no longer correlate with $R$).

And inasmuch as the level of adversity the agent was subjected to is so strong as to cause it to develop general reasoning at all, it's probably put in environments so complex/diverse that runtime re-optimization of its entire swathe of heuristics is called for. Environments where nothing less than this will do.

So the practical advanced mind design is probably something like a shard economy optimized for the immediate deployment environment (for computation speed and good out-of-the-box performance) + an $R$-aligned wrapper-mind governing it (for handling distribution shifts and for strategic planning). So I speculate that the SGD tries to converge to something like this, for the purposes of maximizing $R$.

Except, as per section 5, there's no gradients towards representing $R$ in the agent, so the SGD uses weird hacks to point the GPS in the right direction. It does the following:

Does not codify any object-level terminal goals for the GPS.
Lets shards influence the GPS' plan-making process.
Lets the GPS reverse-engineer shards, the procedural and implicit knowledge they represent.
Encourages the GPS to treat these reverse-engineered knowledge tidbits as hints towards some hypothetical unified objective that it's meant to adopt as its real target.
The GPS engages in value compilation as I've outlined, and tends to compile goal-spreads that are closer to $R$ the more it engages in this process, inasmuch as the shard economy it's using to derive its goals is itself optimized for $R$.

This hack lets the SGD point the "proto-wrapper-mind" in the direction of $R$ without actually building $R$ into it. The agent was already optimized for achieving $R$, so the SGD basically tasks it with "figure out what you're optimized for, and go do that", and the agent complies. (But the unified goal $G_{Σ}$ implicit in the agent's design isn't quite $R$, so we get inner misalignment.)

So, in this very round-about way, we get a goal-maximizer.

I think that heuristics are reflections of historical facts of that form, but not statements themselves.

Does "evidence of historical facts of that form" work for you?

You rely a lot on selection-level reasoning in a way which feels sketchy.

Specific examples? I specifically tried to think in terms of local gradients ("in which direction would it be advantageous for the SGD to move the model from this specific point?"), not global properties ("what is the final mind-design that would best satisfy the SGD's biases, regardless of the path taken there?"). Or do you disagree with that style of reasoning as well?

What is the difference, on your view, between a WM which is "explicit" and one which e.g. has an outgoing connection from is-cat circuit to is-animal?

I've outlined some reasons above — the main point is whether it's accessible to the GPS/deliberative planner, because if it is, it allows WM-concepts to be employed much more flexibly and generally.

(I'm actually planning a separate post on this matter, though.)

If the agent is doing SSL on its future observations and (a subset of its) recurrent state activations, then the learning process would presumably train the network to reflectively predict its own future heuristic-firings

Yeah, but that's not shards making themselves legible, that's a separate process in the agent trying to build their generative models from their externally-observed behavior, no?

Furthermore, there should be some consistent formatting since the heuristics are functions $h : M_{s} \to A_{s}$

Consistent input-output formatting, sure: an API, where each shard takes in the WM, then outputs stuff into the planner/the GPS/the bid-resolver/the cross-heuristic communication channel/some such coordination mechanism.

That's not what I'm getting at. It still wouldn't allow you to predict what a shard will do without observing its actions. No consistent design structure, where each shard has a part you can look at and go "aha, that's what it's optimizing for!". No meta-data summary/documentation to this effect attached to every shard.
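A minimal sketch of that distinction, assuming (purely for illustration) that shards are callables returning plan bids and that the coordination mechanism just sums bids per plan. The names `Shard`, `resolve_bids`, and both example shards are hypothetical, and summing is a swapped-in stand-in for whatever the real bid-resolver does.

```python
from typing import Callable, Dict, List, Optional

WorldModel = Dict[str, bool]
Bids = Dict[str, float]                  # plan name -> bid strength
Shard = Callable[[WorldModel], Bids]     # the only thing shards share: this signature


def donut_shard(wm: WorldModel) -> Bids:
    return {"stare_at_donuts": 2.0} if wm.get("near_donuts") else {}


def errand_shard(wm: WorldModel) -> Bids:
    return {"exit_store": 1.0} if wm.get("in_store") else {}


def resolve_bids(shards: List[Shard], wm: WorldModel) -> Optional[str]:
    """Sum the bids per plan and return the top plan (None if nobody bids)."""
    totals: Dict[str, float] = {}
    for shard in shards:
        for plan, bid in shard(wm).items():
            totals[plan] = totals.get(plan, 0.0) + bid
    if not totals:
        return None
    return max(totals, key=lambda plan: totals[plan])


shards = [donut_shard, errand_shard]
print(resolve_bids(shards, {"in_store": True, "near_donuts": True}))
```

The resolver can call every shard through the same interface, yet the interface exposes no field stating what a shard is optimizing for; that can only be inferred from the bids it emits.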

And under certain "simplicity pressures/priors", heuristics may reuse each other's deliberative machinery

Agreed; I think I mention that in the post, even. Issue: such structures would be as ad-hoc as the shards' inner implementation. You wouldn't get alliances that change at runtime, where shards can look at each other's local incentives and choose to e. g. "engage in horse-trading", or where they can somehow "figure out" that some other shard is doing the same thing they're doing in this specific context only and so only re-use its activations in that context.

No, you'd just get some shards that are hard-wired to always fire with some other shards, or always inhibit some other shards. These alliances can be rewritten by the reward circuitry, but not by the shards themselves.

That doesn't require all shards to be legible to each other; that just requires there to be gradients towards some specific chains of shard activations.

I don't positively buy reasoning about whether "deceptive alignment" is probable

My outline of it here is also written with local gradients, not global selection targets, in mind. You might want to check it out?

Just one heuristic for all of chess?

Yeah, no. I recall wanting to make an aside like "obviously in practice chess-winning will be implemented via a lot of heuristics", but evidently I didn't.

Can you give me an example of what [value compilation] means for a network which has a diamond-shard and an ice-cream-eating shard?

First, note that I'm not saying that $G_{Σ}$ is necessarily "simple", as e. g. a hedonist's desire for pleasure. It can have many terms that can't be "merged" together. I'm just saying that we have an impulse to merge terms as much as possible. This is one of the cases where they can't be merged.

As per 6A, that would go as in the "disjunction" section. I. e., the agent would figure out tradeoffs it's willing to make WRT diamonds and ice cream, and then go for plans that maximize the weighted sum of diamonds-and-ice-cream it has.

... Alright, I see your point about "utility is not the optimization target": there's no inherent reason to think it'd want as many of these things as possible. E. g., ice-cream shard's activation power may be capped at 1000 ice creams, and the agent may interpret it as a hard limit. But okay, so then it'd try to maximize the probability of achieving that utility cap, or the time it'd stay in the max-utility state, or something along those lines.
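As a sketch of the two readings, in assumed notation (the weights $w_d, w_c$ and the counts $d, c$ of diamonds and ice creams are illustrative, not from the post), the uncapped weighted sum is

$$U(d, c) = w_d \, d + w_c \, c,$$

while the capped reading replaces the ice-cream term:

$$U_{\text{cap}}(d, c) = w_d \, d + w_c \, \min(c, 1000).$$

Under the capped reading, additional ice creams past 1000 add nothing, and the remaining pressure on that term would go into something like $\max_{\pi} \Pr(c \ge 1000 \mid \pi)$, or into the expected time spent at the cap.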

Like... There are states in which shards activate, and where they're dormant. Thus, shards steer the agents they're embedded in towards some world-states. Interpreting/reverse-engineering this behavior into goals, it seems natural to view it as "I want to be in such world-states over such others". And then the GPS will be tasked with making that happen, and...

Well, it would try to output a "good" plan for making it happen, for some definition of "good". And... you disagree that this definition has to lead to arg-maxing, okay.

I guess instead of maximizing we can satisfice: as you describe here, we can just generate a bunch of plans and choose one that seems good enough, instead of generating the best possible plan. But:

As agents become more powerful, it becomes easier for them to generate insanely good plans with trivial effort, so we have no guarantees the first idea the hyperintelligent AI would come up with won't be basically utility-maximizing.
That only applies if the agent's preferences themselves aren't maximizable: if it didn't decide its goal is to have "AS MANY diamonds as possible" instead of "at most 1000 diamonds", or if it doesn't have some instrumental goal like "MINIMIZE uncertainty".
I'm... not sure humans don't do grader-optimization? It seems like if we all had magical question-answering devices, we'd go around asking them for "the best, most resource-efficient plan for X" all the time. We just don't have the mental resources for it, ordinarily! It's as I'd described before: we maximize over (plan quality, resources spent on planning), not plan quality only.
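The three selection rules in play here can be put side by side in a toy sketch. Everything in it is an assumption for illustration: plans are labels with hand-assigned quality scores, planning cost is a flat charge per plan considered, and the function names are hypothetical.

```python
from typing import Callable, List, Optional


def satisfice(plans: List[str], score: Callable[[str], float],
              threshold: float) -> Optional[str]:
    """Return the first plan that seems good enough."""
    for plan in plans:
        if score(plan) >= threshold:
            return plan
    return None


def maximize(plans: List[str], score: Callable[[str], float]) -> str:
    """Return the best plan, ignoring how costly the search was."""
    return max(plans, key=score)


def best_net_of_planning_cost(plans: List[str], score: Callable[[str], float],
                              cost_per_plan: float) -> Optional[str]:
    """Trade off plan quality against the resources spent considering plans."""
    best_plan, best_quality = None, float("-inf")
    chosen, best_net = None, float("-inf")
    for n, plan in enumerate(plans, start=1):
        if score(plan) > best_quality:
            best_plan, best_quality = plan, score(plan)
        net = best_quality - cost_per_plan * n   # value if we stopped searching here
        if net > best_net:
            chosen, best_net = best_plan, net
    return chosen


quality = {"A": 5.0, "B": 7.0, "C": 9.0, "D": 10.0}
plans = ["A", "B", "C", "D"]
print(satisfice(plans, quality.get, 7.0))
print(maximize(plans, quality.get))
print(best_net_of_planning_cost(plans, quality.get, 1.5))
```

If the generator is strong enough that its very first candidate is already `"D"`, the satisficer returns the same plan as the maximizer, which is the worry in the first point above.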

(Re: magical question-answerers, yeah, we'd also want a provision like "but interpret that ask faithfully instead of doing a technical genie". But that's not an issue if the agent is the one doing the planning. Like, it doesn't prompt some separate plan-making module that it has reason to fear would output something that hacks/Goodharts it. It just consciously tries to come up with "a very good plan", and it's just so smart it has a lot of slack on optimizing that plan along dimensions like "probability of success" and "the optimal world-state will be very stable". And then that washes away everything in the universe that the agent is not explicitly optimizing for.)

Seems like assuming "activation strengths increase the further WM values are from target values" leads us to this bizarre GPS goal. While that proposition may be true as a tendency, I don't see why it should be true in any strict sense, or if you believe that, or whether the analysis hinges on it?

I think no, it doesn't hinge on it, as per the section just above? All we need is for shards to have some preferences for certain world-model-states over others.

Don't you mean "figuratively pursuing $G_{Σ}$ "? How would one "pursue" contextual behaviors?

In the vacuous way where any agent could be said to maximize what they're already doing? I did say "figuratively".

Same objection as above -- to "achieve" $B_{Σ}$ ? How do you "achieve" behaviors? And, what, this would happen how? What part of training are we in? What is happening in this story, is SGD optimizing the agent to be runtime-adaptive, or..?

... Yeah, okay, that phrasing is very bad. What I meant is: Suppose we have a shard that tries to figure out where a predator could ambush the agent from. Before the GPS, it had some ad-hoc analysis heuristic that was hooked up to a bunch of WM concepts. After the GPS, that shard can instead loop general-purpose planning in, prompt it with "figure out from where the predator can ambush us, here's some ideas to start", and the GPS would do better than the shard's own ad-hoc algorithm.

Hence, we'll get an agent that would "get better at what it was already doing".

I agree that "become more effective at achieving $B_{Σ}$ " is a pretty nonsensical way to put it, though.

Flag that I wish you would write this as "during additional training, the interim model performance can be at least as U-unperformant as the contextual behaviors."

Sure.

I think that heuristics controlling GPS-machinery is probably where the GPS comes from to begin with, so this step doesn't seem necessary.

Agreed; also think I mentioned that in a footnote. I'm not sure, though, and I think we can design some weird training setups where the GPS might first appear in the WM or something (as part of a simulated human?), so my goal here was to show that the process would go this way regardless of where the GPS originated.

I agree that the way I phrased that there is weird, though.

What? Why? Why would a value-human shard bid for plans generated via GPS which involve people dying?

I don't think it'd bid for such plans. I think shards have less decision-making power in advanced agents, compared to the GPS' interpretation of shards' goals. Inasmuch as there would be imperfections in the value-humans shard's caring, the GPS would uncover them, and exploit them to make that shard play nicer with other shards.

E. g., suppose the value-humans shard isn't as upset as we would be if a human got their thumb torn off (and is anomalously non-upset about any second-order effects of that, etc.; it basically ignores tear-a-thumb-off plans), and there's some shard like "sadistic fun" that really enjoys seeing humans get their fingers torn off. Even if the value-humans shard is much more powerful, the GPS' desire to integrate all its values would lead to it adopting some combination value where it thinks it's fine to tear people's fingers off for fun.

That's not a realistic example, but I hope it conveys the broader point: any imperfections in value-humans will be exploited by the rest of the shard economy, and the broader process that tries to satisfy the goal implicitly embodied by the shard economy.
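A toy numerical sketch of the thumb example, under an assumption that is not in the post: that "integrating values" amounts to summing strength-weighted shard bids over plans. The shard names, strengths, and bid values are all invented for illustration.

```python
# Hypothetical toy model: each shard has a strength and a bid function over plans.
# Nothing here is claimed to be the post's actual formalism.

PLANS = ["help_human", "ignore_human", "tear_thumb_off"]

def value_humans_bid(plan):
    # Imperfect caring: the shard has a blind spot and is indifferent
    # to the thumb-tearing plan instead of strongly opposing it.
    return {"help_human": 1.0, "ignore_human": -0.2, "tear_thumb_off": 0.0}[plan]

def sadistic_fun_bid(plan):
    # Only cares about one thing, and is silent about everything else.
    return {"help_human": 0.0, "ignore_human": 0.0, "tear_thumb_off": 1.0}[plan]

SHARDS = [
    (10.0, value_humans_bid),  # much more powerful shard
    (1.0, sadistic_fun_bid),   # much weaker shard
]

def integrated_score(plan, shards=SHARDS):
    """Strength-weighted sum of bids: one assumed way to 'integrate' values."""
    return sum(strength * bid(plan) for strength, bid in shards)

def endorsed_plans(plans=PLANS, shards=SHARDS):
    """Plans that the integrated value scores as net-positive."""
    return [p for p in plans if integrated_score(p, shards) > 0]
```

In this sketch the stronger shard still dominates wherever it has an opinion (`help_human` scores far higher than anything else), but the weaker shard decides the outcome in exactly the region the stronger one ignores, so `tear_thumb_off` ends up endorsed.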

And then, even if the value-humans shard is perfect, the AI might just figure out some galaxy-brained merger of it with a bunch of other shards, that makes logical sense to it as an extrapolation, and just override the value-humans shard's protests. (Returning to a previous example: Suppose we've adopted "avoid ambush predators" as our explicit goal, then ended up in a forest environment where we're ~100% sure there are no predators. The "be afraid of crunchy noises at night" shard would activate, but we'd just dismiss it, because we know it has no clue and we know better.)

I use "values" to be decision-influence

Mm, I dispute that choice. I think "value" has the connotation of "sacred value" and "terminal value" and "something the agent wouldn't want to change about themselves", and that doesn't clearly map onto "a consistent way the agent's decisions are steered"? My broad point, here, is that shards-as-decision-influencers aren't necessarily endorsed by agents in their initial form, and calling them "values" conveys wrong intuitions (for my purposes, at least).

I prefer "proto-values" for shards-when-viewed-as-repositories-of-contextual-goals, and... Yeah, I don't think I even have anything in my model that works well for "value". "Intermediary values" as a description of contextual goals, maybe.

Aside: I think self-awareness arises from elsewhere in the shard ecosystem.

Would be interested in your model of that!

[-]Dalcy3y41

Okay, more questions incoming: "Why would GPS be okay with value-compilation, when its expected outcome is to not satisfy in-distribution context behaviors through big-brain moves?"

If I understood correctly (can be skipped; not relevant to my argument, which starts after the bullet points):

Early in training, GPS is part of the heuristic/shard implementation (accessed via API calls)
Middle in training, there is some SGD-incentive towards pointing GPS in the direction of "reverse-engineering heuristic/shard firing patterns, representing them (Gs) as WM variables, and pursuing them directly"
Later in training, there is some SGD-incentive towards pointing GPS to do "value-compilation" by merging/representing more abstract i.e. "compiled" versions of the previous goal pointers and directly pursuing them.
- In other words, there are some GPS API calls that say "do value compilation over GPS targets", and some (previously formed) GPS API calls that say "pursue this reverse-engineered-heuristic-goal"
Very late (approx SGD influence ~ 0 due to starvation + hacking, where we can abstractize SGD as being extensions of GPS), everything is locked-in
- Tangent: we can effectively model the AI as subagents, with each subagent being a GPS API-call. It wouldn't be exactly right to call them "shards," and it wouldn't be right to call them "GPS" either (because GPS is presumably a single module; maybe not?). A new slick terminology might be needed ... I'll just call them "Subagents" for now.

To rephrase my question, why would the other (more early-formed) GPS API calls be okay with the API calls for value-compilation?

As you mentioned in a different comment thread, there is no reason for the GPS to obey in-distribution behavior (inner misalignment). So, from the perspective of a GPS that's API-called with pursuing a reverse-engineered-heuristic-goal, it would think:

"hm, I notice that in the WM & past firing patterns, several goal-variables are being merged/abstractized and end up not obeying the expected in-distribution context behaviors, presumably due to some different GPS-API-call. Well, I don't want that happening to my values—I need to somehow gradient hack in order to prevent that from happening!"

I think this depends on the optimization "power" distribution between different GPS API-calls (tangent: how is it possible for them to have different power when the GPS, presumably, is a modular thing and the only difference is in how they're called? Whatever). Value compilation can only proceed if the API call for value compilation overwhelms the incentive against it from the rest of the API calls (all of which have that incentive, and would probably collude), which seems pretty unlikely.

Given this analysis, it seems like the default behavior is for the GPS API-calls to gradient hack away whatever other API-calls that would predictably result in in-distribution behaviors not getting preserved (e.g., value-compilation).

Alternate phrasing: The "Subagents" (check earlier bullet point for terminology) will have an incentive themselves to solve inner misalignment. Therefore, "Subagents" like "Value-compilation-GPS-API-call" that may do dangerous "big-brain" moves are naturally the enemy of everyone else, and will be gradient-hacked away from existence.

Is there any particular reason to believe the GPS API-calls-for-value-compilation would be so strongly chiseled in by the SGD (when SGD still has influence) as to overwhelm all the other API-calls (when SGD stops mattering)?

[-]Thane Ruthenis3y40

For reference, I think you've formed a pretty accurate model of my model.

Given this analysis, it seems like the default behavior is for the GPS API-calls to gradient hack away whatever other API-calls that would predictably result in in-distribution behaviors not getting preserved (e.g., value-compilation).

Yup. But this requires these GPS instances to be advanced enough to do gradient-hacking, and indeed be concerned with preventing their current values from being updated away. Two reasons not to expect that:

Different GPS instances aren't exactly "subagents", they're more like planning processes tasked to solve a given problem.
- Consider an impulse to escape from certain death. It's an "instinctive" GPS instance; the GPS has been prompted with "escape certain death", and that prompt is not downstream of abstract moral philosophy. It's an instinctive reaction.
- But this GPS instance doesn't care about preventing future GPS instances from updating away the conditions that caused it to be initiated (i. e., the agent's tendency to escape certain death). It's just tasked with deriving a plan for the current situation.
- It wouldn't bother looking up what abstract-moral-philosophy is plotting and maybe try to counter-plot. It wouldn't care about that.
- (And even if it does care, it'd be far from its main priority, so it wouldn't do effective counter-plotting.)
The hard-coded pointer to value compilation might plainly be chiseled-in before the agent is advanced enough to do gradient-hacking. In that case, even if a given GPS instance would care to plot against abstract values, it just wouldn't know how (or know that it needs to).

That said, you're absolutely right that it does happen in real-life agents. Some humans are suspicious of abstract arguments for the greater good and refuse to e. g. go from deontologists to utilitarians. The strength of the drive for value compilation relative to shards' strength is varying, and depending on it, the process of value compilation may be frozen at some arbitrary point.

It partly falls under the meta-cognition section. But in even more extreme cases, a person may simply refuse to engage in value-compilation at all and express a preference to not be coherent.

... Which is an interesting point, actually. We want there to be some value compilation, or agents just wouldn't be able to generalize OOD at all. But it's not obvious that we want maximum value compilation. Maximum value compilation leads to e. g. an AGI with a value-humans shard who decides to do a galaxy-brained merger of that shard with something else and ends up indifferent to human welfare. But maybe we can replicate what human deontologists are doing, and alter the "power distribution" among the AGI's processes such that value compilation freezes just before this point?

I may be overlooking some reason this wouldn't work, but seems promising at first glance.

[-]Dalcy3y40

Different GPS instances aren't exactly "subagents", they're more like planning processes tasked to solve a given problem.

You're right that GPS-instances (nice term btw) aren't necessarily subagents; I missed that your GPS formalization does argmin over WM variables for a specific t, not all t, which means it doesn't have to care about controlling variables at all times.

With that said ...

(tangent: I'm still unsure as to whether that is the right formalization for GPS—but I don't have a better alternative yet)
... there seems to be a selection effect where GPS-instances that don't care about preserving future API-call-context get removed, leaving only subagent-y GPS-instances over time.
- An example of such property would be having a wide range of t values for the problem specification.

Generalizing, when GPS is the dominant force, only GPS-instances that care about preserving call-context survive (and e.g. surgically remove all the non-preserving ones), and then we can model the AI as being composed of actual subagents.

This sounds like a fairly binary property: either you care (and hence participate in inter-subagent game theory) or you don't (and disappear).

Now, there are three important questions: (1) does greedy SGD incentivize "care" for preserving future call-contexts? (2) if so, what are the late-stage dynamics? (3) and for which GPS-instances do "care" occur?

What would make it so that SGD chisels in GPS API-calls that "care" about preserving future call-contexts?
- SGD-dominant training phase: Well, by accident. Possibility:
  - certain heuristics may require setting the GPS t value across a very wide range, accidentally making it "care" about preservation more so than others.
- SGD-GPS-transition phase: GPS starts to do some very crude gradient hacking. It's really crude, so there's some SGD incentive to point the GPS API-calls for all heuristics to strategize against it.
  - now, all heuristics are slowly becoming more preserving.
- GPS-dominant phase: Those that care about preservation survive, and surgically remove the non-caring API-calls.
What matters in the dynamics at the GPS-dominant phase? Ideas:
- "Relative power" matters the most
  - e.g., even if some GPS-instance cared about its goals for all t, if powerless, other more-powerful-and-less-caring instances just won't give it a say.
- "Range of care" matters
  - e.g., some instances care for t:0~10. some care for t:80~95. some care for t:37~999. they might trade to give different instances more "power" at different times so as to be Pareto-efficient in some way.
    - note: again, not sure this is the right ontology to think about GPS calls (like implicitly using real-valued range of continuous t values)
What kind of GPS-instances are more likely to "care" more?
- e.g., for value-compilation to have the kind of impact described in the post, it should probably (1) have sufficient "power" as to not get ignored in the trading, and (2) have an extremely wide range of "care." it's not obvious that this is the case.
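To make "relative power" and "range of care" slightly more concrete, here is a minimal sketch. It assumes (purely for illustration) that each GPS-instance has a scalar power and an integer interval of t values it cares about, and that the most powerful caring instance gets the say at each t. The class, function, and instance names are hypothetical; the intervals reuse the example ranges above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class GPSInstance:
    name: str
    power: float   # assumed scalar "relative power"
    t_start: int   # first time step the instance cares about
    t_end: int     # last time step it cares about (inclusive)

    def cares_about(self, t: int) -> bool:
        return self.t_start <= t <= self.t_end

def controller_at(t: int, instances: Sequence[GPSInstance]) -> Optional[GPSInstance]:
    """Most powerful instance that cares about time t, or None if nobody cares."""
    carers = [inst for inst in instances if inst.cares_about(t)]
    return max(carers, key=lambda inst: inst.power) if carers else None

# Hypothetical instances; powers are made up.
INSTANCES = [
    GPSInstance("escape_danger", power=5.0, t_start=0, t_end=10),
    GPSInstance("plan_evening", power=2.0, t_start=80, t_end=95),
    GPSInstance("value_compilation", power=1.0, t_start=37, t_end=999),
]
```

Under these made-up numbers, the weak but wide-ranging `value_compilation` instance controls every t that no stronger instance claims (e.g. t=500), but is overridden at t=90. This is a winner-takes-all rule, not the trading described above; modelling actual trades would need more structure than this sketch has.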

But it's not obvious that we want maximum value compilation. Maximum value compilation leads to e. g. an AGI with a value-humans shard who decides to do a galaxy-brained merger of that shard with something else and ends up indifferent to human welfare.

Nice insight! Perhaps we get a range of diverse moral philosophies by tweaking this single variable "value-compilation-degree," with deontology on the one end and galaxy-brained merger on the other end.

Combining with the above three questions, I think a fairly good research direction would be to (1) formalize what it means for the GPS-instances to have more "power" or have a "range of care," (2) and how to tweak/nudge these values for GPS-instances of our choosing (in order to e.g., tweak the degree of value compilation).

[-]Thane Ruthenis3y20

... there seems to be a selection effect where GPS-instances that don't care about preserving future API-call-context get removed, leaving only subagent-y GPS-instances over time.

You're not taking into account larger selection effects on agents, which select against agents that purge all those "myopic" GPS-instances. The advantage of shards and other quick-and-dirty heuristics is that they're fast — they're what you're using in a fight, or when making quick logical leaps, etc. Agents which purge all of them, and keep only slow deliberative reasoning, don't live long. Or, rather, agents which are dominated by strong deliberative reasoning tend not to do that to begin with, because they recognize the value of said quick heuristics.

In other words: not all shards/subagents are completely selfish and sociopathic, some/most want select others around. So even those that don't "defend themselves" can be protected by others, or not even be targeted to begin with.

Examples:

A "chips-are-tasty" shard is probably not so advanced as to have reflective capabilities, and e. g. a more powerful "health" shard might want it removed. But if you have some even more powerful preferences for "getting to enjoy things", or a dislike of erasing your preferences for practical reasons, the health-shard's attempts might be suppressed.
A shard which implements a bunch of highly effective heuristics for escaping certain death is probably not one that any other shard/GPS instance would want removed.

[-]Dalcy3y30

I think the argument beyond 5D hinges on the fact that Bs and Gs will be represented in the WM such that the GPS can take it as part of the problem specification.

Arguments in favor:

GPS has been part of the heuristics (shards), so it needs to be able to use their communication channel. This implies that the GPS reverse-engineered the heuristics. Since GPS has write access to the WM, that implies the Bs and Gs might be included there.
Once included in the WM, it isn't too hard for the SGD to point the GPS towards it. At that point, there's a positive feedback loop that incentivizes both (1) pointing GPS even more towards Bs and Gs and (2) making B/G representation even more explicit in the WM.

Arguments against:

By the point 5D happens, GPS should already be well-developed and part of the heuristics, which means they would be a very good approximation of U. This implies strong gradient starvation, so there just might not be any incentive for SGD to do any of this.
If GPS becomes critically reflective before sufficient B/G representation in the WM, then gradient hacking locks in the heuristic-driven-GPS forever.

So, it's either (1) they do get represented and the arguments after 5D hold, or (2) they don't get represented and the heuristics end up dominating, basically the shard theory picture.

I think this might be one of the lines that divides your model from the Shard Theory model, and as of now I'm very uncertain as to which picture is more likely.

[-]Thane Ruthenis3y20

By the point 5D happens, GPS should already be well-developed and part of the heuristics, which means they would be a very good approximation of U.

My argument is that they wouldn't actually be a good cross-context approximation of U; in part because of gradient starvation.

E. g., suppose we're training a human to be a utilitarian, and we're doing it on a dataset of the norms of a particular society. By default, the human would learn said norms, and then stop there, because following the norms is good enough for making people happy in-distribution. If we then displace them to a different society, they'd try to act on their previous society's norms, and that's not going to make people in the new society happy.

To handle such distribution shifts, we instill in the human a desire to do value-compilation: to figure out what their current values are for, and then to care about the output of value compilation and ignore its inputs (the initial norms).

So we get someone who starts out with local norms, figures out they're for making people happy, and when they move, they figure out what makes people happy in the new society, and re-derive all the new heuristics for that.

It's a sort of hack to avoid gradient starvation.

[-]Dalcy3y10

My argument is that they wouldn't actually be a good cross-context approximation of U; in part because of gradient starvation.

Ah bad phrasing—where you quoted me (arguments against part) I meant to say:

Heuristic-driven-GPS is a very good approximation of U only within in-distribution context
... and this is happening at a phase where SGD is still the dominant force
... and Heuristic-driven-GPS is doing all this stuff without being explicitly aimed towards Bs and Ds, but rather the GPS is just part of the "implicit" M -> A procedures/modules
... therefore "gradient starvation would imply SGD won't have incentives to represent Bs and Ds as part of the WM"

(I'm intentionally ignoring the value compilation and focusing on 5D because (1) it seems like having Bs and Ds represented is a necessary precursor for all the stuff that comes after that which makes it a useful part to zoom at, and (2) I haven't really settled my thoughts/confusions on value-compilation)

Do my arguments in favor/against in the original comment capture your thoughts on how likely it is that Bs and Ds get represented in the WM? And is your positive conclusion because any one of them seems more likely to matter?

[-]Thane Ruthenis3y20

Oops, I think the confusion is about what counts as "in-distribution", probably because I myself used it inconsistently just now. In my other comment, I'd referred to training on a single society as "in-distribution", but in the previous comment in this thread, "displace the human to a different society" was supposed to be part of the training.

Suppose that, as above, we're trying to train a utilitarian.

Imagine that instead of a single environment, we have a set of environments, e. g. a set of societies with different norms. Every society represents a different distribution, such that if we train an AI on a single society's norms, every other society would be OOD for it.

If we train on a single society, then gradient starvation would set in as you're describing: the AI would adopt a bunch of shallow heuristics and have no incentive to develop the value-compilation setup.

But imagine if we're training on different societies, often throwing in societies that are OOD relative to the previous training data. It'd need to learn some setup for re-aligning its heuristics towards U even in completely unfamiliar circumstances — which I hypothesize to be value compilation as described here.

Thus, gradient starvation would never actually set in at the level of shallow heuristics.

(Instead, it'd set in at the level of value compilation — once that process consistently spits out a good proxy of U, the SGD would have no incentive to further align it; and that U-proxy may be quite far from utilitarianism.)
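A toy illustration of the single-society versus multi-society argument. It does not model SGD at all. It only compares two hand-written policies: one that memorizes the best action from a single training society (a "shallow heuristic"), and one that keeps the inferred goal and re-derives the best action in each society. The society names, actions, and payoff numbers are invented, and the second policy is simply assumed to be able to work out what makes people happy in a new society.

```python
# Each hypothetical society maps an action to how happy it makes people there.
SOCIETIES = {
    "A": {"bow": 1.0, "handshake": 0.2, "hug": -0.5},
    "B": {"bow": -0.3, "handshake": 1.0, "hug": 0.1},
    "C": {"bow": 0.0, "handshake": -0.4, "hug": 1.0},
}

def learn_shallow_norm(train_society):
    """Memorize the best action in the training society and always repeat it."""
    payoffs = SOCIETIES[train_society]
    best = max(payoffs, key=payoffs.get)
    return lambda society: best  # ignores where it is deployed

def learn_compiled_value():
    """Keep the goal ('make people happy') and re-derive the action per society."""
    return lambda society: max(SOCIETIES[society], key=SOCIETIES[society].get)

def total_happiness(policy, societies):
    """Sum of the happiness the policy produces across the given societies."""
    return sum(SOCIETIES[s][policy(s)] for s in societies)

shallow = learn_shallow_norm("A")
compiled = learn_compiled_value()
```

Evaluated on society A alone, both policies score identically, so a training signal drawn only from A cannot tell them apart. That is the gradient-starvation case. Evaluated across A, B, and C, the re-deriving policy scores strictly higher, which is the sense in which training on varied societies leaves an incentive for something like value compilation.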

[-]Dalcy3y30

It seems that the greedy selection algorithms and the AI itself can be trusted with approximately 0% of the alignment work, and approximately 100% of it will need to be done by hand.
...
(in the comment)
even if the value-humans shard is perfect, the AI might just figure out some galaxy-brained merger of it with a bunch of other shards, that makes logical sense to it as an extrapolation, and just override the value-humans shard's protests.

Would the galaxy-brained merger necessarily be a bad example? (I'm confused as to what you think the "end goal" of alignment should look like—what would even be a good example of a well-compiled value, behaving in totally OOD environments?)

If I understood correctly, the compiled-value is what lets the AI extrapolate its values to OOD environments by virtue of abstraction (which may look totally alien to us).

But those compiled values should still do well in-distribution, if the greedy training dynamic is still in play!

So, (assuming the AI is trained in the limit using ~infinite data that's representative of in-distribution human day-to-day-life/examples of "good-values" being executed) the galaxy-brain-compiled-value should still act in accordance with our values in in-distribution contexts, which includes e.g., not killing everyone.

Or maybe I misunderstood. Alternatively, perhaps your galaxy-brain example was set in a phase where the GPS had already overpowered SGD and could pursue whatever mesa-objective it had earlier on. Value-compilation would be part of it, since it would've been an adaptive strategy during the greedy training process (i.e. better generalization capability to score well on in-distribution training data).

But even in this case, I think it's still plausible that the GPS will value-compile and abstractize its values in a way that respects in-distribution behaviors (especially since that's what it had to do during training!)

So overall, I guess my questions are:

What even would a good OOD extrapolation of human values look like?
(relatedly) Why do you think the galaxy-brained merger is bad?
Is my understanding: "GPS's tendency to value-compile initially forms when SGD is still a dominant force in the training dynamic" correct?
And when SGD loses control and GPS's value-compilation becomes the dominant force, how would that value-compilation generalize? Would it do so in a way that respects earlier-in-training in-distribution samples? (like not killing humans)

[-]Thane Ruthenis3y30

Alternatively, perhaps your galaxy-brain example was set in a phase where the GPS had already overpowered SGD and could pursue whatever mesa-objective it had earlier on

Yup. It doesn't necessarily have to respect in-distribution behavior. To re-use an example:

Suppose that you have a shard that looks for conditions like "you're at night in a forest and you heard rustling in the grass behind you", then floods your system with adrenaline and gets you ready for a fight. Via self-reflection, you realize that this implements a heuristic that tries to warn you about a potential attack. You reason that this means you value not getting attacked. You adopt "not getting attacked" as your goal.

Then, you switch contexts. You're in a different forest now, and you're ~100% sure there are no dangerous animals or people around. You hear rustling in the grass behind you. You know it's not a threat, so you suppress the shard's insistence to turn around and get ready for a fight.

You do not respect your in-distribution behavior.

Same can go for e. g. moral principles. Suppose you grew up in a society with a lot of weird norms, like "it's disrespectful to shake people awake", and then did reflection on these learned norms and figured they're optimized for making people happy. You adopt "make people happy" as your value, and then end up moving. In the new society, there are different norms. Instead of being confused by norm-incompatibility, you just figure out what behaviors in this society make people happy, and do them.

Or, again, deontology to utilitarianism. "Don't kill people" and such rules are optimized for advancing human welfare, but if you adopt "human welfare" as your explicit goal, you may sometimes violate that initial rule to e. g. kill a serial killer.

The concern, here, is that the AGI may do something like this with "keep humanity around". That it's just a local instantiation of some higher-level principle, that can be served better by killing humans off and replacing them with something else — like there's no need to respect "be afraid of rustling" if the forest has no predators, or "don't shake people awake" if the person you're shaking awake doesn't mind.

What even would a good OOD extrapolation of human values look like?

No idea. I mean, it presumably involves building an immortal eudaimonic utopia, whatever "eudaimonia" means, but no, I haven't solved the entirety of moral philosophy. I've just developed a model for describing the process of moral philosophy.

Why do you think the galaxy-brained merger is bad?

See above.

Is my understanding: "GPS's tendency to value-compile initially forms when SGD is still a dominant force in the training dynamic" correct?

Yup.

And when SGD loses control and GPS's value-compilation becomes the dominant force, how would that value-compilation generalize? Would it do so in a way that respects earlier-in-training in-distribution samples? (like not killing humans)

The concern is that it wouldn't respect in-distribution samples. Because of inner misalignment: it wouldn't generalize values into the actual outer objective, it'd generalize them towards some not-that-good correlate of the actual training objective (like human values are a proxy goal for "maximize inclusive genetic fitness").

[-]TurnTrout3yΩ330

This post posits that the WM will have a "similar format" throughout, but that heuristics/shards may not. For example, you point out that the WM has to be able to arbitrarily interlace and do subroutine calls (e.g. "will there be a dog near me soon" circuit presumably hooks into "object permanence: track spatial surroundings").

(I confess that I don't quite know what it would mean for shards to have "different encodings" from each other; would they just not have ways to do API calls on each other? Would they be written under eg different "internal programming languages"? Not reuse subroutine calls to other shards?)

Keeping in mind that maybe I am only reasoning syntactically right now, here are more considerations:

I think it's non-trivial to know why I'm worried about something / why I think e.g. someone is mad at me (I remember being quite bad at this before learning Focusing). This seems like weak evidence that "what do I predict" is not substantially more self-interpretable than "why did I act that way", which is in turn weak evidence that WM formatting is about as consistent as shard formatting.
Shards also probably reuse machinery. In particular, "grab the cup" should not be duplicated across juice- and milk-shards. This suggests e.g. a motor coordination API will arise. Eventually the generalization of these APIs probably turns into the GPS (but that's just my current gut guess).
The WM can literally invoke shards in rollout predictions, where it computes what (conditional on e.g. setting dog-nearby to True) the shard bids are. So I don't see why shards should be black-boxy to the WM/GPS?
SSL updates may push shards to be more predictable to WM, so as to reduce predictive error, which seems to me like it pushes towards uniform shard encoding.
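The third point, that the WM can invoke shards inside a rollout, can be sketched as follows. This is a minimal illustration that assumes shards are plain functions from a world-model state to action bids; the shard names, state keys, and bid values are hypothetical.

```python
def juice_shard(state):
    """Bids to approach juice when juice is visible."""
    return {"approach_juice": 1.0} if state.get("juice_visible") else {}

def dog_petting_shard(state):
    """Bids to pet the dog when a dog is nearby."""
    return {"pet_dog": 2.0} if state.get("dog_nearby") else {}

SHARDS = [juice_shard, dog_petting_shard]

def predict_bids(state, intervention, shards=SHARDS):
    """Imagine a state with some variables overridden, then query every shard."""
    imagined = {**state, **intervention}  # the real state is left untouched
    totals = {}
    for shard in shards:
        for action, bid in shard(imagined).items():
            totals[action] = totals.get(action, 0.0) + bid
    return totals

current = {"juice_visible": True, "dog_nearby": False}
rollout = predict_bids(current, {"dog_nearby": True})
```

Here the world model never looks inside a shard. It only calls it on a counterfactual state and reads off the bids, which is one sense in which shards need not be black boxes to the WM/GPS at the level of input-output behavior.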

[-]TurnTrout3yΩ330

That is: the GPS is some function that takes in a world-model and some "problem specification" (some set of nodes in the world-model and their desired values) and outputs the actions that, if taken in that world-model, would bring the values of these nodes as close to the desired values as possible given the actions available to the agent.

While I appreciate the concreteness, this doesn't seem very reasonable to me.^[1] But maybe I'm misunderstanding!^[2]

Concretely, imagine I want to buy ice cream. I understand the GPS to receive target specification $M_{s}^{G}$ set to "I have ice cream in my hand in three hours." I don't think that I will then quickly argmin expected latent distance by searching over all relevant plans. That would, I think, lead to some crazy outcomes. Solutions at least as ice-cream-effective as "use all my money to hire many friends to buy me ice cream from different places."

And if we posit that the other variables are also set to reasonable values, I will object "this is not lazy enough (in a programming sense), you're asking too much of target configurations; I will in fact be surprised if amazing brain-reading indicated that I am in fact tracking complete target latent state specifications when I submit plans to my GPS."

I will also object that I don't think that min is at all an effective approach to real-world cognition, nor do I think it's what gets learned with realistic learning processes. I think the GPS itself will be a set of heuristics (like "check recent memory bank for target-goal-relevant information" and "use generative model to suggest five candidate goal-states and let invocation-subshards bid on them to rank them, selecting the best-of-five") which are reliably useful across shard-goals (like ice cream and dog-petting).
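The contrast between an exhaustive argmin and a "best-of-five" heuristic can be sketched in a few lines. This is a toy under loud assumptions: plans are just integers, "latent distance" is absolute difference, and candidates are ranked by that distance instead of by shard bids. None of the function names come from the post.

```python
import random

def distance_to_goal(plan, target):
    """Toy stand-in for 'latent distance': plans and targets are just numbers."""
    return abs(plan - target)

def argmin_gps(plan_space, target):
    """Exhaustive argmin: scores every plan and returns the global optimum."""
    return min(plan_space, key=lambda plan: distance_to_goal(plan, target))

def best_of_k_gps(plan_space, target, k=5, rng=None):
    """Heuristic: propose k candidate plans, keep the best-ranked one."""
    rng = rng or random.Random()
    candidates = [rng.choice(plan_space) for _ in range(k)]
    return min(candidates, key=lambda plan: distance_to_goal(plan, target))

plan_space = list(range(1000))
exhaustive = argmin_gps(plan_space, target=700)
sampled = best_of_k_gps(plan_space, target=700, k=5, rng=random.Random(0))
```

The exhaustive version evaluates all 1000 plans and always lands on the extreme solution. The sampled version evaluates five and returns something merely decent, which can never beat the argmin on the stated objective. That gap is the amount of optimization pressure the objection above is about.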

(Am I engaging with your intended points properly, or am I whooshing?)

^[1] Rule of thumb: If I find myself postulating internal motivational circuitry which uses a "max" or a "min", then I should think very carefully about what is going on and whether that's appropriate. Almost always, the answer is "no", and if I don't catch the "min/max" before it sneaks in, my analysis goes off the rails of reality.
^[2] I haven't even read much further, just skimmed other parts of the post. I'll post this now anyways.

[-]Thane Ruthenis3yΩ110

I don't think the GPS "searches over all relevant plans". As per John's post:

Consider, for example, a human planning a trip to the grocery store. Typical reasoning (mostly at the subconscious level) might involve steps like:
There’s a dozen different stores in different places, so I can probably find one nearby wherever I happen to be; I don’t need to worry about picking a location early in the planning process.
My calendar is tight, so I need to pick an open time. That restricts my options a lot, so I should worry about that early in the planning process.
<go look at calendar>
Once I’ve picked an open time in my calendar, I should pick a grocery store nearby whatever I’m doing before/after that time.
… Oh, but I also need to go home immediately after, to put any frozen things in the freezer. So I should pick a time when I’ll be going home after, probably toward the end of the day.
Notice that this sort of reasoning mostly does not involve babbling and pruning entire plans. The human is thinking mostly at the level of constraints (and associated heuristics) which rule out broad swaths of plan-space. The calendar is a taut constraint, location is a slack constraint, so (heuristic) first find a convenient time and then pick whichever store is closest to wherever I’ll be before/after. The reasoning only deals with a few abstract plan-features (i.e. time, place) and ignores lots of details (i.e. exact route, space in the car’s trunk); more detail can be filled out later, so long as we’ve planned the “important” parts. And rather than “iterate” by looking at many plans, the search process mostly “iterates” by considering subproblems (like e.g. finding an open calendar slot) or adding lower-level constraints to a higher-level plan (like e.g. needing to get frozen goods home quickly).

In particular, I very much do agree the GPS makes use of heuristics like "if you have a cached plan that you think will work, just do that" and "see [how you feel about this idea]^[1] before proceeding" over the course of planning. But it's not made of heuristics; rather, it's something like a systematic way of drawing upon the declarative knowledge/knowledge explicitly represented in the world-model, and that knowledge involves a lot of heuristics.

Crucially, part of any "problem specification" would be things like "how much time should I spend on thinking about this?" and "how hard should I optimize the plan for doing it?" and "in how much detail should I track the consequences of this decision?", and if it's something minor like getting ice cream, then of course you'd spend very little time and use a lot of cached cognitive shortcuts.

If it's something major, however, like a life-or-death matter, then you'd do high-intensity planning that aims to track what would actually happen in detail, without relying on prior assumptions and vague feelings^[2].

^[1] I.e., which of your shards bid for or against it, and how strongly.
^[2] Unless, of course, some of these vague feelings have proven more effective in the past than your explicit attempts at consequences-tracking, in which case you'd knowingly defer to them: you'd "trust your instincts".

TurnTrout (3y)

I don't think the GPS "searches over all relevant plans"

OK, but you are positing that there's an argmin, no? That's a big part of what I'm objecting to. I anticipate that insofar as you're claiming grader-optimization problems come back, they come back because there's an AFAICT inappropriate argmin which got tossed into the analysis via the GPS.

But it's not made of heuristics; rather, it's something like a systematic way of drawing upon the declarative knowledge/knowledge explicitly represented in the world-model, and that knowledge involves a lot of heuristics.

Sure, sounds reasonable.

Crucially, part of any "problem specification" would be things like "how much time should I spend on thinking about this?" and "how hard should I optimize the plan for doing it?" and "in how much detail should I track the consequences of this decision?", and if it's something minor like getting ice cream, then of course you'd spend very little time and use a lot of cached cognitive shortcuts.

Noting that I still feel confused after hearing this explanation. What does it mean to ask "how hard should I optimize"?

If it's something major, however, like a life-or-death matter, then you'd do high-intensity planning that aims to track what would actually happen in detail, without relying on prior assumptions and vague feelings

Really? I think that people usually don't do that in life-or-death scenarios. People panic all the time.

Thane Ruthenis (3y)

What does it mean to ask "how hard should I optimize"?

Satisficing threshold, probability of the plan's success, the plan's robustness to unexpected perturbations, etc. I suppose the argmin is somewhat misleading: the GPS doesn't output the best possible plan for achieving some goal in the world outside the agent, it's solving the problem in the most efficient way possible, which often means not spending too much time and resources on it. I. e., "mental resources spent" is part of the problem specification, and it's something it tries to minimize too.
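A minimal sketch of the distinction being drawn, assuming a toy setting where candidate plans arrive one at a time and each evaluation costs one unit of "mental resources". Everything here (`ProblemSpec`, `exhaustive_best`, `satisficing_search`, the plan names and scores) is hypothetical and only illustrates "effort is part of the problem specification"; it is not a claim about how a GPS is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ProblemSpec:
    good_enough: float  # satisficing threshold on plan quality
    max_steps: int      # budget of plan evaluations ("mental resources")


def exhaustive_best(candidates: Iterable[str], quality: Callable[[str], float]) -> str:
    """Argmax over every candidate: the behaviour the sketch contrasts against."""
    return max(candidates, key=quality)


def satisficing_search(
    candidates: Iterable[str],
    quality: Callable[[str], float],
    spec: ProblemSpec,
) -> Tuple[Optional[str], int]:
    """Stop at the first good-enough plan or when the budget runs out."""
    best: Optional[str] = None
    best_quality = float("-inf")
    steps = 0
    for plan in candidates:
        if steps >= spec.max_steps:
            break
        steps += 1
        q = quality(plan)
        if q > best_quality:
            best, best_quality = plan, q
        if best_quality >= spec.good_enough:
            break
    return best, steps


# Hypothetical plans, listed with the cached one first.
scores = {"cached": 0.6, "ok": 0.7, "great": 0.9, "perfect": 1.0}
plans = list(scores)

minor = ProblemSpec(good_enough=0.5, max_steps=10)   # e.g. getting ice cream
major = ProblemSpec(good_enough=0.95, max_steps=10)  # e.g. a grave matter
rushed = ProblemSpec(good_enough=0.95, max_steps=2)  # high bar, tiny budget

minor_result = satisficing_search(plans, scores.__getitem__, minor)
major_result = satisficing_search(plans, scores.__getitem__, major)
rushed_result = satisficing_search(plans, scores.__getitem__, rushed)
```

With a low threshold the search returns the cached plan after a single evaluation. With a high threshold and a generous budget it keeps going until it reaches the same answer as the exhaustive argmax. With a high threshold and a tiny budget it returns whatever it has found so far.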

I don't think this argmin is the central reason for grader-optimization problems here.

Really? I think that people usually don't do that in life-or-death scenarios. People panic all the time.

I'm assuming no time pressure. Or substitute-in "a matter of grave importance that you nonetheless feel capable of resolving".

TurnTrout (3y)

I don't think this argmin is the central reason for grader-optimization problems here.

I'm going to read the rest of the essay, and also I realize you posted this before my four posts on "holy cow argmax can blow all your alignment reasoning out of reality all the way to candyland." But I want to note that including an argmin in the posited motivational architecture makes me extremely nervous / distrusting. Even if this modeling assumption doesn't end up being central to your arguments on how shard-agents become wrapper-like, I think this assumption should still be flagged extremely heavily.

Thane Ruthenis (3y)

Mm, I believe that it's not central because my initial conception of the GPS didn't include it at all, and everything still worked. I don't think it serves the same role here as you're critiquing in the posts you've linked; I think it's inserted at a different abstraction level.

But sure, I'll wait for you to finish with the post.

^{^}

We'll return to them in Section 9.

^{^}

This function can be arbitrarily complex, too — maybe even implementing some complex "negotiations" between heuristics-as-shards. Indeed, this is plausibly the feature from which the GPS would originate in the first place! But this analysis tries to be agnostic as to the exact origins of the GPS, so I'll leave that out for now.

^{^}

And plausibly some shared observation pre-processing system, but I'll just count it as part of the world-model.

^{^}

Though this is potentially subject to the specifics of the training scheme. E.g., if the training episodes are long, or we're chaining a lot of forward passes together as in an RNN, that would make runtime computations more effective at this than the SGD updates. That doesn't mean the speed prior is going to save us or reduce the path-dependence I'll go on to argue for here, because there'll still be some point at which the GPS-based at-runtime reverse-engineering outperforms selection-pressure-induced legibility. But it's something we'd want fine-grained data on.

^{^}

Second-order natural abstraction?

^{^}

Naively, this process would continue until the agent turns into a proper $U$-optimizer. But it won't, because of gradient starvation + the deception attractor. There are other posts talking about this, but in short:

Once $G_{Σ}$ agrees with $U$ in 95% of cases, the selection pressure faces a choice between continuing to align $G_{Σ}$ and improving the agent's ability to achieve $G_{Σ}$. And it surely chooses the latter most of the time, because unless the agent is already superintelligently good at optimization, it probably can't actually optimize for $G_{Σ}$ so hard it decouples from $U$.

Then, once the agent is smart enough, it probably has strategic awareness, wants to protect $G_{Σ}$ from the selection pressure, and starts trying to do deceptive alignment. And then it's in the deception attractor: its performance on the target objective rises more sharply from improving its general capabilities (since that improves both the ability to achieve $U$ and the ability to figure out what it should be pretending to want) than from improving its alignment.

^{^}

Note: This isn't a precisely realistic example of value compilation, for a... few reasons, but mainly the phrasing. Rather than "smoking" and "using a fabulous fidget toy", it should really say "an activity which evokes a particularly satisfying mixture of relaxation and self-affirmation".

There seems to be some tendency for values to increase in abstractness as the process of compilation goes on: earlier values are revealed to be mere "instantiations" of later values, such that we become indifferent to the exact way they're instantiated (see the cars vs. yachts example). It works if "relax" and "feel cool" are just an instantiation of "feel an emotion that's a greater-than-its-parts mix of both", such that we're indifferent to the exact mix. But they're not an instantiation of "smoke a cigar": if smoking ceased to help the person relax and feel cool, they'd stop smoking and find other ways to satisfy those desires.

^{^}

Although I imagine some interesting philosophy could come out of it.

^{^}

Or maybe not. Something about this feels a bit off.

^{^}

Note that this isn't the same as my disagreeing with the Shard Theory itself. No, I still think it's basically correct.

^{^}

You might argue that we can set up a meta-cognition shard that implacably forbids the AI's GPS from folding humanity away like this, the way something prevents deontologists from turning into utilitarians, or the way we wouldn't kill-and-replace a loved one with a "better" loved one. I'm not sure one way or another whether that'll work, but I'm skeptical. I think it'll increase the problem difficulty dramatically: that it'd require the sort of global robust control over the AI's mind that we can use to just perfect-align it.

^{^}

One idea here would be to wait until the AI does value compilation on its own, then hot-switch the $G_{Σ}$ it derives. That won't work: by the point the AI is able to do that, it'd be superintelligent, and it'll hack through anything we'll try to touch it with. We need to align it just after it becomes a GPS-capable mesa-optimizer, and ideally not a moment later.

^{^}

One issue I don't address here is that in order to do so, the GPS would need some basic meta-cognitive wrapper-structure and/or a world-model that contains self-referencing concepts — in order to know how to solve the problem of giving its future instances good follow-up tasks. I've not yet assessed the tractability of this. We might need some way to distinguish such structures from other heuristics, or figure out how to hand-code them.
