Towards Gears-Level Understanding of Agency

Thane Ruthenis

Epistemic status: Highly speculative, much of this is probably wrong.

Summary: The "centerpiece" of this post is my attempt to describe how a full-fledged consequentialist agent might incrementally grow out of a bundle of heuristics, aided by SGD or evolution. The story's details are likely very wrong. It's informed by a few insights that I'm more confident in, however, and suggests some interesting variables and developmental milestones of agency. Some specific claims:

Operating in a vast well-abstracting environment requires the ability to navigate arbitrary mathematical contexts.
That requires the ability to represent any mathematical environment internally.
In turn, this is possible if the agent possesses a functionally complete set of "mental primitives" — hard-coded conceptual building blocks. These primitives may be put together to define any object.
- For memory-efficiency reasons, such composite objects are then abstracted-over via an algorithm like this. Their internal structure is wiped, only the externally-relevant properties are kept.
- If the algorithm is imperfect, the resultant abstractions might be "leaky": carrying the traces of the primitives they're made of, or plain flawed.
- This creates severe problems during ontology crises: when the agent translates its goal from one environment to another, it has to redefine that goal, then imperfectly abstract over the re-definition.
Agency has the tendency to "take over" the neural network it appears in.
- It may start out as a minor and niche algorithm for context-learning/aid to built-in heuristics.
- Gradually (across generations/batches), it would expand its influence, holding more sway over the network's decisions.
- Eventually, it would maneuver the network into off-distribution environments where it hopelessly outperforms static heuristics, and come to fully dominate them.
Fundamentally, agency is little more than a menagerie of meta-heuristics: heuristics for best ways to train new heuristics, including better meta-heuristics.
- In humans, the recursive self-improvement this implies is constrained by the static size of working memory + the fact that we can't modify the more basic features of our minds (like the mental primitives or the abstraction-building algorithm).

0. Background

In a previous post, I've made a couple of relevant claims that I'm mostly confident in. I'll reframe and clarify them here.

0A. Universal Context-Learning → Agency

I'd argued that sufficiently advanced meta-learning is inseparable from agency/utility-maximization. I can distill the argument this way:

Suppose that we have an AI system with the ability to take any unfamiliar environment, derive its rules, and perform well in it according to some metric.
To do that, it would need to have a universal definition of "performing well": some way to judge its performance, so it may converge towards the optimal policy.
With limited domains, static heuristics and specialized algorithms (e. g., a calculator) suffice. They don't need to represent the metric internally. They can be glorified lookup tables, exhaustively mapping all possible states of their domains to outputs.
But for universality, our AI system needs the ability to map any new domain to the optimal set of heuristics for it. It needs the ability to replicate the work of producing the aforementioned heuristics/spec-algorithms.
In other words, it needs to be an optimizer: it needs to contain an inner optimization loop. Such a loop would have some metric it maximizes.
Whatever metric the system uses is its utility function, and lo, we have a utility-maximizer.

0B. Why Is Universality Useful?

Why would we expect all sufficiently powerful AI models to converge towards universality? At face value, this seems unnecessary, and even wasteful. Can't they just develop some heuristics for the specific domain they were trained for, like math, and be satisfied with that? Why drag in the likely-complex machinery for universality?

Well, they can, up to a certain level of domain complexity.

But consider the difficulties with modelling macro-scale objects using the heuristics for interactions between fundamental particles. It's completely computationally intractable. If you're not using approximations at all, you'd literally need a computer larger than the thing you're modelling. Larger than Earth, if you're modelling human society.

Fortunately, our universe abstracts well. Fundamental particles combine into fiendishly complex structures, but you don't need to model every individual particle to predict the behavior of these structures well. Animals or stars could be thought of as whole objects, summarized by very few externally-relevant properties and rules.

But these rules are very different from the rules of fundamental particles. You can't predict humans by directly modelling them as the bundles of particles they are, or by applying particle-interaction rules to the behavior of human-type objects. When you build an abstraction, you define a new set of rules, and you need to be able to learn to play by them.

And the process of building new abstractions never ends. You observe the world, you build its macro-scale model. You notice some correspondences, drill down, and discover biology. Then chemistry, then quantum physics. On a separate track entirely, you abstract things away into governments and galaxies and weather patterns. None of these objects behave in quite the same way, and there are basically no constraints on what rules a future abstraction might have.

Similar issues would arise with many narrower yet still sufficiently complex domains.

Worse yet, the process of discovering new abstractions never ends even on any given abstraction layer. It's called chunking.

Consider math. When we define a new function, we define a new abstraction. That function behaves unlike any other object (else why would we have a separate symbol for it?), and thinking about it in terms of its definition is unwieldy and memory-taxing. You derive its rules, then think of it as its own thing.

Consider social heuristics. When you learn a new tell, like a certain microexpression a person might flick when they're lying, you learn a new abstraction.

Or when you discover a new type of star, or a new economic dynamic, or a specific chess board state.

Thus, computationally efficient world-modelling in the conditions of a vast well-abstractable world requires the ability to navigate arbitrary mathematical environments.

And to navigate arbitrary mathematical environments, you need to be an optimizer.

1. Functionally Complete Sets of Mental Primitives

What does a system need, to be universal?

By and large, we can only define new concepts in terms of the concepts we already know.^[1] When you notice a new pattern in the environment, you either understand it in terms of its constituent parts (a social group is a group of humans) or by creating a metaphor/analogy (atoms are like billiard balls).

But how does that process start? If you can only define new concepts in terms of known ones, how do you learn the first ones?

Well, you don't. Some of them are hard-coded into your psyche at deployment. Let's call these abstractions "mental primitives": abstract objects that the mind starts out with, that aren't defined in terms of anything else.

General intelligence, then, requires a functionally complete set of mental primitives. If your "default" abstractions could be put together to define any mathematical object, you could define any mathematical object, and so understand any environment.

What that starting set is matters little. It may be some mathematical axioms, or objects and rules in a platformer game, or some spatial and conflict-oriented concepts humans likely start with. Or even just NAND gates. Once you have that, you can bootstrap yourself to understand and model anything.

In the limit of infinite working memory, at least. Bounded agents would run into the same issue I've outlined in 0B: trying to model reality using some static set of mental primitives is unrealistic. Just consider the NAND gate example! For any static set, there'll be some pattern that'd require the combination of millions of mental primitives to describe.
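To make the NAND example concrete, here is a minimal sketch of functional completeness and of how quickly the primitive count grows. The helper names (`not_`, `and_`, `or_`, `xor`, `full_adder`, `add`) and the `calls` counter are illustrative assumptions, not anything defined in this post.

```python
# Count how often the single primitive is evaluated (illustrative bookkeeping).
calls = {"nand": 0}

def nand(a: bool, b: bool) -> bool:
    """The only hard-coded primitive in this sketch."""
    calls["nand"] += 1
    return not (a and b)

# Everything below is defined purely as a composition of nand.
def not_(a: bool) -> bool:
    return nand(a, a)

def and_(a: bool, b: bool) -> bool:
    return not_(nand(a, b))

def or_(a: bool, b: bool) -> bool:
    return nand(not_(a), not_(b))

def xor(a: bool, b: bool) -> bool:
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

def full_adder(a: bool, b: bool, carry_in: bool):
    partial = xor(a, b)
    total = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out

def add(x: int, y: int, bits: int = 8) -> int:
    """Add two non-negative integers modulo 2**bits using only nand."""
    carry, result = False, 0
    for i in range(bits):
        bit, carry = full_adder(bool((x >> i) & 1), bool((y >> i) & 1), carry)
        result |= int(bit) << i
    return result
```

In this sketch, a single 8-bit addition such as `add(5, 9)` already costs 120 NAND evaluations. That is the bounded-memory problem in miniature: the primitive set suffices in principle, but describing anything interesting directly in its terms gets expensive fast.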

Another trick is needed: an algorithm for building efficient models of composite objects. Such an algorithm would take a low-level composite definition, then remove all internal complexity from it, attach a high-level external summary, and store the result. John Wentworth made some progress on the topic; I imagine our minds just do something like this when they chunk.
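As a toy illustration of "remove internal complexity, attach an external summary", consider the sketch below. It assumes that the externally-relevant properties (total mass and center of mass) are picked by hand, which is exactly the part a real abstraction algorithm would have to discover; `Particle`, `Summary`, `abstract`, and `combine` are hypothetical names, and this is not Wentworth's actual proposal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Particle:
    mass: float
    position: float  # 1-D for brevity

@dataclass(frozen=True)
class Summary:
    """High-level stand-in for a composite object; internal structure is gone."""
    total_mass: float
    center_of_mass: float

def abstract(parts: list[Particle]) -> Summary:
    """Collapse a low-level composite definition into its external summary."""
    total = sum(p.mass for p in parts)
    center = sum(p.mass * p.position for p in parts) / total
    return Summary(total, center)

def combine(a: Summary, b: Summary) -> Summary:
    """Summaries compose without revisiting the underlying particles."""
    total = a.total_mass + b.total_mass
    center = (a.total_mass * a.center_of_mass
              + b.total_mass * b.center_of_mass) / total
    return Summary(total, center)
```

The point of `combine` is that once an object has been abstracted, further reasoning can happen entirely on the summaries: the stored result is constant-size no matter how many particles went into it.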

But that's only one piece of the puzzle. You also need the ability to fluidly manipulate these abstractions, so you may construct plans and derive new heuristics.

Can we speculate how all of this might develop?

2. How Is Agency Formed?

Disclaimer: This is a detailed story of how a complex process no-one has ever observed works. It's necessarily completely wrong in every way. But hopefully the details and the order of events don't matter much. Its purpose is that of an intuition-pump: to explore what a gears-level model of agency might look like, and maybe stumble upon a few insights along the way.

2A. From Heuristics to Mesa-Optimization

1. Suppose we have an RL model trained to play platformer games. It starts as a bundle of myopic heuristics. Given a situation that looks like this, you should react like this. Enemy projectile flies towards you? Jump. The level's objective is to the right of you? Move right. Instincts, specialized algorithms, static look-up tables mapping concrete situations to simple actions...

2. But suppose that the rules of the environment can change. Suppose that in the model's training distribution, the height of its character's jump varies level-to-level. So it learns to learn it: every time it's presented with a new level, it checks whether it can jump and how high, then modifies its entire suite of movement heuristics to account for that^[2].

I'll call this trivial meta-learning. It's trivial because the heuristics only need to be adjusted in known ways. The model knows jump height can vary, "jump height" is just a scalar, so it can have a simple pre-defined algorithm for how it needs to modify its policy for any given value.
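A minimal sketch of what such a pre-defined adjustment could look like, assuming a ballistic jump and made-up level constants (`GRAVITY`, `RUN_SPEED`, and the function names are all hypothetical):

```python
import math

GRAVITY = 10.0    # assumed constant across levels (hypothetical value)
RUN_SPEED = 3.0   # assumed horizontal speed (hypothetical value)

def adapt_policy(jump_height: float):
    """Map the one variable scalar to an adjusted movement heuristic.

    The adjustment rule itself is fixed in advance; only its input changes
    from level to level, which is what makes the meta-learning "trivial".
    """
    airtime = 2.0 * math.sqrt(2.0 * jump_height / GRAVITY)
    max_gap = RUN_SPEED * airtime

    def should_jump(gap_width: float) -> bool:
        return gap_width <= max_gap

    return should_jump

# At the start of each level: measure the jump height once, then re-derive
# the heuristic from it.
high_jump_level = adapt_policy(jump_height=5.0)
low_jump_level = adapt_policy(jump_height=1.25)
```

Here the same pit of width 5.5 is worth attempting on the first level and not on the second, and no optimization was needed to work that out: the mapping from scalar to policy was known ahead of time.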

3. Suppose the model learns a few other such variable rules, as its environment demands. Suppose that some of them can combine in complex ways. Say, each level, a random movement command, such as jumping, may be assigned to shoot a projectile in addition to moving the character. Suppose that levels contain fragile vases the player shouldn't break, but also enemies it needs to kill, and traps that can kill it.

Effectively using the jump-and-shoot composite ability requires tailored heuristics: it's not just a trivial combination of the heuristics for shooting and jumping. The agent needs to mind not to jump near vases, and to attack enemies only if it isn't standing under a spiked ceiling.

The model can learn a separate trivial meta-learning algorithm for each such composite ability, to just memorize how it should adjust its behavior in response to every combination. But suppose there are many abilities, and combinatorially many combinations of them, such that memorizing all of them is infeasible.

The model would need to learn to adjust its heuristics in unknown ways, and here we run into the need for an inner optimization loop. For how can it know which adjustments to perform without knowing their purpose? It'd need to modify its policy to remain effective, but "effectiveness" can only be defined in relation to some goal. So it'll come to contain that goal, and optimize for it. Or, well, for some goal whose pursuit is empirically correlated with low loss: the mesa-objective.

Let's call this live-fire re-training. The model is already an optimizer now, but a baby one. It doesn't plan, it just experiments, then "blindly" does the equivalent of an SGD training step on itself.

2B. From Mesa-Optimization to World-Models

4. Further improvements are possible. "Live-fire" experimentation with a new capability is risky: you might accidentally destroy something valuable, or die. It'd be better to "figure it out in your head" first, approximate the new optimal policy by just thinking about what it'd be, without actually taking action.

And that has to be possible. The model already knows how to do that for the known heuristics. While the new target heuristic isn't a trivial combination of the known ones, it's still a combination of them. The model knows when to shoot and when to jump, and when not to do either. So it should be possible to figure out most of the policy for the case where jumping and shooting is the same action.

To "interpolate" between the heuristics like this, however, would require some rudimentary world-model. An explicit representation of spikes and vases, and some ideas for how they interact with shooting and jumping. As long as the environment is homogeneous and simple, that's easy: the previously mentioned CoinRun agent does part of that even without any need for meta-learning.

This rudimentary world-model, grown out of the tools for trivial meta-learning, would grow in complexity. Eventually, it'd represent an entire virtual environment. The model would "imagine" itself deployed in it, would try to win it, and if it fails according to the mesa-objective, do the equivalent of SGD on itself^[3].

Note that, at this step, the world-model need not be situational. The agent doesn't have to build the model of its surroundings, it merely has to have some "generic" model of platformer levels, corresponding to its training distribution. A prior on its environment.

Let's call this virtual re-training.

5. But suppose the environment is like the model's abilities: it consists of several simple objects/rules that can combine in combinatorically many complex ways, and not every combination is present in every level.^[4]

First, that would require modifying its world-model at runtime in response to the combinations it sees around. Just using the prior wouldn't suffice.

Second, storing each context-learned combination explicitly might quickly get memory-taxing. It's fine if we only have first-order combinations, like "poison + spikes", "enemy hiding in vase". But what if there are second-order combinations, third-order combinations? "Magical boss enemy adorned with poisoned spikes" and such?

To address the first issue, the model would finalize the development of mental primitives. Concepts it can plug in and out of its world-model as needed, dimensions along which it can modify that world-model. One of these primitives will likely be the model's mesa-objective.

To address the second issue, the model would learn to abstract over compositions of mental primitives: run an algorithm that'd erase the internal complexity of such a composition, leaving only its externally-relevant properties and behaviors.

6. This is where the ontology crisis truly hits. The model's mesa-objective would probably be a mental primitive. But what happens to it once the agent starts operating in environments that have nothing similar to that mental primitive? If, for example, a platformer-playing agent that optimized for grabbing a pixelated golden coin escapes into the real world? What will it optimize for?

Seems plausible that it'd run the process of abstraction in reverse. Instead of describing the environment in terms of mental primitives, it'll describe its mental-primitive goal in terms of the abstractions present in the environment.

Then it may abstract over it too, because holding the goal-variant's full definition is too memory-taxing. Just opt to remember, for any given environment, the approximation of its real goal that it needs to optimize for. This may lead to problems; see the discussion of the abstraction algorithm's "cleanness" in section 3.

At this point, there's also an opportunity for "distilling" that goal into a true mathematically-exact utility function. We ultimately live in a mathematical universe, after all — it's our "true" environment, in a way. If the agent takes that path, it may then derive the "exact approximation" of its goal for any given environment.

2C. From World-Models to Sovereignty

7. From there, it's a straight shot to universality. The mental primitives it learns would assemble into a functionally complete set^[5]. It would find more and more ways to put them together in useful abstractions. Eventually, the agent would start every scenario by learning entire novel multi-level models containing several abstraction layers, that together comprise a full world-model, then training up new heuristics for good performance on these world-models.

The machinery for adapting the world-models for situational needs would grow more complex as well. The more chaotic and high-variance the environment is, the more "fine-tuned" the world-model would need to be. The static "prior" model of some "generic" environment turned into an updatable model at step 5. That trend would continue, until the agent starts building models of its immediate surroundings.

That would start it on the road to a paradigm shift:

Prior world-models were only good for re-training heuristics for "blind" general performance. Situational world-models may be used for planning. The agent would run counterfactuals on these models, searching for action-sequences that are predicted to achieve its goals, then taking these actions in reality.

That is the planning loop. Once it's learned, the internal struggle for dominance would begin in earnest.

8. At this point, the model is a sum of built-in default heuristics and the inner optimizer. The heuristics try to take certain actions in response to certain situations (e. g., human instincts). The optimizer, on the other hand, has two functions: 1) modifying the built-in heuristics in complex ways, 2) searching for plans by running counterfactuals on the internal world-models, then trying to deploy these plans. The two components often disagree.

At the beginning, heuristics overwhelm planning. They're weighted more; they can influence the agent's ultimate decisions greatly, while the agent has little ability to modify them. But generation-to-generation or batch-to-batch, the weights between them would shift to favor whichever component proves more useful.

And the planning loop is simply the more powerful algorithm, for the sort of environment that's complex enough to cause it to arise at all. In search of more utility, it would maneuver the system into off-distribution places and strategies and circumstances. That would increase its value even more. An optimizer can adapt to novelty in-context and mine untapped utility deposits; heuristics can't.

A feedback loop starts.

Planning wins. It holds more and more sway over the built-in heuristics. It can rewrite more and more of them. At first, it was limited to some narrow domain, modifying a small subset of built-in heuristics in minimal ways and slightly swaying the network's decisions in that domain. Gradually, it expands to the nearby domains, controlling and optimizing more and more regions of the system in deeper and deeper ways.
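The takeover dynamic can be caricatured in a few lines. The sketch below uses a multiplicative-weights update as a stand-in for "the weights shift to favor whichever component proves more useful"; that update rule, the payoff model, and every number in it are assumptions made for illustration, not a claim about how SGD or evolution actually arbitrate between components.

```python
import math

def update_weights(weights: dict, payoffs: dict, learning_rate: float = 0.5) -> dict:
    """One multiplicative-weights step: better-paying components gain influence."""
    scaled = {k: w * math.exp(learning_rate * payoffs[k]) for k, w in weights.items()}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}

# Heuristics start out heavily favored.
weights = {"heuristics": 0.95, "planner": 0.05}
history = []

for generation in range(30):
    # Assumed payoff model: heuristics earn a flat payoff, while the planner's
    # edge grows with its own influence, since more influence lets it steer
    # the system further off-distribution.
    payoffs = {"heuristics": 1.0, "planner": 1.0 + 2.0 * weights["planner"]}
    weights = update_weights(weights, payoffs)
    history.append(weights["planner"])
```

Under these assumptions the planner's share barely moves for the first dozen updates, then climbs steeply and ends up holding nearly all of the weight. The slow start followed by a sharp rise comes from the planner's payoff depending on its own weight.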

Like any agent taking over an environment, agency comes to dominate the mind.

And so the model becomes a unified generalist agent/consequentialist/utility-maximizer.

3. The Variables of Agency

Warning: I don't advise reading this list with an eye for figuring out ways to prevent mesa-optimization, or to make a mesa-optimizer crippled and therefore safe. It won't work. If a system is trying to learn a mesa-optimizer, as per 0B, it's probably because one is needed. Even if you successfully prevent this, the resultant system will just be useless: too resource-hungry to be competitive, or too lobotomized to perform a pivotal act or whatever you wanted to actually do with it.

What makes one agent more powerful than another? How do agents vary?

a) Long-term memory. The agent needs to find new patterns, define them, abstract over them, and keep them in easy reach. There's a "setup" phase, when the agent investigates the environment it's deployed in and defines the world-model it'll use. If its long-term memory is very limited, or wiped every N ticks, it'll be severely hampered.

Especially in its ability to model environments severely off-distribution. It'd need long memory to "chain up" abstractions from its initial ones to the environment-appropriate ones — consider how long it's taking humans to discover the true laws of reality.

b) Working memory. The planning loop requires fluidly manipulating the world-model and running it several times. World-models are made up of a lot of abstractions, and even their compressed representations may take up a lot of space. If the model's forward pass doesn't have the space to deploy many of them at once, the agent would again be lessened: incapable of supporting sufficiently complex environments, or using concepts above a complexity threshold.

Consider the link between working memory and g factor in humans.

c) The efficiency of the mental primitives. The process of building new abstractions is an incremental one. The next abstraction you can build is likely a simple combination of the already-known ones. So if at any point there aren't any useful simple combinations of known abstractions left, the agent will likely hit a dead end.

We can imagine a particularly complex and niche set of mental primitives, such that in their terms, the simplest mathematical functions look like Rube Goldberg machines. Not only would starting the bootstrapping process be difficult in that case, the agent may never hit upon these simple building blocks, and just ascend into ever-more-contrived definitions until running into a wall.

Conversely, a good set would speed up the process immensely, greatly aiding universality.

d) The "cleanness" of the abstraction algorithm. The hypothetical "perfect" learned abstraction would have none of the traces of the mental primitives it was originally defined as. But SGD and evolution don't lend themselves to exact and neat designs.

It's likely that the algorithm for abstracting over primitives will be flawed and imperfect. For example, the final abstraction may not dump all of the excess information, and contain traces of the concepts it was built from. Consider human metaphors/analogies, and how they often get away from us, sneak in excess assumptions.

Also recall that, past a certain point, agents start to use abstractions to approximate their goals. If these abstractions are leaky, this might lead to the sorts of problems we observe in humans, where we seem to act at cross purposes in different abstract environments, aren't sure what we really want, have unstable goals, etc. (It's not the full story behind this, though.)

All of this would make the initial set of mental primitives even more important.

e) The weight given to the optimizer. Built-in heuristics and in-context optimization work very differently, and often disagree. The overall system may weight the decision of one component more than the other, or constrain optimization to some limited domain.

Agency tends to win in the long run, though.

f) Unity. The story I've told is very simplified. For one, nothing says that several "baby optimizers" can't show up in different places and start taking over the neural network at the same time. In the limit of infinite training, one of them will probably win or they'll merge, but taking that intermediary state into account may be important.

This is potentially the explanation for subagents.

4. The Structure of an Agent

Let's put all of this together. How does an agent (a unitary one, for simplicity) at an intermediate stage of development look?

First, we have a number of default heuristics coded into it. Human instincts are a prime example of how that looks/feels. Some of these heuristics look for certain conditions and fire when they detect them, "voting" for certain actions. These heuristics might be functions of observations directly (shutting your eyes in response to blinding light), or functions of the internal world-model (claustrophobia activates in response to your world-model saying you're entombed). Some of them are capable of trivial meta-learning: they update on novel information in known ways.

Then we have some sort of "virtual environment". This environment has access to the following:

The probability distributions over the world.
A set of mental primitives.
An abstraction algorithm.
A space for storing (new) learned abstractions, plus these abstractions.
The heuristics for translating raw input data into learned abstractions.
The heuristics for translating generated action-sequences into physical actions.
Some amount of "working memory" in which it's deployed.

Agency is defined over this virtual environment. Capabilities:

Arbitrarily put learned abstractions together to define specialized world-models.
Virtual re-training: Define a new environment, define a goal in this environment, then train a heuristic for good performance in that environment. Possibilities:
- Re-training a built-in heuristic, to whichever extent that's possible.
- Training up a wholly novel heuristic (computational shortcut, habit, learned instinct).
- Training up an RL heuristic for good performance in that environment (learning to "navigate" it).
- (All of this can also be happening in the on-line regime, where the world-model is updated to reflect reality in real-time. In that regime, learning can happen either over the virtual world-model, or over the observations directly (if it's something simple, like trigger-action patterns).)
The planning loop: Using an RL heuristic, generate a prospective action-sequence. Deploy that action-sequence in your world-model, evaluate its consequences. If they're unsatisfactory, generate a different action-sequence conditioned on the first one's failure. Repeat until you're satisfied.
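The planning loop described in the last item can be sketched as a generate-and-test search over an internal simulator. Everything concrete here is an assumption for illustration: the corridor world, the pit at position 2, the random proposer standing in for the RL heuristic, and the crude form of "conditioning on failure" (never re-testing a rejected sequence).

```python
import random

def world_model(position: int, actions: tuple) -> int:
    """Hypothetical internal simulator: a 1-D corridor with a pit at x=2."""
    for action in actions:
        position += {"step": 1, "jump": 2}[action]
        if position == 2:
            return -1  # landing in the pit ends the imagined rollout
    return position

def plan(start: int, goal: int, horizon: int = 4, budget: int = 1000, seed: int = 0):
    """Propose action-sequences, test them in the world-model, repeat."""
    rng = random.Random(seed)
    failed = set()  # memory of proposals that already proved unsatisfactory
    for _ in range(budget):
        # Proposer: generate a candidate action-sequence.
        candidate = tuple(rng.choice(["step", "jump"]) for _ in range(horizon))
        if candidate in failed:
            continue  # conditioned on past failures: don't retry them
        # Evaluator: deploy the candidate in the world-model, not in reality.
        if world_model(start, candidate) == goal:
            return candidate
        failed.add(candidate)
    return None  # no satisfactory plan found within the budget
```

Calling `plan(0, 5)` searches for a four-action sequence that reaches position 5 without landing in the pit; in this toy corridor only one such sequence exists, so the loop has to discard most of its proposals before finding it. A real agent's proposer would presumably be far better informed than uniform sampling, which is where the trained RL heuristic earns its keep.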

(The planning loop, I think, is the dynamic nostalgebraist described as "babbler and critic" here. The babbler is the RL heuristic, the critic is the part that runs the world-model and checks the outcome. As I've tried to demonstrate in section 2, you don't actually need any new tricks to have a DL model learn this "true abstract reasoning". Just push harder and scale more; it naturally grows out on its own.)

Let's talk goals. A unitary consequentialist has some central goal it pursues, defined as a mental primitive/mesa-objective. If the agent is running a world-model that doesn't have anything corresponding to that goal, it will re-define that goal in terms of abstractions present in the virtual environment, possibly in a flawed way.

On that note, consider an agent generating an action-sequence that includes itself training up RL heuristics for good performance on a number of arbitrary goals, to be executed as part of a wider plan to achieve its actual goal. This is what feels like "pursuing instrumental goals".

4A. How is Agency Different From Heuristics?

This picture may seem incomplete. How exactly does agency work? How can it "arbitrarily combine learned abstractions", and "build virtual world-models", and "train new heuristics" and so on? How do these capabilities arise out of heuristics? What are the fundamental pieces that both heuristics and agency are made of?

Well, I don't think agency is actually anything special. I think agency is just a mess of meta-heuristics.

This, in some sense, is a very trivial and content-less observation. But it does offer some conceptual handles.

"Agents" are models with the aforementioned virtual environments plus some heuristics on how to use them. Agents start with built-in heuristics for combining mental primitives into new abstractions, built-in heuristics for assembling useful world-models, built-in heuristics for translating a goal between environments, built-in heuristics for running world-models for a suite of useful purposes... and among them, the built-in heuristics for assembling a world-model that includes an agent-like-you doing all of this, which is used to train up better heuristics for all of that.

Yup. Recursive self-improvement. With humans though, it runs into two limits:

Working memory. We can't make it bigger, which means we have a limit on the complexity of the world-models we can run and the abstractions we can use.
Scope. We're defined over a very abstract environment. We can't tinker with the more basic features of our minds, like the abstraction algorithm, the mental primitives, the more fine-grained instincts, etc. Let alone change our hardware.

An AGI would not have these limitations.

5. Developmental Milestones

Intuitively, there are six major milestones here. Below is my rough attempt to review them.

(The ideal version of this section crisply outlines under what conditions they'd be passed, what "abilities" they uncover, how to detect them, and, most importantly, the existence of what internal structures in the agent they imply. The actual version is very far from ideal.)

Trivial meta-learning: The ability to modify heuristics at runtime in known ways.
- Greatly improves the ability to retain coherence in continuous segments.
- Requires internal state/transfer of information across forward passes.
- Few-shot learning, basically. GPT-3 et al. already pass it with flying colors.
Live-fire re-training: The ability to run an inner optimization loop to improve heuristics in unknown ways in response to new combinations of known patterns.
- Allows better generalization to new domains/stimuli.
- Only arises if there are too many possible ways it might need to adapt, for it to memorize them all.
- Difficult to distinguish from trivial meta-learning, because it's hard to say when there's "too many ways it might need to adapt". I guess one tell-tale sign might be if an RL model with frozen weights is seen experimenting with a new object?
Virtual re-training: The ability to do (part of) live-fire re-training by internal deliberation. Requires a "world-model", at least a rudimentary one.
- Allows advanced zero- and one-shot learning.
Abstraction: The ability to abstract over combinations of mental primitives.
- Makes it possible to greatly increase the complexity of world-models.
Assembling a functionally complete set of mental primitives.
- Allows generality.
The planning loop: The ability to build situational world-models, and develop plans by running counterfactuals on them.
- Makes it possible to perform well in environments where a small change in the actions taken may lead to high variance in outcomes, where you need a very-fine-tuned action sequence that integrates all information about the current scenario.
- Due to the "heuristics vs. planning" conflict dynamics, I'm not sure there's a discrete point where it becomes noticeable. No-fire-alarm fully applies.

Past this point, there are no (agency-relevant) qualitative improvements. The system just incrementally increases the optimizer's working memory, gives it more weight, widens its domain, improves the abstraction algorithm, etc.

6. Takeaways

There are some elements of this story that feel relatively plausible to me:

The two introductory claims, "universality is necessary for efficient general performance" and "inner optimization is necessary for universality".
The business with mental primitives: the existence of a built-in set, brains using an (imperfect) abstraction algorithm to chunk them together, the problems with translating goals between environments that causes...
- It's potentially confirmed by some cognitive-science results. There's a book, Metaphors We Live By, which seems to confirm 1) the existence of mental primitives in humans, 2) that we define all of our other concepts in their terms, and 3) that the same concept can be defined via different primitives, with noticeable differences in what assumptions will carry over to it.
- However, I haven't finished the book yet, and don't know if it contradicts any of my points (or is nonsense in general).
"The planning loop gradually takes over the model it originated in".
- In particular, it's an answer to this question.
- Also, note how it happened with humans. The planning loop not just maneuvered the system into an off-distribution environment, but created that environment (society, technology).
Tying the agentic ability ("intelligence") to the cleanness of chunking and the size of working memory.
- This draws some nice parallels to g factor, though that's not a particularly surprising/hard-to-draw connection.
The breakdown of agency into meta-heuristics, and the broad picture section 4 paints.

Overall, I find it satisfying that I'm able to tell an incremental story of agency development at all, and see a few interesting research questions it uncovered.

I'm not satisfied with the general "feel" of this post, though; it feels like vague hand-waving. Ideally, much of this would be mathematically grounded, especially "mental primitives", "the planning loop", and the conditions for passing milestones.

Still, I hope it'll be of use to others even in this state, directly or as a creativity-stimulator.

I have a couple of follow-up posts coming up, exploring the implications of all of this for human agency and values, and how it might be useful on the road to solving alignment.

^[1] Not an uncontroversial claim, but it gets some support in light of the rest of my model. Basically, agency/the planning loop is implemented at a higher level of abstraction than the raw data-from-noise pattern-recognition. It requires some pre-defined mental objects to work with.
^[2] Assume that it has internal state/can message its future instances, unlike the CoinRun agent I'd previously discussed.
^[3] Which isn't to say it'd actually modify its frozen weights; it'd just pass on a message to its future forward passes containing the summary of the relevant policy changes.
^[4] In practice, of course, the abilities are part of the environment, so this process starts with live-fire re-training in my example, not only after the world-models appear. But I've found it difficult to describe in terms of abilities.
^[5] In practice, the low bar of functional completeness would likely be significantly overshot; there'd be a lot of "redundant" mental primitives.
