Two problems with the following post: 1) Nobody lives by an explicit utility function, so very little of this line of thinking is immediately useful. 2) I don't explore how an agent might deal with flaws in its input or output mechanisms at all; I just assume those aspects don't have flaws.


Let's say you're an agent with a utility function. You take a configuration of variables ("the target system") as an input, and output a number based on how well the configuration of the target system satisfies some mathematically specified set of values.

Like many agents with utility functions, you're programmed to try and maximize your utility function's output, which you do by altering the target system so it scores better by the utility function's rubric. Among such agents you happen to be lucky: You have all the algorithms and resources you need to satisfy this goal as thoroughly as physically possible. From the standpoint of the target system, you take less than a single time-step to:

  1. Read and record all relevant information about the target system,
  2. Compute which possible configuration of its variables would get the best score from your utility function, and
  3. Overwrite the target system with that new, optimal configuration.

Any agent with these three features has what I call the Ideal Agent Architecture, because it's the most reliable way an agent could fulfill an arbitrary utility function given an arbitrary target system. For convenience, the rest of this post will refer to these three types of ability as:

  1. Input, or the ability to read and record from the target system;
  2. Compute, or the ability to manipulate data internally; and
  3. Output, or the ability to write to the target system.
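
To pin down the shape of the thing, here's a minimal Python sketch of the Ideal Agent Architecture. Everything in it (`read_full_state`, `write_full_state`, `utility`, `enumerate_configurations`) is a hypothetical stand-in rather than something a real agent would have:

```python
# A toy sketch of the Ideal Agent Architecture: read everything,
# compute the best configuration, overwrite the target system.
# All names below are hypothetical placeholders.

def ideal_agent_step(target_system, utility, enumerate_configurations):
    # 1. Input: read and record the target system's full state.
    current_state = target_system.read_full_state()

    # 2. Compute: score every configuration reachable from that state
    #    and pick the one the utility function likes best.
    best_config = max(enumerate_configurations(current_state), key=utility)

    # 3. Output: overwrite the target system with that optimal configuration.
    target_system.write_full_state(best_config)
    return best_config
```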

Now, I think we can learn a lot about mind design in general if we ask ourselves a series of questions about how to approximate the Ideal Agent Architecture given various constraints. For example, how could we best approximate the IAA if our agent was bottlenecked on output, meaning it had no single action for re-writing the full target system into its utility-optimal configuration?[1]

Assuming you have no advance knowledge of the target system, your best bet is still to employ your finite yet ever-sufficient compute to run three more general-purpose algorithms, to help you spend your limited outputs wisely:

  1. An available action tracker: The ability to list all your possible actions on the target system;
  2. A consequence modeler: The ability to accurately predict the effects any of your actions or strings of actions would have on the target system; and
  3. A decision making algorithm: The ability to elect the action or string of actions which would result in the most utility-optimal configuration of the target system.
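
To make the division of labor concrete, here's a rough Python sketch of those three algorithms working together. The `actions`, `simulate`, and `utility` arguments are hypothetical placeholders, with `simulate` standing in for the consequence modeler:

```python
from itertools import product

def choose_action_sequence(state, actions, simulate, utility, max_len=3):
    best_plan, best_score = None, float("-inf")
    # Available action tracker: enumerate every action string up to max_len.
    for length in range(1, max_len + 1):
        for plan in product(actions, repeat=length):
            # Consequence modeler: predict the resulting configuration.
            predicted_state = simulate(state, plan)
            # Decision-making algorithm: keep the plan whose predicted
            # configuration scores best under the utility function.
            score = utility(predicted_state)
            if score > best_score:
                best_plan, best_score = plan, score
    return best_plan
```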

So, those are the three algorithms we should add to the IAA if we want it to deal well with limited output. Now let's take the next step: How could you best modify the IAA to deal with limited input, meaning the agent could only observe a fraction of the target system at a time?

Like before, what you need to do is tack on more clever compute-based algorithms, this time to extract as much insight from your limited observations as possible. You should make the three following additions:

  1. A Solomonoff inductor: After each observation, run through all possible computer programs,[2] and take note whenever you find one whose outputs contain matches for all your observations so far, as well as how long that program is. Each of these programs then becomes a hypothesis about the guiding laws of the target system; any program that's longer than the shortest one you found gets probabilistically discounted, proportional to its extra length.
  2. A Solomonoff-integrated consequence modeler: After each run of the Solomonoff inductor, predict the effects each of your possible actions or strings of actions would have on each Solomonoff-provided variant of the target system.
  3. An expected value calculator: After each round of consequence modeling, multiply the utility-score of the world you ended up with by the probability you assigned to the starting world. Store the result in a list with all the other probability-weighted scores that action has yielded; once you've done this for all possible actions on all possible worlds, elect the action whose scores yield the highest sum.
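
Here's a toy Python version of that loop. Since actually enumerating all programs isn't physically possible, the sketch simply takes a pre-supplied list of `(world_model, weight)` pairs as the inductor's output; `predict` is a hypothetical per-hypothesis consequence modeler:

```python
def expected_value_choice(state, actions, hypotheses, predict, utility):
    # `hypotheses` is a list of (world_model, weight) pairs, standing in
    # for the output of the Solomonoff inductor above.
    total_weight = sum(weight for _, weight in hypotheses)
    best_action, best_ev = None, float("-inf")
    for action in actions:
        expected = 0.0
        for world_model, weight in hypotheses:
            # Consequence modeler: predict this action's outcome under
            # this particular world-model.
            outcome = predict(world_model, state, action)
            # Weight that outcome's utility by the (normalized)
            # probability assigned to the world-model that produced it.
            expected += (weight / total_weight) * utility(outcome)
        if expected > best_ev:
            best_action, best_ev = action, expected
    return best_action
```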

So, those are the algorithms we'd use to approximate the Ideal Agent Architecture when bottlenecked on both input and output. For reference, let's run through all of the algorithms our constrained agent is running so far:

  1. An observation recorder (input)
  2. Various intermediary algorithms (compute)
    1. A utility function
    2. A Solomonoff inductor
    3. An available action tracker
    4. A consequence modeler
    5. An expected value calculator
  3. An action mechanism (output)

This brings us to the fun part: What if you were also bottlenecked on compute, the resource you're using to run all those fancy intermediary algorithms in the first place?

Unlike the previous problems, the solution to this one actually varies with how constrained your compute is; and unfortunately, at any particular level of compute, nobody seems to have found an obviously optimal solution. However, we can at least isolate some promising regions of design-space, including at what I'll call high, moderate, and low levels of compute (this measure being relative to the size and speed of the target system).

When dealing with high ratios of agent compute : target system compute, you actually have the option to just reuse most of our previous agent's architecture. The only part you have to replace is the Solomonoff inductor: swap it for something else that assigns "probabilities" (arbitrary numbers between zero and one) to "world-models" (arbitrary computer programs), but which doesn't run indefinitely.[3]

On first pass, my recommendation for doing this efficiently was what I called Finitist Solomonoff Induction, which is where an agent looks through only a finite number of programs, and dynamically adjusts the exact number either up or down based on whether history suggests it will have the time and memory needed to finish working with all those programs in a single time-step. On second pass, I noticed that it might be helpful to generalize this principle, and use past performance to dynamically tighten or loosen some other self-imposed constraints, such as allowing oneself two time-steps to finish running one's algorithms, or deciding that your model program only needs to explain most of your observations so far. (Such things would impair your agent, but they wouldn't break it.)
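
As a rough illustration of the first idea, here's a Python sketch of a budgeted inductor that grows or shrinks how many candidate programs it checks, depending on whether the previous round finished within its time limit. `enumerate_programs` and `explains` are hypothetical placeholders for the program search and the observation-matching check:

```python
import time

class FinitistInductor:
    """Checks only `budget` candidate programs per step, and adjusts that
    budget based on whether the last step fit within the time allotted."""

    def __init__(self, budget=1000, time_limit=0.1):
        self.budget = budget          # programs to check per time-step
        self.time_limit = time_limit  # seconds allowed per time-step

    def update(self, observations, enumerate_programs, explains):
        start = time.perf_counter()
        hypotheses = []
        for program in enumerate_programs(self.budget):
            if explains(program, observations):
                # Shorter programs get exponentially more weight.
                hypotheses.append((program, 2.0 ** -len(program)))
        elapsed = time.perf_counter() - start
        # Dynamically tighten or loosen the self-imposed constraint.
        if elapsed > self.time_limit:
            self.budget = max(1, self.budget // 2)
        else:
            self.budget = int(self.budget * 1.1) + 1
        return hypotheses
```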

There are two other potential avenues for increased efficiency I can think of. On one hand, it seems plausible that compression algorithms could be written to make the agent's actions and world-models less unwieldy to manipulate, without losing so much data or compute as to have negative expected utility. (In everyday language we call this abstraction.) On the other hand, maybe one could write an algorithm to help the agent prioritize looking at world-models or actions that are likely to be actually useful, rather than searching all possible actions and world-models by brute force. However, I have no idea how to implement either of these by hand.

One day I might try to suss out the details anyway, building variants on these agents and running them over small, artificial target systems, and tinkering with them to optimize performance. However, for EMH reasons, I expect it's extremely hard to actually build an agent with comparable-to-human intelligence via this method; instead, the AI industry seems to have collectively thrown its hands up and decided to defer most of the optimization work to a more hands-off architecture: machine learning.

At first glance, the algorithms that machine learning runs look totally arcane; here's a summary if you're not familiar[4]. However, the remarkable thing is that in the course of running their opaque, intractable algorithms, ML-based agents will tend to emergently develop analogues to all of the traits of a utility-based agent anyway. The analogies play out like this:

  1. If an ML agent gets reinforcement based on how well its outputs fulfill a utility function, then as the agent is trained on a broader and broader variety of situations, it will tend toward acting as though it has that utility function, even if it doesn't contain or obey an explicit internal representation thereof.[5]
  2. If the agent's reinforcement function incentivizes correct predictions about the world, this tends to emergently produce an analogue to a Solomonoff inductor; for instance, LLMs predict the next token, and they end up with vaguely probabilistic world-models as a result. (Arguably, humans do the same thing with incoming sensory experience.)
  3. Running ML will tend to instill an agent with shards, or contextually activated decision-making influences. Shards analogize to an available action tracker, insofar as they also cause the agent to consider possible actions, albeit particular ones in particular situations.
  4. The process by which shards' action recommendations get weighed against each other analogizes to the expected value calculator; shards which have garnered more reinforcement in the past, or which have been contextually activated more strongly, will tend to win out over other shards, causing viable decisions to be made.

(The consequence modeler is best analogized to something at the intersection of shard networks and predictive processing, which I don't think we have an established buzzword for. Maybe someone should coin one?)

In addition to these approximations of our compute-unbounded agent, though, an ML agent will also develop ever-better approximations of the efficiency suggestions I made for utility maximizers that are bounded on compute: Standards adjustment, abstraction, and culling of world-space/action-space. Regarding the third of these, for instance, the nature of shard structures results in agents tending only to consider actions which have been historically useful in a given situation. (ML agents also seem good at emergently creating efficient abstractions, although I haven't been able to come up with any decent guesses for how ML systems actually do this.)

Now, to begin segueing into the last architecture I want to talk about, I'd like to mention that there's also a significant way in which ML agents are disadvantaged compared to pared-down utility maximizers, which I briefly brought up earlier: Machine learning-based AI does not by default contain an explicit internal representation of a utility function, nor would it necessarily adhere to one even if it had one in mind.

For the sake of this post, I'm going to brazenly ignore the question of how to get your agent to understand your true values at all, especially for those who can't use the outputs of an explicit utility function to give their ML systems consistent and comprehensive feedback. The more interesting problem for our purposes is this: The fact that your agent takes time to learn your values at all means it might fail to learn those values fast enough, and therefore make silly mistakes at critical junctures, losing out on lots of possible value.

(One can think of the above as two different facets of the alignment problem.)

Interestingly, the problem of ML agents learning too slowly is something evolution actually found a way to mitigate; although significant parts of the human mind seem to work on something akin to machine learning (see shard theory), other partitions work on the third and final architecture I'll be talking about in this post: the fixed stimulus response.

Fixed stimulus responses are basically when a subsystem's response to a predetermined input is to impose a corresponding, hardcoded influence on the agent's upcoming outputs. For example, consider human breastfeeding. Rather than having to luck into breastfeeding at least once before getting that behavior reinforced and learning to repeat it, human babies are born with an innate desire to suck on things that look and feel like breasts, possibly to keep any babies from simply failing to nourish themselves and starving to death. In natural language, we call fixed stimulus responses at this level of abstraction instincts; meanwhile, we call less abstract FSRs, such as retracting one's fingers from a pinprick, reflexes; and we call more abstract FSRs, such as becoming paralyzed with fear, emotional responses.
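
As a structural sketch (not a claim about how brains implement this), an FSR layer can be thought of as a hardcoded table from stimulus triggers to output influences, consulted before the learned machinery gets a say. The triggers and responses below are hypothetical placeholders:

```python
# Hypothetical triggers and responses; the only point is the shape:
# a fixed table from predetermined inputs to hardcoded outputs.
FIXED_RESPONSES = [
    (lambda stimulus: stimulus.get("sharp_pain", False), "retract_hand"),
    (lambda stimulus: stimulus.get("breast_like_object", False), "suck"),
]

def respond(stimulus):
    # Fire the first hardcoded response whose trigger matches, if any;
    # otherwise defer to whatever learned decision-making the agent has.
    for trigger, response in FIXED_RESPONSES:
        if trigger(stimulus):
            return response
    return None
```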

(I wonder if we can hardcode ML agents with FSRs somehow? Anthropic's new mech. interp. research might make giving them useful ones much easier. Maybe this is a good way to fight value drift...)

Anyway, because fixed stimulus responses can be lower on the hierarchy of abstraction than either utility-based agents or ML agents, I expect them to be relatively common among the outputs of optimization processes[6], such as natural selection or human computer programming. That doesn't mean they don't have more general use-cases though, like what I've described above.

--

I think that's all I have to say for now. On a completely different note, while I have your attention:

One thing I'm not actually confident in that I mentioned earlier is that Solomonoff induction is the best way of creating a calibrated predictive model of reality. Solomonoff is a formalization of one interpretation of Occam's Razor, where any added complexity makes a model strictly less probable, but my interpretation of Occam's Razor is that it was basically just a way of getting around the fact that there are infinite possible explanations for any given phenomenon; as a side effect, always going with the shortest explanation, rather than an explanation of another arbitrary length, also meant you didn't have to do as much work. (Though I suppose you could just choose some non-length-related criterion for that too, e.g. "whichever model I came up with first", which is even less work.) Can anyone sell me on why more complicated explanations of a phenomenon are less likely in general, especially in a way as mathematically constant as would be implied by the claim that Solomonoff is maximally well-calibrated?
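
For reference, the usual statement of the prior in question: a hypothesis-program p of length ℓ(p) bits gets weight proportional to 2^(-ℓ(p)), so every extra bit of description length halves its prior probability.

```latex
% The universal (Solomonoff) prior over hypothesis-programs p:
% each extra bit of program length \ell(p) halves the prior weight.
P(p) \;\propto\; 2^{-\ell(p)}
```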

  1. ^

    An action here is defined as anything the agent can do to the target system within a single time-step from the standpoint of the target system.

  2. ^

    Yes, I know there are infinite possible computer programs, and this violates our "this agent is physically possible" maxim from earlier. We'll fix this in a second when we put bounds on our compute.

  3. ^

    "Probabilities" aren't natural categories; in this context, it's helpful not to think of these numbers as "attempts to approximate the true probability", but rather as "attempts to find numbers between zero and one that are useful for the purposes of utility maximization". Something similar goes for world-models.

  4. ^

    For those who don't know: Basically, machine learning involves feeding an agent data, which is then subjected to lots of initially randomized math, which then tells the computer what outputs to produce; these outputs are then evaluated according to some loss function, and depending on how the evaluation goes, the computer changes the intermediate math it'll do next time, such that it would give a better response if it received that same input again. For more details on the basics, I recommend the 3blue1brown videos on neural networks.
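
    As a bare-bones illustration of that loop, here's a made-up, single-parameter example in Python; the "model", data, and learning rate are all invented for the sake of the sketch:

    ```python
    import random

    weight = random.random()                      # the "initially randomized math"
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (input, desired output) pairs
    learning_rate = 0.05

    for step in range(100):
        for x, target in data:
            output = weight * x                   # produce an output
            grad = 2 * (output - target) * x      # gradient of (output - target)^2 w.r.t. weight
            weight -= learning_rate * grad        # change the intermediate math

    print(weight)  # ends up near 2.0, the weight the loss function rewards
    ```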

  5. ^

    That said, ML agents plausibly could develop explicit, internally represented sets of maximizer-like desires; I describe one such scenario in my post Shard Therapy, Prelude.

  6. ^

    "An optimization process is a collection of causal forces that have an outsized impact at the levels of abstraction we care about."
