Relative Abstracted Agency

Audere

Note: This post was pasted without much editing or work put into formatting. I may come back and make it more presentable at a later date, but the concepts should still hold.

Relative abstracted agency is a framework for considering the extent to which a modeler models a target as an agent, what factors lead a modeler to model a target as an agent, and what sort of models have the nature of being agent-models. The relative abstracted agency of a target relative to a reasonably efficient modeler is based on the most effective strategies that the modeler uses to model the target, which exist on a spectrum from terminalizing strategies to simulating strategies.

Terminalizing
- Relies most heavily on models of the target’s goals or utility function, and weighs future outcomes heavily based on how they might score in the target’s utility function. The modeler might ask, “what outcomes rank highly in the target’s utility function?” and use its approximations of the answer to predict future outcomes without predicting much in the way of particular actions, strategies, or paths that the target might take, except possibly to come up with lower bounds on rankings of outcomes in the target’s preferences.
- Examples: the efficient market hypothesis, a savvy amateur predicting the outcome of a chess game against Stockfish or Magnus Carlsen, humans predicting what the world might look like 1 year after a superintelligent paperclip maximizer is created
Projecting
- Combines models of the target’s goals or utility function with the modeler’s ability to find actions or strategies to achieve goals. The modeler might ask, “what would be the best actions or strategies to take if I had the target’s goals and resources?” and use its approximations of the answer to predict and model the target.
- Examples: a competent chess player playing chess against a level of Stockfish that they have a small but non-negligible chance of defeating, large portions of game theory and decision theory, AlphaZero training itself using self-play
Psychologizing
- Combines models of more specific aspects of the target’s processes, tendencies, flaws, weaknesses, or strengths with the modeler’s ability to find actions or strategies to achieve goals. The modeler might ask, “what would be the best actions or strategies to take if I had the target’s goals or motivations?”, and also ask “in what ways are the target’s strategies and tendencies likely to differ from mine and those of an ideal agent, and how can these help me predict the target?” It is at this level that concepts like yomi are most in play.
- Examples: rhetorical persuasion, Garry Kasparov offering material to steer Deep Blue into a positional game based on a model of Deep Blue as good at calculation and tactics but weak at strategy and positioning, humans modeling LLMs.
Mechanizing
- Relies most heavily on models or approximations of processes specific to the target rather than using much of the modeler’s own ability to find actions or strategies to achieve goals. Does not particularly model the target as an agent, but as a process similar to an unagentic machine; considers predicting the target as an engineering problem rather than a psychological, strategic, or game-theoretic one.
- Examples: humans modeling the behavior of single-celled organisms, humans modeling the behavior of jellyfish, humans setting up a fungus to solve a maze
Simulating
- Simulates the target with high fidelity rather than approximating it or using heuristics. What Omega, Solomonoff Induction, and AIXI do to everything.
- Examples: Omega, Solomonoff Induction, AIXI

Factors that affect relative abstracted agency of a target:

Complexity of target and quantity of available information about the target. AIXI can get away with simulating everything even with a mere single bit of information about a target because it has infinite computing power. In practice, any modeler that fits in the universe likely needs a significant fraction of the bits of complexity of a nontrivial target to model it well using simulating strategies. Having less information about a target tends to, but doesn’t always, make agent-abstracted strategies more effective than less agent-abstracted ones. For example, a modeler may best predict a target using mechanizing strategies until it has information suggesting that the target acts agentically enough to be better modeled using psychologizing or projecting strategies.
Predictive or strategic ability of the modeler relative to the target. Targets with overwhelming predictive superiority over a modeler are usually best modeled using terminalizing strategies, whereas targets that a modeler has an overwhelming predictive advantage over are usually best modeled using mechanizing or simulating strategies.

Relevance of this framework to AI alignment:

We would prefer that AI agents not model humans using high-fidelity simulating or mechanizing strategies, both because such computations could create moral patients, and because an AI using simulating or mechanizing strategies to model humans has potential to “hack” us or manipulate us with overwhelming efficacy.
Regarding inner alignment, subcomponents of an agent which the agent can model well using simulating or mechanizing strategies are unlikely to become dangerously inner-misaligned in a way that the agent cannot prevent or fix. It may be possible to construct an agent structured such that each superagent has some kind of verifiable “RAA superiority” over its subagents, such that it is impossible or unlikely for subagents to become dangerously misaligned with respect to their superagents.
Regarding embedded agency, an obstacle to amending theoretical agents like AIXI to act more like properly embedded agents is that they are heavily reliant on simulating strategies, but cannot use these strategies to simulate themselves. If we can formalize strategies beyond simulating, this could provide an angle for better formalizations of self-modeling. Human self-modeling tends to occur around the psychologizing level.

Additional thoughts:
(draws on parts of https://carado.moe/predca.html, particularly Kosoy’s model of agenthood)

Suppose there is a correct hypothesis for the world in the form of a non-halting Turing program. Hereafter I’ll simply refer to this as “the world.”

Consider a set of bits of the program at one point in its execution which I will call the target. This set of bits can also be interpreted as a Cartesian boundary around an agent executing some policy in Vanessa’s framework. We would like to evaluate the degree to which the target is usefully approximated as an agent, relative to some agent that (instrumentally or terminally) attempts to make accurate predictions under computational constraints using partial information about the world, which we will call the modeler.

Vanessa Kosoy’s framework outlines a way of evaluating the probability that an agent G has a utility function U, taking into account both the agent’s efficacy at satisfying U and the complexity of U. Consider the utility function with respect to which the target is most Kosoy-agentic. Hereafter I’ll simply refer to this as the target’s utility function.

Suppose the modeler can choose between gaining 1 bit of information of its choice about the target’s physical state in the world, and gaining 1 bit of information of its choice about the target’s utility function. (Effectively, the modeler can choose between obtaining an accurate answer to a binary question about the target’s physical state, and obtaining an accurate answer to a binary question about the target’s utility function.) The modeler, as an agent, should assign some positive amount of utility to each option relative to a null option of gaining no additional information. Let’s call the amount of utility it assigns to the former option SIM and the amount it assigns to the latter option TERM.

A measure of the relative abstracted agency of the target, relative to the modeler, is given by TERM/SIM. Small values indicate that the target has little relative abstracted agency, while large values indicate that the target has significant abstracted agency. The RAA of a rock relative to myself should be less than one, as I expect information about its physical state to be more useful to me than information about its most likely utility function. On the other hand, the RAA of an artificial superintelligence relative to myself should be greater than one, as I expect information about its utility function to be more useful to me than information about its physical state.
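The TERM/SIM ratio can be illustrated with a small toy calculation. The sketch below is not part of the framework as stated above; everything in it is an assumption made for illustration: the target has one of four goals and one of four physical states (both uniformly distributed), it reaches its goal outcome with probability `competence` and otherwise ends up wherever its physical state mechanically puts it, and the modeler's utility is taken to be its probability of guessing the outcome correctly. The names `value_of_bit` and `raa` are hypothetical.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = range(4)  # assumed toy world: 4 outcomes, 4 goals, 4 physical states


def outcome_dist(goal, state, competence):
    """P(outcome | goal, state) in the toy model: the goal outcome happens with
    probability `competence`, otherwise the state-determined outcome happens."""
    dist = {o: Fraction(0) for o in OUTCOMES}
    dist[goal] += competence
    dist[state] += 1 - competence
    return dist


def best_accuracy(worlds, competence):
    """Joint probability of 'world is in `worlds`' and 'modeler's best guess is right'.
    Each (goal, state) world has prior probability 1/16."""
    prior = Fraction(1, 16)
    totals = {o: Fraction(0) for o in OUTCOMES}
    for goal, state in worlds:
        for o, p in outcome_dist(goal, state, competence).items():
            totals[o] += prior * p
    return max(totals.values())


def value_of_bit(about, competence):
    """Gain in prediction accuracy from the best yes/no question about the
    target's goal (about='goal') or its physical state (about='state')."""
    all_worlds = list(product(OUTCOMES, OUTCOMES))
    baseline = best_accuracy(all_worlds, competence)
    index = 0 if about == "goal" else 1
    best = baseline
    for mask in range(16):  # every way of splitting the 4 values into yes/no
        yes = [w for w in all_worlds if (mask >> w[index]) & 1]
        no = [w for w in all_worlds if not (mask >> w[index]) & 1]
        best = max(best, best_accuracy(yes, competence) + best_accuracy(no, competence))
    return best - baseline


def raa(competence):
    """TERM / SIM for the toy target (undefined when SIM is zero, i.e. competence == 1)."""
    return value_of_bit("goal", competence) / value_of_bit("state", competence)


if __name__ == "__main__":
    print(raa(Fraction(9, 10)))  # highly competent target
    print(raa(Fraction(1, 10)))  # mostly mechanical target
```

In this toy model TERM works out to competence/4 and SIM to (1 - competence)/4, so the ratio exceeds one exactly when the target achieves its goal more often than not. This matches the rock versus superintelligence intuition above, though only under the simplifying assumptions listed.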

I don't mind that the post was posted without much editing or work put into formatting, but I find it somewhat unfortunate that the post was probably written without any work put into figuring out what other people have written about the topic and what terminology they use.

Recommended reading:
- Daniel Dennett's Intentional stance
- Grokking the intentional stance
- Agents and device review

@Audere Thoughts on changing words to match previous ones?

@mods, if there were an alignmentforum tier for sketch-grade posts, this would belong there. It seems like there ought to be a level between lesswrong and alignmentforum, which is gently vetted, but specifically allows low quality posts.

A question came up: how do you formalize this exactly? How do you separate questions about physical state from questions about utility functions? Perhaps, Audere says, you could bound the relative complexity of the utility function representation perspective vs the simulating perspective?

Also, how do you deal with modeling smaller boundedly rational agents in actual formalism? I can recognize that psychologizing is the right perspective to model a cat who is failing to walk around a glass wall to get the food on the other side and is instead meowing sadly at the wall, but how do I formalize it? Seems like the Discovering Agents paper still has a lot to tell us about how to do this - https://arxiv.org/pdf/2208.08345.pdf

Still on the call - Audere was saying this builds on Kosoy's definition by trying to patch a hole; I am not quite keeping track of which thing is being patched

We were discussing this on a call and I was like "this is very interesting and more folks on LW should consider this perspective". It came up after a while of working through Discovering Agents, which is a very deep and precise read on causal models and takes a very specific perspective. The perspective in this post is an extension of:

Agents and Devices: A Relative Definition of Agency
According to Dennett, the same system may be described using a 'physical' (mechanical) explanatory stance, or using an 'intentional' (belief- and goal-based) explanatory stance. Humans tend to find the physical stance more helpful for certain systems, such as planets orbiting a star, and the intentional stance for others, such as living animals. We define a formal counterpart of physical and intentional stances within computational theory: a description of a system as either a device, or an agent, with the key difference being that 'devices' are directly described in terms of an input-output mapping, while 'agents' are described in terms of the function they optimise. Bayes' rule can then be applied to calculate the subjective probability of a system being a device or an agent, based only on its behaviour. We illustrate this using the trajectories of an object in a toy grid-world domain.
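For reference, the Bayes' rule step mentioned in the abstract has the following generic shape when "device" and "agent" are treated as the only two, mutually exclusive, hypotheses. The symbol b for the observed behaviour is notation assumed here; the paper defines its own likelihoods and priors for devices and agents, which are not reproduced.

```latex
P(\text{agent} \mid b)
  = \frac{P(b \mid \text{agent})\, P(\text{agent})}
         {P(b \mid \text{agent})\, P(\text{agent}) + P(b \mid \text{device})\, P(\text{device})}
```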

One of the key points that @Audere is arguing is that the amount of information one has about a target matters: one needs to know more and more about a target possible agent to do higher and higher levels of precise modeling. Very interesting. So a key concern we have is the threat from an agent that is able to do full simulation of other agents. If we could become unpredictable to potentially scary agents, we would be safe, but because we are made of mechanisms we cannot hide, we cannot stay unpredictable indefinitely.
