Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

[-]jacob_cannell2y132

I don't know who first said it, but the popular saying "Computer vision is the inverse of computer graphics" encompasses much of this viewpoint.

Computer graphics is the study/art of the approximation theory you mention and fairly well developed & understood in terms of how to best simulate worlds & observations in real-time from the perspective of an observer. But of course traditional graphics uses human-designed world models and algorithms.

Diffusion models provide a general framework for learning a generative model in the other direction - in part by inverting trained vision and noise models.

So naturally there is also diffusion planning which is an example of the symmetry you discuss: using general diffusion inference for planning. The graph dimensions end up being both space-time and abstraction level with the latter being more important: sensor inference moves up the abstraction/compression hierarchy, whereas planning/acting/generating moves down.

[-]Thane Ruthenis2y20

Yeah, I'd looked at computer graphics myself. I expect that field does have some generalizable lessons.

Great addition regarding diffusion planning.

[-]mfatt1yΩ5110

Just highlighting an overlap between the ideas expressed here and a stream that has recently been added to the MATS Summer 2024 Program.

This is not a direct extension of the work but something that shares some of intuitions and might help to formalise the ideas expressed in the post.

The grant proposal for the work is here. The proposal was submitted to Manifund (see here), where it was noticed by Ryan Kidd and subsequently added to the MATS program instead of receiving direct funding.

Do have a read and/or reach out if you're interested!

[-]quetzal_rainbow2y52

Note on the margins: I've never seen any analysis of the fact that embedded agent is made from the same matter with environment. When we are talking about abstract agents like AIXI, we have problem with the choice of UTM, which can yield an arbitrarily bad prior, but your environment sets restrictions on how actually your prior can be bad.

[-]RogerDearnaley2yΩ240

inverse problems tend to be difficult

Indeed, when cryptographers are trying to ensure that certain agents cannot do certain things, and other agents can, they often use trapdoor functions that are computationally impracticable for general agents to invert, but can be easily inverted by agents in possession of a specific secret.

I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap: it should be possible to interface a valid agent fundamentals theory neatly to the basics of cryptography.

I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)

I understanding that rigorously reexpressing philosophy in mathematics is non-trivial, but (as I'm sure you're aware) given currently plausible timelines, ~2030 seems pretty late for getting this figured out: we may well need to solve some rather specific and urgent practicalities by somewhere around then.

Can you tell me what is the hard part in formalizing the following:

Agent A (an AI) is less computationally limited than a set of agents through $H_{N}$ (humans). It models and can affect the world, itself, and the humans, using an efficient approximately Bayesian approach, and also models its own current remaining uncertainty due to insufficient knowledge (including due to not having access to the Universal prior since it is computationally bounded). It can plan both how to optimize the world for a specific goad while pessimizing with appropriate caution over its current uncertainty, and also how to prioritize using the scientific method to reduce its uncertainty. It understands (with some current uncertainty) what preference ordering the humans each have on future states of the world. It synthesizes all of these into a fairly good compromise (a problem extensively studied in economics and the theory of things like voting), then uses its superior computational capacity to optimize the world for this (with suitable minimizing caution over its remaining uncertainty) and also to reduce its uncertainty so it can optimize better.

Idealized Agents Are Approximate Causal Mirrors…
The literal formulation also runs into all sorts of infinite recursion paradoxes. What if the agent wants to model itself? What if the environment contains other agents? What if some of them are modeling this agent? And so on.

I recall reading a description by an early 20th century Asian-influenced-European-mystic of the image of a universe full of people being like a array of mirror-surfaced balls, each reflecting within it in miniature the entire rest of the array, including the reflections inside each of the other mirrored balls, recursively. (Though this image omits the agent modelling itself, it' not hard to extend it, say by adding some fuzz to the outside of each ball, and a reflection of that inside it,.)

[-]Thane Ruthenis2yΩ120

I don't think there's a great deal that cryptography can teach agent fundamentals, but I do think there's some overlap

Yup! Cryptography actually was the main thing I was thinking about there. And there's indeed some relation. For example, it appears that is because our universe's baseline "forward-pass functions" are just poorly suited for being composed into functions solving certain problems. The environment doesn't calculate those; all of those are in $P$ .

However, the inversion of the universe's forward passes can be NP-complete functions. Hence a lot of difficulties.

~2030 seems pretty late for getting this figured out: we may well need to solve some rather specific and urgent practicalities by somewhere around then

2030 is the target for having completed the "hire a horde of mathematicians and engineers and blow the problem wide open" step, to be clear. I don't expect the theoretical difficulties to take quite so long.

Can you tell me what is the hard part in formalizing the following:

Usually, the hard part is finding a way to connect abstract agency frameworks to reality. As in: here you have your framework, here's the Pile, now write some code to make them interface with each other.

Specifically in this case, the problems are:

an efficient approximately Bayesian approach

What approach specifically? The agent would need to take in the Pile, and regurgitate some efficient well-formatted hierarchical world-model over which it can do search. What's the algorithm for this?

It understands (with some current uncertainty) what preference ordering the humans each have

How do you make it not just understand that, but care about that? How do you interface with the world-model it learned, and point at what the humans care about?

[-]Noosphere892y*20

However, the inversion of the universe's forward passes can be NP-complete functions. Hence a lot of difficulties.

If were talking about cryptography specifically, we don't believe that the inversion of the universe's forward passes for cryptography is NP-complete, and if this was proved, this would collapse the polynomial hierarchy to the first level. The general view is that the polynomial hierarchy is likely to have an infinite amount of levels, ala Hilbert's hotel.

Yup! Cryptography actually was the main thing I was thinking about there. And there's indeed some relation. For example, it appears that NP≠P is because our universe's baseline "forward-pass functions" are just poorly suited for being composed into functions solving certain problems. The environment doesn't calculate those; all of those are in P.

A different story is that the following constraints potentially prevent us from solving NP-complete problems efficiently:

The first law of thermodynamics coming from time-symmetry of the universe's physical laws.
Light speed being finite, meaning there's only a finite amount of universe to build your computer.
Limits on memory and computational speed not letting us scale exponentially forever.
(Possibly) Time Travel and Quantum Gravity are inconsistent, or time travel/CTCs are impossible.

Edit: OTCs might also be impossible, where you can't travel in time but nevertheless have a wormhole, meaning wormholes might be impossible. .

[-]RogerDearnaley2yΩ110

However, the inversion of the universe's forward passes can be NP-complete functions.

Like a cryptographer, I'm not very concerned about worst-case complexity, only average-case complexity. We don't even generally need an exact inverse, normally just an approximation to some useful degree of accuracy. If I'm in a position to monitor and repeatedly apply corrections as I approach my goal, even fairly coarse approximations with some bounded error rate may well be enough. Some portions of the universe are pretty approximately-invertible in the average case using much lower computational resources than simulating the field-theoretical wave function of every fundamental particle. Others (for example non-linear systems after many Lyapunov times, carefully designed cryptosystems, or most chaotic cellular automata), less so. Animals including humans seem to be able to survive in the presence of a mixed situation where they can invert/steer some things but not others, basically by attempting to avoid situations where they need to do the impossible. AIs are going to face the same situation.

Hence a lot of difficulties.an efficient approximately Bayesian approach
What approach specifically? The agent would need to take in the Pile, and regurgitate some efficient well-formatted hierarchical world-model over which it can do search. What's the algorithm for this?

Basically every functional form of machine learning we know, including both SGD and in-context learning in sufficiently large LLMs, implements an approximate version of Bayesianism. I agree we need to engineer a specific implementation to build my proposal, but for mathematical analysis just the fact that it's a computationally-bounded approximation to Bayesianism takes us quite some way, until we need to analyze its limitations and inaccuracies.

It understands (with some current uncertainty) what preference ordering the humans each have
How do you make it not just understand that, but care about that? How do you interface with the world-model it learned, and point at what the humans care about?

I'm assuming a structure similar to a computationally-bounded version of AIXI, upgraded to do value learning rather than having a hard-coded utility function. It maintains and performs approximate Bayesian updates on an ensemble of theories about a) mappings from current world state + actions to distributions of future world states, and b) mappings from world states to something utility function-like for individual humans, plus an aggregate/compromise of these across all humans. It can apply the scientific method to reducing uncertainty on these both of these ensembles of theories, in a prioritized way, and its final goal is to meanwhile attempt to optimize the utility of the aggregate/compromise across all humans, in a suitably cautious/pessimizing way over uncertainties in a) and b). So like AIXI, it has an a explicit final goal slot by construction, and that goal slot has been pointed at value learning. You don't need to point at what humans care about in detail, that's part b) of its world model ensemble. You probably do need to point at a definition of what a human is, plus the fact that humans, as sentient biological organisms, are computational bounded agents who have preferences/goals (which your agent fundamentals program clearly could be helpful for, if Biology alone wasn't enough of a pointer).

Given access to an LLM, I don't believe finding a basically-unique best-fit mapping between the human linguistic world model encoded in the LLM and the AI's Bayesian ensemble of world models is a hard problem, so I don't consider something as basic as pointing at the biological species Homo sapiens is very hard. I'm actually very puzzled why (post GPT-3) anyone still considers the pointers problem to be a challenge: given two very large, very complex and easily queriable world models, there is clearly almost always (apart from statistically unlikely corner cases) going to be a functionally-unique solution to finding a closest fit between the two that makes as much as possible of one an approximate subset of the other. (And in those cases where there are a small number of plausible alternative fits, either globally or at least for small portions of the world-model networks, there should be a clear experimental way to distinguish between the alternative hypotheses, often just by asking some humans some questions.) This is basically just a problem in optimal approximate subset-isomorphism of labelled graphs (with an unknown label mapping), something that has excellent heuristic methods that work in the average case. (I expect the worst case is NP-complete, but we're not going to hit it.) Doing this between different generations of human scientific paradigms for the same subject matter is basically always trivial, other than for paradigms so primitive and mistaken as to have almost no valid content (even the Ancient Greek Earth-Air-Fire-Water model maps onto solid, gas, plasma, liquid: the four most common states of matter). There may of course be parts that don't fit together well due to mistakes on one side or the other, but the concepts "the species Homo sapiens" and "humans are evolved sentient animals, and thus computationally-bounded agents with preferences/goals" both seem to me to be rather unlikely to be one of them, given how genetically similar to each other we all are.

[-]Chris_Leong2yΩ240

One thing that would make this more symmetrical is if some errors in your world model are worse than others. This makes inference more like a utility function.

[-]Thane Ruthenis2yΩ120

Yup. I think this might route through utility as well, though. Observations are useful because they unlock bits of optimization, and bits related to different variables could unlock both different amounts of optimization capacity, and different amounts of goal-related optimization capacity. (It's not so bad to forget a single digit of someone's phone number; it's much worse if you forgot a single letter in the password to your password manager.)

[-]mattmacdermott2y*32

If we consider the relation between utility functions and probability distributions, it gets even more literal. An utility function over X could be viewed as a target probability distribution over X, and maximizing expected utility is equivalent to minimizing cross-entropy between this target distribution and the real distribution.

This view can be a bit misleading, since it makes it sound like EU-maxing is like minimising H(u,p): making the real distribution similar to the target distribution.

But actually it’s like minimising H(p,u): putting as much probability as possible on the mode of the target distribution.

(Although interestingly geometric EU-maximising is actually equivalent to minimising H(u,p)/making the real distribution similar to the target.)

EDIT: Last line is wrong, see below.

[-]Thane Ruthenis2y30

Although interestingly geometric EU-maximising is actually equivalent to minimising H(u,p)/making the real distribution similar to the target

Mind elaborating on that? I'd played around with geometric EU maximization, but haven't gotten a result this clean.

[-]mattmacdermott2y30

Sorry, on reflection I had that wrong.

When distributing probability over outcomes, both arithmetic and geometric maximisation want to put as much probability as possible on the highest payoff outcome. It's when distributing payoffs over outcomes (e.g. deciding what bets to make) that geometric maximisation wants to distribution-match them to your probabilities.

[-]Donald Hobson2yΩ120

You are making the structure of time into a fundamental part of your agent design, not a contingency of physics.

Let an aput be an input or an output. Let an policy be a subset of possible aputs. Some policies are physically valid.

Ie a policy must have the property that, for each input, there is a single output. If the computer is reversible, the policy must be a bijection from inputs to outputs. If the computer can create a contradiction internally, stopping the timeline, then a policy must be a map from inputs to at most one output.

If the agent is actually split into several pieces with lightspeed and bandwidth limits, then the policy mustn't use info it can't have.

But these physical details don't matter.

The agent has some set of physically valid policies, and it must pick one.

LESSWRONG
LW

LESSWRONG
LW

77

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

77

Ω 31

77

Ω 31

Input Side: Observations

Output Side: Actions

Both Sides: A Causal Mirror

Why Is This Useful?

Missing Piece: Approximation Theory