Finding Goals in the World Model

JamesH; Thomas Larsen

Someone more versed in this line of research clue me in please: Conditional on us having developed the kind of deep understanding of neural networks and their training implicit in having "agentometers" and "operator recognition programs" and being able to point to specific representations of stuff in the AGIs' "world model" at all, why would we expect picking out the part of the model that corresponds to human preferences specifically to be hard and in need of precise mathematical treatment like this?

An agentometer is presumably a thing that finds stuff that looks like (some mathematically precise operationalisation of) bundles-of-preferences-that-are-being-optimised-on. If you have that, can't you just look through the set of things like this in the AI's world model that's active when it's, say, talking to the operator, or looking at footage of the operator on camera, or anything else that probably requires thinking about the operator in some fashion, and point at the bundle of preferences that gets lit up by that?

Is the fear here that the AI may eventually stop thinking of the operator as a bundle-of-preferences-that-are-optimised-on, i.e. an "agent" at all, in favour of some galaxy brained superior representation only a superintelligence would come up with? Then I'd imagine your agentometer would stop working too, since it'd no longer recognise that representation as belonging to something agentic. So the formula for finding the operator utility function, which relies on the operator being in the set of stuff with high g your agentometer found, wouldn't work anymore either.

It kind of seems to me like all the secret sauce is in the agentometer part here. If that part works at all, to the point where it can even spit out complete agent policies for you to run and modify, like your formula seems to demand, it's hard for me to see why it wouldn't just be able to point you to the agent's preferences directly as well. The Great Eldritch Powers required for that seem, if anything, lesser to me.

Jeremy Gillen (3y)

I see this proposal as reducing the level of deep understanding of neural networks that would be required to have an "agentometer".

If we had a way of iterating over every "computation" in the world model, then in principle, we could use the definition of intelligence above to measure the intelligence of each computation, and filter out all the low intelligence ones. I think this covers most of the work required to identify the operator.
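The filtering step described here could be sketched roughly as follows. This is a minimal illustration, not the post's method: it assumes the utility function is already given (the proposal infers it jointly), that each "computation" can be run as a policy, and it estimates $g$ by sampling from the reference prior $ξ$. The names `estimate_g`, `filter_by_intelligence`, `expected_utility` and `sample_reference_policy` are hypothetical, and the add-one smoothing is an addition to keep the estimate finite.

```python
import math
import random


def estimate_g(candidate, expected_utility, sample_reference_policy,
               n_samples=10_000, seed=0):
    """Monte Carlo estimate of g = -log2 P_{pi' ~ xi}[EU(candidate) <= EU(pi')].

    expected_utility: hypothetical callable, policy -> expected utility.
    sample_reference_policy: hypothetical callable, rng -> policy drawn from xi.
    """
    rng = random.Random(seed)
    target = expected_utility(candidate)
    # Count reference policies that do at least as well as the candidate.
    hits = sum(
        expected_utility(sample_reference_policy(rng)) >= target
        for _ in range(n_samples)
    )
    # Add-one smoothing (an assumption) avoids log(0) when no sample beats the candidate.
    return -math.log2((hits + 1) / (n_samples + 1))


def filter_by_intelligence(candidates, expected_utility,
                           sample_reference_policy, threshold):
    """Keep only the candidates whose estimated g reaches the threshold."""
    return [
        c for c in candidates
        if estimate_g(c, expected_utility, sample_reference_policy) >= threshold
    ]


# Toy usage: a "policy" is just a number in [0, 1] and its utility is itself.
kept = filter_by_intelligence(
    [0.1, 0.5, 0.99],
    expected_utility=lambda policy: policy,
    sample_reference_policy=lambda rng: rng.random(),
    threshold=3.0,
)
```

In the toy usage only the candidate that beats roughly 99% of random reference policies survives the filter. The hard part named in the next paragraph, enumerating the candidates in the first place, is not addressed by this sketch.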

Working out how to iterate over every computation in the world model is the difficult part. We could try iterating over subnetworks of the world model, but it's not clear this would work. Maybe iterate over pairs of regions in activation space? Of course these are not practical spaces to search over, but once we know the right type signature to search for, we can probably speed up the search by developing heuristic guided methods.

An agentometer is presumably a thing that finds stuff that looks like (some mathematically precise operationalisation of) bundles-of-preferences-that-are-being-optimised-on. If you have that, can't you just look through the set of things like this in the AI's world model that's active when it's, say, talking to the operator, or looking at footage of the operator on camera, or anything else that probably requires thinking about the operator in some fashion, and point at the bundle of preferences that gets lit up by that?

Yeah this is approximately how I think the "operator identification" would work.

Is the fear here that the AI may eventually stop thinking of the operator as a bundle-of-preferences-that-are-optimised-on, i.e. an "agent" at all, in favour of some galaxy brained superior representation only a superintelligence would come up with?

Yeah this is one of the fears. The point of the intelligence-measuring equation for $g$ is that it is supposed to work even on galaxy-brained world model ontologies. It works only by measuring the competence of a computation at achieving goals, not by inspecting the structure of the computation for "agentyness".

it can even spit out complete agent policies for you to run and modify

These can be computations that aren't very agenty, or don't match an agent at all, or only match part of an agent, so the part that spits out potential policies doesn't have to be very good. The $g$ computation is used to find, among these, the ones that best match an agent.

Charlie Steiner (3y)

Nice post! You did an especially good job explaining equations - or at least, good enough for me to get what meant what :P

I also strongly agree with the claim that we should be thinking about aligning model-based reinforcement learning (or at least sorta-reinforcement-learning) agents.

If you read Reducing Goodhart you probably already know the rest of my take, but maybe I should write a simple post that just explains this one thing: we should model humans how they want to be modeled. Locating models of human-like objects in the world model that score highly according to agentiness and explanatory power is a great place to start your imagination, but it doesn't model humans how they want to be modeled^[1].

Modeling humans how they want to be modeled requires feeding information about inferred human preferences back into the model-of-humans selection process itself. It also means that there can be an important distinction between the AI's most accurate model of the world (best for planning), and the AI's most human-centric model of the world (best for conforming to human opinions about how our preferences should be modeled)^[2].

^[1] The criteria that pick out models in this post (and its relatives, including PreDCA or the example in post 2 of Reducing Goodhart) are simple and tractable, but they're not what I would pick if I had lots of time to interact with this AI, look at how it ends up modeling me, and build tools to help me target it at something I think really "gets me."

^[2] Or you could frame this a different way and figure out how to have the human-preferred structure "live inside" the most accurate model of the world.

Jeremy Gillen (3y)

I love the idea of modeling humans how they want to be modeled. I think of this as like a fuzzy pointer to human values, that sharpens itself? But I'm confused about how to implement this, or formalize this process.

I hadn't seen your sequence, I'm a couple of posts in, it's great so far. Does it go into formalizing the process you describe?

Charlie Steiner (3y)

Does it go into formalizing the process you describe?

Nope, sorry! I'm still at the stage of understanding where formalizing it would mean leaving in a bunch of parameters that hide hard problems (e.g. "a measure of how agent-shaped a model augmented with a rule for extracting preferences is" or "a function that compares plans of action in different ontologies"), so I didn't really bother.

But if you're around Lightcone, hit me up and we can chat and write things on whiteboards.

Ramana Kumar (3y)

Given a utility function ...

I might have missed it, but where do you get this utility function from ultimately? It looked like you were trying to simultaneously infer the policy and utility function of the operator. This sounds like it might run afoul of Armstrong's work, which shows that you can't be sure to split out the $U$ correctly from the policy when doing IRL (with potentially imperfect agents, like humans) without more assumptions than a simplicity prior.

Jeremy Gillen (3y)

That's correct that it simultaneously infers the policy and utility function. To avoid the underspecification problem, it uses a prior that favors higher intelligence agents. This is similar to taking assumptions 1 and 2a from http://proceedings.mlr.press/v97/shah19a/shah19a.pdf

simon (3y)

This architecture is quite similar (but much more fleshed out/detailed) to what I've been proposing in, e.g., "one possible alternative" towards the end of this comment.

One potential issue is that the AI needs to be choosing actions based on

a) the true utility function of humans which to the best of its current knowledge is represented by distribution X, but conditional on some other information A would be represented by some other distribution Y

and not based simply on

b) utility function distribution X which represents its current best knowledge of the true utility function of humans.

For example, imagine if the AI is deciding whether to wirehead all humans, so thoroughly that they will all remain in a comatose state until death. The AI thinks that there is a 60% chance humans would want this, and that conditional on wanting it they will want it by 100 utilons and conditional on not wanting it they will disapprove by -100 utilons.

The AI considers three actions:

1) wirehead the humans, expected value assessed as 20 utilons under either (a) or (b)

2) not wirehead the humans, expected value assessed as -20 utilons under either (a) or (b)

3) ask the humans at zero cost what they want, and do what they want, expected value assessed as (under a) 60 utilons or (under b) 4 utilons.

So, under (a), the AI will correctly ask humans what they want, and under (b) will self-modify to disable the utiliscope (or otherwise prevent the utility function from being updated) and then wirehead all humans in the face of our pleas to the contrary.
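The arithmetic in this example can be checked with a small sketch. The payoff table is an assumption that the comment does not state: not wireheading is taken to be worth the negative of wireheading, which reproduces the 20, -20 and 4 figures. Under that assumption asking is worth 100 under (a); the figure of 60 corresponds instead to valuing non-wireheading at 0 when humans do not want to be wireheaded. Either way, asking wins under (a) and wireheading wins under (b). The names `P_WANT`, `UTILITY` and `"abstain"` are hypothetical.

```python
from fractions import Fraction

# The AI's current credence that humans want to be wireheaded.
P_WANT = Fraction(3, 5)

# Hypothetical payoff table: UTILITY[action][humans_want_it].
# Assumption: not wireheading ("abstain") is worth the negative of wireheading.
UTILITY = {
    "wirehead": {True: 100, False: -100},
    "abstain": {True: -100, False: 100},
}


def expected_under_x(action):
    """Expected utility of an action under the fixed current distribution X."""
    payoff = UTILITY[action]
    return P_WANT * payoff[True] + (1 - P_WANT) * payoff[False]


def ask_value_a():
    """(a): after asking, the chosen action is scored by the revealed true utility."""
    return (P_WANT * UTILITY["wirehead"][True]
            + (1 - P_WANT) * UTILITY["abstain"][False])


def ask_value_b():
    """(b): after asking, the chosen action is still scored by the stale X."""
    return (P_WANT * expected_under_x("wirehead")
            + (1 - P_WANT) * expected_under_x("abstain"))


def best_action(ask_value):
    """Pick the highest-valued option among wirehead, abstain and ask."""
    options = {
        "wirehead": expected_under_x("wirehead"),
        "abstain": expected_under_x("abstain"),
        "ask": ask_value,
    }
    return max(options, key=options.get)
```

Calling `best_action(ask_value_a())` selects asking, while `best_action(ask_value_b())` selects wireheading, which is the failure mode described above.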

I think that your architecture with the utility function separated from the world-model results in (b), unless there's some mechanism to achieve (a) I'm not aware of. In my version, I am hoping that making the utiliscope be a pointer to a concept in the world model identified using interpretability tools, rather than extracting a utility function which is handled separately, might be able to achieve (a).

I also agree with Charlie Steiner's comment.

^{^}

Simulacra refers to a framing of the behaviour of large language models, in which they use simulated agentic behaviour with simulated goals to maximize predictive accuracy (e.g. when prompting GPT-3 to write a poem in the style of Robert Frost, it deploys a 'Robert Frost simulacrum').

^{^}

See, for comparison, Steve Byrnes' model and Yann LeCun's 'A Path Towards Autonomous Machine Intelligence'.

^{^}

In practice, we probably won't want to do this, and instead have a set of operators to point at, and then follow some aggregation of their utility functions.

^{^}

Again we should emphasize that this is heavily influenced by Vanessa’s IBP and PreDCA, and then changed by us into a different, much less rigorous proposal which lacks a lot of the important formalization Vanessa uses. We think this is worth doing because we are better able to analyze, understand and criticize our version.

^{^}

Approximately equivalent to taking assumptions 1 and 2a from On the Feasibility of Learning Biases for Reward Inference.

^{^}

From Definition 1.6 in IBP.

^{^}

One way of justifying these equations is to think of the whole "Utiliscope" as doing inference on a Bayes Net.

If we assume that we have a distribution over operator policies, and we know that the operator is more likely to have higher intelligence $g$, we can infer a distribution over the utility function of the operator. This can be thought of as inference on the following Bayesian network:

[Figure: Bayesian network with arrows $WM \to \pi$, $\pi \to g$, $U \to g$ and $g \to g'$, matching the factorization below.]

Each node is a random variable, where red means we have evidence about this variable, and blue is for latent variables. Each arrow is a functional relationship that we have prior knowledge about. We need to infer $U$, given our information about $WM$ and $g'$.

$g(\pi, U)$ is a deterministic function of $\pi$ and $U$, and is the intelligence of the agent. We assume that we have a noisy observation of $g$, called $g'$.

Writing out the equations for inferring the distribution over $U$, we get:

$P(U \mid WM, g') \propto \int_{g} \int_{\pi} P(g \mid \pi, U)\, P(g' \mid g)\, P(U)\, P(\pi \mid WM)\, d\pi\, dg$

$P(U \mid WM, g') \propto \int_{\pi} P(g' \mid \pi, U)\, P(U)\, P(\pi \mid WM)\, d\pi$
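The step between these two lines can be spelled out as follows. This is a sketch that relies only on the stated assumption that $g$ is a deterministic function of $\pi$ and $U$, so that $P(g \mid \pi, U)$ is a point mass; the Dirac delta notation is an addition for illustration. Integrating out $g$ then gives:

$P(g' \mid \pi, U) = \int_{g} P(g' \mid g)\, \delta(g - g(\pi, U))\, dg = P(g' \mid g = g(\pi, U))$

Substituting this into the double integral removes the integral over $g$ and leaves only the integral over $\pi$.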

If we assume that the likelihood of $g'$ is exponential in the level of intelligence (strong prior knowledge that the operator has high $g$), we get:

$P(g' \mid \pi, U) \propto 2^{-\log P_{\pi' \sim \xi}[E_{e \sim \zeta}(U(\pi, e)) \leq E_{e \sim \zeta}(U(\pi', e))]}$

If, in addition, the prior over utility functions is a Solomonoff prior, $P(U) \propto 2^{-K(U)}$, this becomes:

$\begin{aligned} P(U \mid WM, g') &\propto E_{\pi \mid WM}\left[2^{-\log P_{\pi' \sim \xi}[E_{e \sim \zeta}(U(\pi, e)) \leq E_{e \sim \zeta}(U(\pi', e))]} \times 2^{-K(U)}\right] \\ &\propto E_{\pi \mid WM}\left[\frac{2^{-K(U)}}{P_{\pi' \sim \xi}[E_{e \sim \zeta}(U(\pi, e)) \leq E_{e \sim \zeta}(U(\pi', e))]}\right] \end{aligned}$

Realistically we would probably want the likelihood $P(g' \mid \pi, U)$ to increase quickly with $g$ at first, but decrease after a certain point, to encode the knowledge that humans are intelligent agents, but not perfect agents.
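The final expression for the posterior over $U$ can be evaluated directly on a toy problem. The sketch below assumes finite sets of candidate utility functions and policies, a table of precomputed expected utilities, a reference prior $ξ$ with full support, and an integer bit count standing in for the uncomputable $K(U)$. The function name and all inputs are hypothetical.

```python
def posterior_over_utilities(expected_utility, complexity, policy_given_wm, xi):
    """Normalised P(U | WM, g') on a finite toy problem.

    expected_utility[u][p]: E_{e ~ zeta} U_u(pi_p, e), precomputed (assumption).
    complexity[u]:          stand-in for K(U_u) in bits (K itself is uncomputable).
    policy_given_wm[p]:     P(pi_p | WM).
    xi[p]:                  reference prior over policies, assumed to have full support.
    """
    weights = []
    for u, row in enumerate(expected_utility):
        total = 0.0
        for p, prob_p in enumerate(policy_given_wm):
            # P_{pi' ~ xi}[EU(pi_p) <= EU(pi')]: chance that a reference policy
            # does at least as well as pi_p under utility function U_u.
            tail = sum(xi[q] for q in range(len(row)) if row[p] <= row[q])
            total += prob_p * 2.0 ** (-complexity[u]) / tail
        weights.append(total)
    normaliser = sum(weights)
    return [w / normaliser for w in weights]


# Toy usage: the operator is known to follow policy 0. Under U_0 that policy is
# the best of three; under U_1 it is the worst. Both have equal complexity.
posterior = posterior_over_utilities(
    expected_utility=[[3, 1, 2], [1, 2, 3]],
    complexity=[1, 1],
    policy_given_wm=[1.0, 0.0, 0.0],
    xi=[1 / 3, 1 / 3, 1 / 3],
)
```

In the toy usage the utility function under which the operator's policy looks competent receives about three times the posterior weight of the other one. This version has no penalty for implausibly high $g$, so it does not implement the rise-then-fall likelihood suggested above.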

^{^}

Specifically, the agent's beliefs take the form of a homogeneous ultra-contribution over $Γ \times Φ$, i.e. a joint belief over computations and physics. The bridge transform essentially just checks whether the agent's knowledge is consistent with a computation outputting something else (given a specific hypothesis about the universe). If it is consistent for the computation to output something else, then the computation didn't run; otherwise it did. See Infra-Bayesian Physicalism: a formal theory of naturalized induction for more details.

^{^}

The grain-of-truth problem shouldn't apply here, because we are not modeling the operator as optimal. This is what the intelligence measure gets around. There is still an open problem regarding the choice of priors.

^{^}

The Legg-Hutter measure of intelligence (p23) does this, and it appears to be Vanessa's intention.

Produced As Part Of The SERI ML Alignment Theory Scholars Program 2022 Under John Wentworth
