World-Model Interpretability Is All We Need

Here's a place where I want one of those disagree buttons separate from the downvote button :P

Given a world model that contains a bunch of different ways of modeling the same microphysical state (splitting up the same world into different parts, with different saliency connections to each other, like the discussion of job vs. ethnicity and even moreso), there can be multiple copies that coarsely match some human-intuitive criteria for a concept, given different weights by the AI. There will also be ways of modeling the world that don't get represented much at all, and which ways get left out can depend on how you're training this AI (and a bit more subtly, how you're interpreting its parameters as a world model).

Especially because of that second part, finding good goals in an AI's world model isn't satisfactory if you're just training a fixed, arbitrary AI. Your process for finding good goals needs to interact with how the AI learns its model of the world in the first place. In which case, world-model interpretability is not all we need.

[-]Thane Ruthenis3yΩ360

I agree that the AI would only learn the abstraction layers it'd have a use for. But I wouldn't take it as far as you do. I agree that with "human values" specifically, the problem may be just that muddled, but not with the other nice targets (moral philosophy, corrigibility, DWIM); those should be more concrete.

The alternative would be a straight-up failure of the NAH, I think; your assertion that "abstractions can be on a continuum" seems directly at odds with it. Which isn't impossible, but this post is premised on the NAH working.

[-]davidad3yΩ4104

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.

[-]davidad3yΩ490

In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those points didn’t get enough emphasis and people focused more on route 1 there, which seems much more hopeless.

[-]wassname2y10

Ah, now it makes sense. I was wondering how world model interpretability leads to alignment rather than control. After all, I don't think you will get far controlling something smarter than you against its will. But alignment of value could scale with large gaps in intelligence.

In that 2nd phase, there are a few things you can do. E.g., the 2nd phase reward function could include world model concepts like "virtue", or you could modify the world model before training.

[-]tailcalled3y82

I'm very optimistic about the feasibility of creating world-models with interpretable pointers to "objects". Things like chairs. In fact, my optimism is sufficiently strong that I tend to take such world-models for granted when thinking of how to achieve alignment. And furthermore I expect interpretable world models to be a necessary condition for alignment.

However, I'm very pessimistic about the feasibility of getting abstract things like "human values" and similar for free. Even complicated high-dimensional things like "humans", especially when meant to include e.g. uploads, are things I am not so optimistic about the feasibility of (especially once you consider certain challenges at the margins). It just doesn't seem like the methods that can be used to create world models have anything that would robustly capture such abstract things.

[-]Noosphere893y10

However, I'm very pessimistic about the feasibility of getting abstract things like "human values" and similar for free. Even complicated high-dimensional things like "humans", especially when meant to include e.g. uploads, are things I am not so optimistic about the feasibility of (especially once you consider certain challenges at the margins). It just doesn't seem like the methods that can be used to create world models have anything that would robustly capture such abstract things.

I'm confused about why you think this if you're very optimistic on getting interpretable pointers/world models to things. What makes values or abstract concepts different, exactly?

[-]tailcalled3y20

The most feasible concept of values learning that I've seen has been inverse reinforcement learning, but even that concept seems way too underdetermined to be sufficient for learning values. Whereas for simple objects it seems like there are lots of seemingly-sufficient ideas on the table, just waiting until the data gets good enough.
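One simple face of that underdetermination can be shown with a toy example: several quite different reward functions can induce exactly the same optimal behavior, so watching the behavior alone cannot tell them apart. The sketch below is only an illustration under assumed conditions. The 4-state chain MDP, the discount factor, and all function names are hypothetical and do not come from the thread, and this covers only the reward-ambiguity part of the problem, not noisy or irrational demonstrators.

```python
# Toy illustration: a hypothetical 4-state chain MDP in which several
# different reward functions produce the same optimal policy.

GAMMA = 0.9
N_STATES = 4
ACTIONS = (-1, +1)  # move left or right along the chain


def step(state, action):
    """Deterministic transition; the ends of the chain act as walls."""
    return min(max(state + action, 0), N_STATES - 1)


def optimal_policy(reward, iters=200):
    """Value iteration; reward[s] is paid on entering state s."""
    values = [0.0] * N_STATES
    for _ in range(iters):
        values = [
            max(reward[step(s, a)] + GAMMA * values[step(s, a)] for a in ACTIONS)
            for s in range(N_STATES)
        ]
    return [
        max(ACTIONS, key=lambda a: reward[step(s, a)] + GAMMA * values[step(s, a)])
        for s in range(N_STATES)
    ]


reward_a = [0.0, 0.0, 0.0, 1.0]            # only the rightmost state pays
reward_b = [5 * r + 2 for r in reward_a]   # positive affine transform of reward_a
reward_c = [0.0, 1.0, 2.0, 3.0]            # a differently shaped "go right" reward

policies = [optimal_policy(r) for r in (reward_a, reward_b, reward_c)]
print(policies)
```

All three reward functions yield the policy "always move right", so an observer who sees only that behavior has no way to decide which of them (if any) the agent is actually optimizing.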

[-]ryan_greenblatt3yΩ4810

As an established case for tractability, we have the natural abstraction hypothesis. According to it, efficient abstractions are a feature of the territory, not the map (at least to a certain significant extent). Thus, we should expect different AI models to converge towards the same concepts, which also would make sense to us. Either because we're already using them (if the AI is trained on a domain we understand well), or because they'd be the same abstractions we'd arrive at ourselves (if it's a novel domain).

Even believing in a relatively strong version of the natural abstractions hypothesis doesn't (on its own) imply that we should be able to understand all concepts the AI uses. Just the ones which:

have natural abstractions
that the AI faithfully learns, as opposed to devoting insufficient capacity to reach the natural abstraction
and humans can understand these natural abstractions

These three properties seem reasonably likely in practice for some common stuff like 'trees' or 'dogs'.

[-]the gears to ascension3y6-1

maybe-tangentially, 3d structured world models seem very interesting to me. eg, here's a talk I had open and popped over to LW to share: https://www.youtube.com/watch?v=QffGi9XUt1M

[-]Roger Dearnaley3y56

I'm not very scared of any AGI that isn't capable of being a scientist — it seems unlikely to be able to go FOOM. In order to do that, it needs to:

have multiple world models at the same time that disagree, and reason under uncertainty across them
do approximate Bayesian updates on their probability
plan conservatively under uncertainty, i.e., have broken the Optimizer's Curse
creatively come up with new hypotheses, i.e. create new candidate world models
devise and carry out low-cost/risk experiments to distinguish between world models

I think it's going to be hard to do all of these things well if its world models aren't fairly modular and separable from the rest of its mental architecture.
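The first two requirements in the list above (several disagreeing world models, plus approximate Bayesian updates on their probabilities) can be sketched in a few lines. This is a deliberately tiny illustration, not a claim about how an AGI would implement it. The coin-flip "world models", their names, and the observation sequence are all hypothetical.

```python
# Toy sketch: three hypothetical "world models" of a coin, each predicting a
# different probability of heads. Names and numbers are illustrative only.

def bayes_update(posterior, likelihoods):
    """Reweight each model by how well it predicted the observation."""
    unnormalized = {m: posterior[m] * likelihoods[m] for m in posterior}
    total = sum(unnormalized.values())
    return {m: w / total for m, w in unnormalized.items()}


models = {"fair": 0.5, "biased_heads": 0.8, "biased_tails": 0.2}  # P(heads)
posterior = {name: 1 / len(models) for name in models}            # uniform prior

observations = ["H", "H", "T", "H", "H", "H"]
for obs in observations:
    likelihoods = {
        name: p if obs == "H" else 1 - p for name, p in models.items()
    }
    posterior = bayes_update(posterior, likelihoods)

# Rather than committing to the single most probable model, a cautious
# planner can predict with the posterior-weighted mixture of all of them.
best = max(posterior, key=posterior.get)
mixture_p_heads = sum(posterior[name] * models[name] for name in models)
print(best, round(mixture_p_heads, 3))
```

Keeping the models as separate entries in a dictionary is what makes the update trivial here, which is a small-scale version of the modularity point: the weights can be changed, and models added or dropped, without touching anything else.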

One possibility that I find plausible as a path to AGI is if we design something like a Language Model Cognitive Architecture (LMCA) along the lines of AutoGPT, and require that its world model actually be some explicit combination of human natural language, mathematical equations, and executable code that might be fairly interpretable to humans. Then the only portions of its world model that are very hard to inspect are those embedded in the LLM component.

[-]Thane Ruthenis3y40

In order to [be a scientist], it needs to:

Yeah, that's where my current thinking is at as well. I wouldn't term it as having "multiple world models" — rather, as entertaining multiple possible candidates for the structure of some region of its world-model — but yes, I think we can say a lot about the convergent shape of world-models by reasoning from the idea that they need to be easy to adapt and recompose based on new evidence.

One possibility that I find plausible as a path to AGI is if we design something like a Language Model Cognitive Architecture (LMCA) along the lines of AutoGPT

I've also had this idea, as a steelman of the whole "externalized reasoning oversight" agenda: to prompt an LLM to generate a semantic world-model, with the LLM itself just playing the role of the planning process over it. However, I expect it wouldn't work as intended, for two reasons:

Inasmuch as it's successful, the world-model is unlikely to stay naively-human-interpretable. It'd drift towards alien wordings, concepts, connections. And even if we force it to look human-interpretable, steganography is convergent, and this sort of setup opens up many more dimensions in which to sneak in messages than standard chains-of-thought. And if we manage to defeat steganography as well, I then expect a WM to be forced to look like terabytes upon terabytes of complexly-interconnected text, each plan-making query on it generating mountains of data, perhaps too much to reasonably sort out. Tying-in to...
It'll probably be too computationally intensive to work at all. Humans explicitly running generally-intelligent queries on their world-models takes a lot of time already, compared to the speed at which our instincts work. If each step of a query required a whole LLM forward-pass, instead of the minimal function required for it? I expect it'd require orders of magnitude more compute than Earth is going to have in the near-term.

And these two points aren't independent: the more human-interpretable we'd force the WM to look, the more wasteful and impractical it'd be.

[-]Brendon_Wong2y10

One possibility that I find plausible as a path to AGI is if we design something like a Language Model Cognitive Architecture (LMCA) along the lines of AutoGPT, and require that its world model actually be some explicit combination of human natural language, mathematical equations, and executable code that might be fairly interpretable to humans. Then the only portions of its world model that are very hard to inspect are those embedded in the LLM component.

Cool! I am working on something that is fairly similar (with a bunch of additional safety considerations). I don't go too deeply into the architecture in my article, but would be curious what you think!

[-]Gordon Seidoh Worley3yΩ120

Reading this post I think it insufficiently addresses motivations, purpose, reward functions, etc. to make the bold claim that perfect world-model interpretability is sufficient for alignment. I think this because ontology is not the whole of action. Two agents with the same ontology and very different purposes would behave in very different ways.

Perhaps I'm being unfair, but I'm not convinced that you're not making the same mistake as when people claim any sufficiently intelligent AI would be naturally good.

[-]Thane Ruthenis3yΩ242

Two agents with the same ontology and very different purposes would behave in very different ways.

I don't understand this objection. I'm not making any claim isomorphic to "two agents with the same ontology would have the same goals". It sounds like maybe you think I'm arguing that if we can make the AI's world-model human-like, it would necessarily also be aligned? That's not my point at all.

The motivation is outlined at the start of 1A: I'm saying that if we can learn how to interpret arbitrary advanced world-models, we'd be able to more precisely "aim" our AGI at any target we want, or even manually engineer some structures over its cognition that would ensure the AGI's aligned/corrigible behavior.

[-]Gordon Seidoh Worley3yΩ120

Isn't a special case of aiming at any target we want the goals we would want it to have? And whatever goals we'd want it to have would be informed by our ontology? So what I'm saying is I think there's a case where the generality of your claim breaks down.

[-]Thane Ruthenis3y*Ω482

Goals are functions over the concepts in one's internal ontology, yes. But having a concept for something doesn't mean caring about it — your knowing what a "paperclip" is doesn't make you a paperclip-maximizer.

The idea here isn't to train an AI with the goals we want from scratch, it's to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, then use it as a foundation to train/build a different agent that would care about these concepts.
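The distinction between having a concept and caring about it can be made concrete with a toy sketch: two agents share one world model (so the same ontology and the same predictions), but each evaluates those predictions with a different goal function. Everything here is hypothetical and for illustration only, including the actions, the concept names, and the numbers.

```python
# Toy sketch: one shared world model, two different goals defined over it.
# All concepts, actions, and numbers are hypothetical.

def world_model(action):
    """Shared predictions: how each action affects two known concepts."""
    predictions = {
        "build_factory": {"paperclips": 100, "human_wellbeing": -5},
        "build_hospital": {"paperclips": 0, "human_wellbeing": 50},
    }
    return predictions[action]


def paperclip_goal(concepts):
    """Cares only about the 'paperclips' concept."""
    return concepts["paperclips"]


def wellbeing_goal(concepts):
    """Cares only about the 'human_wellbeing' concept."""
    return concepts["human_wellbeing"]


def choose(goal, actions=("build_factory", "build_hospital")):
    """Same model, same planner; only the goal function differs."""
    return max(actions, key=lambda a: goal(world_model(a)))


print(choose(paperclip_goal), choose(wellbeing_goal))
```

Both agents "know" what paperclips and wellbeing are, since both read the same predictions. What differs is only which function over those concepts gets plugged into the planner, which is the part the proposal above would set deliberately after interpreting the world-model.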

[-]Noosphere893yΩ230

I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.

It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.

This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.

The robust values hypothesis from DragonGod is worth looking at, too.

From the link below, I'll quote:

Consider the following hypothesis:

1. There exists a "broad basin of attraction" around a privileged subset of human values[1] (henceforth "ideal values")
    The larger the basin the more robust values are
    Example operationalisations[2] of "privileged subset" that gesture in the right direction:
        Minimal set that encompasses most of the informational content of "benevolent"/"universal"[3] human values
        The "minimal latents" of "benevolent"/"universal" human values
    Example operationalisations of "broad basin of attraction" that gesture in the right direction:
        A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in #3)
            Larger neighbourhood → larger basin
2. Said subset is a "naturalish" abstraction
    The more natural the abstraction, the more robust values are
    Example operationalisations of "naturalish abstraction":
        The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
            More privileged → more natural
        Most efficient representations of our universe contain a simple embedding of the subset
            Simpler embeddings → more natural
3. Points within this basin are suitable targets for optimisation
    The stronger the optimisation pressure applied for which the target is still suitable, the more robust values are.
    Example operationalisations of "suitable targets for optimisation":
        Optimisation of this target is existentially safe[4]
        More strongly, we would be "happy" (were we fully informed) for the system to optimise for these points.

This is an important hypothesis, since if it has a non-trivial chance of being correct, then AI Alignment gets quite a bit easier. And given the shortening timelines, I think this is an important hypothesis to test.

Here's a link below for the robust values hypothesis:

https://www.lesswrong.com/posts/YoFLKyTJ7o4ApcKXR/disc-are-values-robust

[This comment is no longer endorsed by its author]

[-]Thane Ruthenis3yΩ386

Now this is admittedly very different from the thesis that value is complex and fragile.

I disagree. The fact that some concept is very complicated doesn't mean it won't be necessarily represented in any advanced AGI's ontology. Humans' psychology, or the specific tools necessary to build nanomachines, or the agent foundation theory necessary to design aligned successor agents, are all also "complex and fragile" concepts (in the sense that getting a small detail wrong would result in a grand failure of prediction/planning), but we can expect such concepts to be convergently learned.

Not that I necessarily expect "human values" specifically to actually be a natural abstraction; an indirect pointer at "moral philosophy"/DWIM/corrigibility seems much more plausible and much less complex.

[-]Noosphere893y30

Sorry for misrepresenting your views.

[-]Gordon Seidoh Worley3yΩ000

I think that the big claim the post relies on is that values are a natural abstraction, and the Natural Abstractions Hypothesis holds. Now this is admittedly very different from the thesis that value is complex and fragile.
It is not that AI would naturally learn human values, but that it's relatively easy for us to point at human values/Do What I Mean/Corrigibility, and that they are natural abstractions.
This is not a claim that is satisfied by default, but is a claim that would be relatively easy to satisfy if true.

If this is the case, my concern seems yet more warranted, as this is hoping we won't suffer a false positive alignment scheme that looks like it could work but won't. Given the high cost of getting things wrong, we should minimize false positive risks, which means not pursuing some ideas because the risk if they are wrong is too high.

[-]RogerDearnaley2y10

The issue might be that humans are too incoherent and philosophically confused for their "values" to stand for anything concrete — e. g., we almost certainly don't have concrete utility functions.

One of the most basic concepts in natural language processing is valence/sentiment extraction: "am I happy or sad about this?". This is a direct measurement of "how well does the situation conform to my human values?": what we'd want the model to optimize. Even tiny Natural Language Processing networks have clearly interpretable signals (neurons, activations, linear probes etc.) of valence/sentiment extraction. So this is really not hard to find: it stands out like a sore thumb as soon as you start analyzing human text. Humans are adaptation-executing agents that try to optimize a complex mess of things, and "how optimal is this, and why?" is one of the main things that we talk/complain about all the time. Whether this system is in places incoherent or Dutch-bookable, and so doesn't match the theoretical requirements for a utility function, is a separate question (and humans have numerous perceptual biases and often-unhelpful mental heuristics so the answer is almost certainly "yes"), but the basic signal is really easy to find.
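For readers unfamiliar with the term, a linear probe is just a linear classifier trained on a network's internal activations to test whether some property is linearly readable from them. The sketch below shows the technique only. The "activations" are synthetic vectors with a planted valence direction, not outputs of any real NLP model, and every name and constant is an assumption made for illustration.

```python
import math
import random

# Toy sketch of a linear probe. The "activations" are synthetic: a planted
# valence direction plus Gaussian noise. Nothing here comes from a real model.

rng = random.Random(0)
DIM = 4
VALENCE_DIRECTION = [0.6, -0.8, 0.0, 0.0]  # hypothetical unit vector


def fake_activation(label):
    """Unit Gaussian noise plus a +/-2 shift along the valence direction."""
    sign = 1.0 if label == 1 else -1.0
    return [rng.gauss(0.0, 1.0) + 2.0 * sign * d for d in VALENCE_DIRECTION]


def make_dataset(n):
    labels = [rng.randint(0, 1) for _ in range(n)]
    return [(fake_activation(y), y) for y in labels]


def sigmoid(z):
    # Written in two branches so exp() never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, epochs=200, lr=0.5):
    """Logistic regression fitted by full-batch gradient descent."""
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, y in data:
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias


def accuracy(weights, bias, data):
    hits = sum(
        (sum(w * xi for w, xi in zip(weights, x)) + bias > 0) == (y == 1)
        for x, y in data
    )
    return hits / len(data)


train_set, held_out = make_dataset(400), make_dataset(400)
weights, bias = train_probe(train_set)
held_out_accuracy = accuracy(weights, bias, held_out)

# How closely the learned probe direction matches the planted one.
norm = math.sqrt(sum(w * w for w in weights))
cosine = sum(w * d for w, d in zip(weights, VALENCE_DIRECTION)) / norm
print(held_out_accuracy, cosine)
```

On real activations, the analogous check is whether held-out accuracy is high; on this synthetic data the probe also recovers the planted direction, which is only possible to verify here because the direction was put there by construction.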

^

That is, in worlds that mostly agree with the models of Eliezer Yudkowsky/Nate Soares/John Wentworth, on which we need to get the AGI exactly right to survive.

^

In theory, there should be a "buffer zone" of capability, between an AGI smart enough to model itself, and an AGI smart enough to hack through interpretability tools (e. g., humans are self-reflective, but not smart enough to do that).

But "is self-reflective" is also not a binary. An AGI's self-model can be more or less right. On the lower capability level, it'll probably be very flawed, therefore not very useful. On the flipside, if it's very close to reality, the AGI is likely to be smart enough that reading its mind is dangerous.

We may easily misjudge that, too. An AGI that achieved self-awareness is likely already on the cusp of its sharp left turn, past which it'd be unsafe to interpret. Depending on how sharp the turn is, that "buffer zone" may be passed in the blink of an eye, easily missed.

^

Runtime-editability also addresses another concern: that the inferences the AGI makes at runtime would be encoded differently from the knowledge hard-coded into its parameters. But since both types of knowledge would be used as inputs into the same algorithms (the planner, the shards), there's probably no reason to expect much mutation by default, due to the need for backwards compatibility.

(To be clear, once the AGI undergoes the sharp left turn/goes FOOM, and starts designing successor agents or directly modifying itself or just becomes incomprehensibly superintelligent, then this'll obviously stop applying. But if we haven't aligned it by then, we're dead either way, so that's irrelevant.)

1. Introduction

1A. Why Aim For This?

1B. Is It A Realistic Goal?

2. Would World-Models Look Like We Imagine?

2A. Are World-Models Necessary?

2B. How Are World-Models Useful?

2C. Are World-Models Unitary?

2D. Are World-Models Modular?

3. World-Model Structure

3A. Major Sub-Modules

3B. Abstractions As Basic Units

3C. Higher-Level Organization

3D. Laziness

4. Research Directions