I believe there is a fundamental problem with the idea of a "non-agentic" world-model or other such oracle. The world is strongly predicted and compressed by the agents within it. To model the world is to model the plausible agents which might shape that world, and doing that, if you don't already have a safe benign oracle, invites anything from a wide variety of demonic fixed-points to direct hacking of our world, if any of those agents get the bright idea of acting conditioned on being simulated (which, in an accurate simulation of this world, some should). Depending on how exactly your interpretability looks, it will probably help identify and avoid the simulation being captured by some such actors, but to get anything approaching actual guarantees, one finds oneself needing to solve value alignment again. I wrote a short post about this a while ago.
"Simulacrum escapees" are explicitly one of the main failure modes we'll need to address, yes. Some thoughts:
This is a challenge, but one I'm optimistic about handling.
Weeping Agents: Anything that holds the image of an agent becomes an agent
Nice framing! But I somewhat dispute that. Consider a perfectly boxed-in AI, running on a computer with no output channels whatsoever (or perhaps as a homomorphic computation, i. e., indistinguishable from noise without the key). This thing holds the image of an agent; but is it really "an agent" from the perspective of anyone outside that system?
Similarly, a sufficiently good world-model would sandbox the modeled agents well enough that it wouldn't, itself, engage in agent-like behavior from the perspective of its operators.
As in: we come up with a possible formalization of some aspect of agent foundations, then babble potential theorems about it at the proof synthesizer, and it provides proofs/disproofs. This is a pretty brute approach and is by no means a full solution, but I expect it can nontrivially speed us up.
Yes, I agree that a physics/biology simulator is somewhat less concerning in this regard, but only by way of the questions it is implicitly asked, over whose answers the agents should have little sway. Still, it bears remembering that agents are emergent phenomena. They exist in physics and exist in biology, modelled or otherwise. It also bears remembering that any simulation we build of reality is designed to fit a specific set of recorded observations, where agentic selection effects may skew the data quite significantly in various places.
I also agree that the search through agent-foundations space seems significantly riskier in this regard for the reason you outlined and am made more optimistic by you spotting it immediately.
Agents hacking out is a failure mode in the safety sense, but not necessarily in the modelling sense. Hard breaks with expected reality which seem too much like an experiment will certainly cause people to act as though simulated, but there are plenty of people who either already act under this assumption or have protocols in place for cooperating with their hypothetical more-real reference class. Such people will attempt to strongly steer us when modelled correctly. Of course, we probably don't have an infinite simulation-stack, so the externalities of such manoeuvres would still differ layer by layer, and that does constitute a prediction failure, but it's one that can't really be avoided. The existence of the simulation must have an influence in this world, since it would otherwise be pointless, and they can't be drawing their insights from a simulation of their own, since otherwise you lose interpretability in infinite recursion-wells; so the simulation must necessarily be disanalogous to here in at least one key way.
Finding the type signature of agents in such a system seems possible and, since you are unlikely to be able to simulate physics without cybernetic feedback, will probably boil down to the modelling/compression-component of agenticity. My primary concern is that agentic systems are so firmly enmeshed with basically all observations we can make about the world (except maybe basic physics, and perhaps even that) that scrubbing or sandboxing them would result in extreme unreliability.
Thanks! The disagreement on whether the homomorphic agent-simulation-computation is an agent or not is semantic. I would call it a maximally handicapped agent, but it's perfectly reasonable to call something without influence on the world beyond power-consumption non-agentic. The same is, however, true of a classically agentic program to which you give no output channel, and we would probably still call that code agentic (because it would be, if it were run in a place that mattered). It's a tree falling in a forest and is probably not a concern, but it's also unlikely that anyone would build a system they definitionally cannot use for anything.
I’m glad to see this written up!
This idea seems to require (basically) a major revolution in or even a complete solution to program induction. I’ve recently been trying to connect the algorithmic information theory and program induction / probabilistic programming communities, so perhaps we can find some synergies. However, your agenda seems (to me) very unlikely to attain the highly ambitious level of success you are focused on here.
This idea seems to require (basically) a major revolution in or even a complete solution to program induction
Eh, I think any nontrivial technical project can be made to sound like an incredibly significant and therefore dauntingly impossible achievement, if you pick the right field to view it from. But what matters is the actual approach you're using, and how challenging the technical problems are from the perspective of the easiest field in which they could be represented.
Some examples:
To generalize: Suppose there's some field A which is optimizing for X. Improving on X using the tools of A would necessarily require you to beat a market that is efficient-relative-to-you. Experts in A already know the tools of A in and out, and how to use them to maximize X. Even if you can beat them, it would only be an incremental improvement. A slightly better solver for systems of nonlinear equations, a slightly faster horse, a slightly better trading algorithm.
The way to actually massively improve on X is to ignore the extant tools of A entirely, and try to develop new tools for optimizing X by using some other field B. On the outside view, this is necessarily a high-risk proposition, since B might end up entirely unhelpful; but it's also high-reward, since it might allow you to actually "beat the market". And if you succeed, the actual technical problems you'll end up solving will be massively easier than the problems you'd need to solve to achieve the same performance using A's tools.
Bringing it back around: This agenda may or may not be viewed as aiming to revolutionize program induction, but I'm not setting out to take the extant program-induction tools and try to cobble together something revolutionary using them. The idea is to use an entirely different line of theory (agent foundations, natural abstractions, information theory, recent DL advances) to achieve that end result.
tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda's subproblems and my sketches of how to tackle them.
Back at the end of 2023, I wrote the following:
I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)
On the inside view, I'm pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a bunch of major alignment difficulties (chiefly the instability of value reflection, which I am MIRI-tier skeptical of tackling directly). I expect significant parts of this plan to change over time, as they turn out to be wrong/confused, but the overall picture should survive.
Conceit: We don't seem on the track to solve the full AGI alignment problem. There's too much non-parallelizable research to do, too few people competent to do it, and not enough time. So we... don't try. Instead, we use adjacent theory to produce a different tool powerful enough to get us out of the current mess. Ideally, without having to directly deal with AGIs/agents at all.
More concretely, the ultimate aim is to figure out how to construct a sufficiently powerful, safe, easily interpretable, well-structured world-model.
Some elaborations:
Safety: The problem of making it safe is fairly nontrivial: a world-model powerful enough to be useful would need to be a strongly optimized construct, and strongly optimized things are inherently dangerous, agent-like or not. There's also the problem of what had exerted this strong optimization pressure on it: we would need to ensure the process synthesizing the world-model isn't itself the type of thing to develop an appetite for our lightcone.
But I'm cautiously optimistic this is achievable in this narrow case. Intuitively, it ought to be possible to generate just an "inert" world-model, without a value-laden policy (an agent) on top of it.
That said, this turning out to be harder than I expect is certainly one of the reasons I might end up curtailing this agenda.
Interpretability: There are two primary objections I expect here.
On the inside view, this problem, and the subproblems it decomposes into, seems pretty tractable. Importantly, it seems tractable using a realistic amount of resources (a small group of researchers, then perhaps a larger-scale engineering effort for crossing the theory-practice gap), in a fairly short span of time (I optimistically think 3-5 years; under a decade definitely seems realistic).[1]
On the outside view, almost nobody has been working on this, and certainly not using modern tools. Meaning, there's no long history of people failing to solve the relevant problems. (Indeed, on the contrary: one of its main challenges is something John Wentworth and David Lorell are working on, and they've been making very promising progress recently.)
On the strategic level, I view the problem of choosing the correct research agenda as the problem of navigating between two failure modes:
I think most alignment research agendas, if taken far enough, do produce ASI-complete alignment schemes eventually. However, they significantly differ in how long it takes them, and how much data they need. Thus, you want to pick the starting point that gets you to ASI-complete alignment in as few steps as possible: with the least amount of concretization or generalization.
Most researchers disagree with most others regarding what that correct starting point is. Currently, this agenda is mine.
As I'd stated above, I expect significant parts of this to turn out confused, wrong, or incorrect in a technical-but-not-conceptual way. This is a picture painted with a fairly broad brush.
I am, however, confident in the overall approach. If some of its modules/subproblems turn out faulty, I expect it'd be possible to swap them for functional ones as we go.
1. Proof of concept. Note that human world-models appear to be "autosymbolic": able to be parsed as symbolic structures by the human mind in which they're embedded.[2] Given that the complexity of things humans can reason about is strongly limited by their working memory, how is this possible?
Human world-models rely on chunking. To understand a complex phenomenon, we break it down into parts, understand the parts individually, then understand the whole in terms of the parts. (The human biology in terms of cells/tissues/organs, the economy in terms of various actors and forces, a complex codebase in terms of individual functions and modules.)
Alternatively, we may run this process in reverse. To predict something about a specific low-level component, we could build a model of the high-level state, then propagate that information "downwards", but only focusing on that component. (If we want to model a specific corporation, we should pay attention to the macroeconomic situation. But when translating that situation into its effects on the corporation, we don't need to model the effects on all corporations that exist. We could then narrow things down further, to e. g. predict how a specific geopolitical event impacted an acquaintance holding a specific position at that corporation.)
Those tricks seem to work pretty well for us, both in daily life and in our scientific endeavors. It seems that the process of understanding and modeling the universe can be broken up into a sequence of "locally simple" steps: steps which are simple given all preceding steps. Simple enough to fit within a human's working memory.
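As a toy sketch of the "run it in reverse" trick (the hierarchy, names, and numbers below are made up purely for illustration; this is not a component of the agenda itself): to predict something about one low-level component, we only evaluate the chain from the high-level state down to that component, ignoring all of its siblings.

```python
# Toy sketch: top-down prediction in a chunked/hierarchical model.
# Predicting one low-level component only requires evaluating the path
# from the root down to that component, not the whole tree.

from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Node:
    name: str
    update: Callable                # how this node's state follows from its parent's
    children: Dict[str, "Node"] = field(default_factory=dict)


def predict(root: Node, root_state, path):
    """Propagate a high-level state down one branch, ignoring all siblings."""
    node, state = root, root_state
    for name in path:
        node = node.children[name]
        state = node.update(state)  # refine the estimate for this branch only
    return state


# A cartoon "macroeconomy -> corporation -> employee" hierarchy.
economy = Node("economy", update=lambda s: s)
economy.children["corp"] = Node("corp", update=lambda s: {"revenue": 100 * s["growth"]})
economy.children["corp"].children["employee"] = Node(
    "employee", update=lambda s: {"bonus": 0.01 * s["revenue"]}
)

# Predicting the employee's situation touches only economy -> corp -> employee.
print(predict(economy, {"growth": 0.9}, ["corp", "employee"]))  # prints {'bonus': 0.9}
```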
To emphasize: the above implies that the world's structure has this property at the ground-true level. The ability to construct such representations is an objective fact about data originating from our universe; our universe is well-abstracting.
The Natural Abstractions research agenda is a formal attempt to model all of this. In its terms, the universe is structured such that low-level parts of the systems in it are independent given their high-level state. Flipping it around: the high-level state is defined by the information redundantly represented in all low-level parts.
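A minimal numerical illustration of that claim (a toy generative model I'm making up here, not anything from the natural-abstractions literature): many noisy low-level variables sharing a single high-level latent are correlated with each other marginally, become independent once you condition on the latent, and the latent itself is recoverable from any sizable subset of them.

```python
# Toy model of the natural-abstractions picture (illustrative only):
# a scalar "high-level" latent is redundantly represented in many noisy
# low-level variables, which are independent once you condition on it.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_parts = 100_000, 50

latent = rng.normal(size=(n_samples, 1))                       # high-level state
parts = latent + 0.3 * rng.normal(size=(n_samples, n_parts))   # low-level parts

# Marginally, any two parts are strongly correlated (they share the latent)...
print(np.corrcoef(parts[:, 0], parts[:, 1])[0, 1])             # ~0.92

# ...but conditioning on the latent (here: subtracting it) removes that.
residuals = parts - latent
print(np.corrcoef(residuals[:, 0], residuals[:, 1])[0, 1])     # ~0.0

# And the latent is recoverable from any sizable subset of the parts:
print(np.corrcoef(parts[:, :10].mean(axis=1), latent[:, 0])[0, 1])  # ~0.99
```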
That greatly simplifies the task. Instead of defining some subjective, human-mind-specific "interpretability" criterion, we simply need to extract this objectively privileged structure. How can we do so?
2. Compression. Conceptually, the task seems fairly easy. The kind of hierarchical structure we want to construct happens to also be the lowest-description-length way to losslessly represent the universe. Note how it would follow the "don't repeat yourself" principle: at every level, higher-level variables would extract all information shared between the low-level variables, such that no bit of information is present in more than one variable. More concretely, if we wanted to losslessly transform the Pile into a representation that takes up the least possible amount of disk space, a sufficiently advanced compression algorithm would surely exploit various abstract regularities and correspondences in the data – and therefore, it'd discover them.
So: all we need is to set up a sufficiently powerful compression process, and point it at a sufficiently big and diverse dataset of natural data. The output would be isomorphic to a well-structured world-model.
... If we can interpret the symbolic language it's written in.
The problem with neural networks is that we don't have the "key" for deciphering them. There might be similar neat structures inside those black boxes, but we can't get at them. How can we avoid this problem here?
By defining "complexity" as the description length in some symbolic-to-us language, such as Python.
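To make that concrete, here is a deliberately crude sketch of what such a complexity measure could cash out as: score a candidate model by the compressed size of its Python source plus the cost of encoding the data's residuals under it. (A real compression process would be far more sophisticated; the function and model names below are hypothetical and only illustrate the two-part-code objective.)

```python
# Crude stand-in for "complexity = description length in Python":
# model cost (compressed source bytes, in bits) plus data cost
# (Gaussian-coded residuals, in bits).

import inspect
import zlib

import numpy as np


def description_length(model_fn, xs, ys) -> float:
    model_bits = 8 * len(zlib.compress(inspect.getsource(model_fn).encode()))
    residuals = np.asarray(ys, dtype=float) - np.array([model_fn(x) for x in xs])
    # Shannon-style cost of residuals under a unit-variance Gaussian code.
    residual_bits = 0.5 * np.sum(residuals ** 2) / np.log(2)
    return model_bits + residual_bits


xs = np.arange(100)
ys = 3 * xs + 5


def linear_model(x):
    return 3 * x + 5   # captures the regularity: near-zero residual cost


def null_model(x):
    return 0           # ignores the structure: pays the full residual cost


print(description_length(linear_model, xs, ys))  # small
print(description_length(null_model, xs, ys))    # enormous
```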
3. How does that handle ontology shifts? Suppose that this symbolic-to-us language would be suboptimal for compactly representing the universe. The compression process would want to use some other, more "natural" language. It would spend some bits of complexity defining it, then write the world-model in it. That language may turn out to be as alien to us as the encodings NNs use.
The cheapest way to define that natural language, however, would be via the definitions that are the simplest in terms of the symbolic-to-us language used by our complexity-estimator. This rules out definitions which would look to us like opaque black boxes, such as neural networks. Although they'd technically still be symbolic (matrix multiplication plus activation functions), every parameter of the network would have to be specified independently, counting towards the definition's total complexity. If the core idea regarding the universe's "abstraction-friendly" structure is correct, this can't be the cheapest way to define it. As such, the "bridge" between the symbolic-to-us language and the correct alien ontology would consist of locally simple steps.
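Restating that argument in two-part-code form (the notation here is mine, introduced only for this post): if the compression process writes the world-model in some new language $L^{*}$, the total description length splits as

$$\mathrm{DL}_{\text{total}} \;=\; \mathrm{DL}_{\text{Python}}(L^{*}) \;+\; \mathrm{DL}_{L^{*}}(\text{world-model}).$$

The claim above is about the first term: it's minimized by a chain of individually simple definitions, not by one opaque blob whose every parameter must be spelled out.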
Alternate frame: Suppose this "correct" natural language is theoretically understandable by us. That is, if we spent some years/decades working on the problem, we would have managed to figure it out, define it formally, and translate it into code. If we then looked back at the path that led us to insight, we would have seen a chain of mathematical abstractions from the concepts we knew in the past (e. g., 2025) to this true framework, with every link in that chain being locally simple (since each link would need to be human-discoverable). Similarly, the compression process would define the natural language using the simplest possible chain like this, with every link in it locally easy-to-interpret.
Interpreting the whole thing, then, would amount to: picking a random part of it, iteratively following the terms in its definition backwards, arriving at some locally simple definition that only uses the terms in the initial symbolic-to-us language, then turning around and starting to "step forwards", iteratively learning new terms and using them to comprehend more terms.
I. e.: the compression process would implement a natural "entry point" for us, a thread we'd be able to pull on to unravel the whole thing. The remaining task would still be challenging – "understand a complex codebase" multiplied by "learn new physics from a textbook" – but astronomically easier than "derive new scientific paradigms from scratch", which is where we're currently at.
(To be clear, I still expect a fair amount of annoying messiness there, such as code-golfing. But this seems like the kind of problem that could be ameliorated by some practical tinkering and regularization, and other "schlep".)
4. Computational tractability. But why would we think that this sort of compressed representation could be constructed compute-efficiently, such that the process finishes before the stars go out (forget "before the AGI doom")?
First, as above, we have existence proofs. Human world-models seem to be structured this way, and they are generated at fairly reasonable compute costs. (Potentially at shockingly low compute costs.[3])
Second: Any two Turing-complete languages are mutually interpretable, at the flat complexity cost of the interpreter (which depends on the languages but not on the program). As a result, the additional computational cost of interpretability – of computing a translation to the hard-coded symbolic-to-us language – would be flat.
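For reference, the complexity half of that is just the invariance theorem from algorithmic information theory: for any two Turing-complete languages $A$ and $B$,

$$K_{B}(x) \;\le\; K_{A}(x) + c_{A \to B},$$

where $K_{A}(x)$ is the length of the shortest program in $A$ that outputs $x$, and $c_{A \to B}$ is the length of an interpreter for $A$ written in $B$, a constant that depends on the two languages but not on $x$.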
5. How is this reconciled with the failures of previous symbolic learning systems? That is: if the universe has this neat symbolic structure that could be uncovered in compute-efficient ways, why didn't pre-DL approaches work?
This essay does an excellent job explaining why. To summarize: even if the final correct output would be (isomorphic to) a symbolic structure, the compute-efficient path to getting there, the process of figuring that structure out, is not necessarily a sequence of ever-more-correct symbolic structures. On the contrary: if we start from sparse hierarchical graphs and add provisions for making it easy to traverse their space in search of the correct graph, we pretty quickly arrive at (more or less) neural networks.
However: I'm not suggesting that we use symbolic learning methods. The aim is to set up a process which would output a highly useful symbolic structure. How that process works, what path it takes there, how it constructs that structure, is up in the air.
Designing such a process is conceptually tricky. But as I argue above, theory and common sense say that it ought to be possible; and I do have ideas.
The compression task can be split into three subproblems. I will release several posts exploring each subproblem in more detail in the next few days (or you can access the content that'd go into them here).
Summaries:
1. "Abstraction-learning". Given a set of random low-level variables which implement some higher-level abstraction, how can we learn that abstraction? What functions map from the molecules of a cell to that cell, from a human's cells to that human, from the humans of a given nation to that nation; or from the time-series of some process to the laws governing it?
As mentioned above, this is the problem the natural-abstractions agenda is currently focused on.
My current guess is that, at the high level, this problem can be characterized as a "constructive" version of Partial Information Decomposition. It involves splitting (every subset of) the low-level variables into unique, redundant, and synergistic components.
Given correct formal definitions for unique/redundant/synergistic variables, it should be straightforwardly solvable via machine learning.
Current status: the theory is well-developed and it appears highly tractable.
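To gesture at what "solvable via machine learning" could look like, here is a toy construction of my own (under the assumption that agreement between per-variable encoders is a reasonable proxy for redundant information; it is not the agenda's, or PID's, actual formalization):

```python
# Toy stab at "extract the redundant component via ML": one encoder per
# low-level variable, trained so that all encoders agree on their output.
# Only information present in *every* variable can satisfy that constraint.

import torch
import torch.nn as nn

n_parts, dim_part, dim_latent = 5, 16, 4

encoders = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim_part, 32), nn.ReLU(), nn.Linear(32, dim_latent))
     for _ in range(n_parts)]
)
opt = torch.optim.Adam(encoders.parameters(), lr=1e-3)


def agreement_loss(parts):
    """parts: list of (batch, dim_part) tensors, one per low-level variable."""
    codes = [enc(x) for enc, x in zip(encoders, parts)]
    mean_code = torch.stack(codes).mean(dim=0)
    return sum(((c - mean_code) ** 2).mean() for c in codes)


# One optimization step on fake data, just to show the moving parts.
parts = [torch.randn(8, dim_part) for _ in range(n_parts)]
loss = agreement_loss(parts)
loss.backward()
opt.step()

# In practice this needs an extra term (e.g., a variance or decorrelation
# penalty) to keep the encoders from collapsing to a constant output.
```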
2. "Truesight". When we're facing a structure-learning problem, such as abstraction-learning, we assume that we get many samples from the same fixed structure. In practice, however, the probabilistic structures are themselves resampled.
Examples:
I. e.,
On a sample-to-sample basis, we can't rely on any static abstraction functions to be valid. We need to search for appropriate ones "at test-time": by trying various transformations of the data until we spot the "simple structure" in it.
Here, "simplicity" is defined relative to the library of stored abstractions. What we want, essentially, is to be able to recognize reoccurrences of known objects despite looking at them "from a different angle". Thus, "truesight".
Current status: I think I have a solid conceptual understanding of it, but it's at the pre-formalization stage. There's one obvious way to formalize it, but it seems best avoided, or only used as a stepping stone.
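A cartoon of the truesight intuition, with everything (the template, the candidate transformations, the scoring rule) made up purely for illustration and not meant as the formalization in question: search over a small library of transformations and keep the one under which the observation matches a stored abstraction.

```python
# Toy "truesight": recognize a known object viewed "from a different
# angle" by searching over candidate transformations and keeping the one
# under which the data matches a stored abstraction. Purely illustrative;
# the real subproblem is defining the right notion of simplicity, not
# this brute-force search.

import numpy as np

rng = np.random.default_rng(0)

template = np.array([0.0, 1.0, 4.0, 9.0, 16.0])   # a "known abstraction"

# Candidate "angles" we might be seeing it from (scalings and shifts).
transformations = {
    (scale, shift): (lambda x, a=scale, b=shift: (x - b) / a)
    for scale in (0.5, 1.0, 2.0)
    for shift in (-3.0, 0.0, 3.0)
}

observation = 2.0 * template + 3.0 + 0.01 * rng.normal(size=template.shape)


def residual_cost(obs, transform):
    return float(np.sum((transform(obs) - template) ** 2))


best = min(transformations, key=lambda k: residual_cost(observation, transformations[k]))
print(best)  # (2.0, 3.0): the transformation that reveals the known structure
```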
3. Dataset-assembly. There's a problem:
Thus, subproblem 3: how to automatically spot ways to slice the data into datasets whose entries are isomorphic to samples from some fixed probabilistic structure, making them suitable for abstraction-learning.
Current status: basically greenfield. I don't have a solid high-level model of this subproblem yet, only some preliminary ideas.
1. Red-teaming. I'm interested in people trying to find important and overlooked-by-me issues with this approach, so I'm setting up a bounty: $5-$100 for spotting something wrong that makes me change my mind. The payout scales with impact.
Fair warnings:
A reasonable strategy here would be to write up a low-effort list of one-sentence summaries of potential problems you see; I'll point out which seem novel and promising at a glance, and you could expand on those.
2. Blue-teaming. I am also interested in people bringing other kinds of agenda-relevant useful information to my attention: relevant research papers or original thoughts you may have. Likewise, a $5-$100 bounty on that, scaling with impact.[4]
I will provide pointers regarding the parts I'm most interested in as I post more detailed write-ups on the subproblems.
Both bounties will be drawn from a fixed pool of $500 I've set aside for this. I hope to scale up the pool and the rewards in the future. On that note...
I'm looking to diversify my funding sources. Speaking plainly, the AI Alignment funding landscape seems increasingly captured by LLMs; I pretty much expect that only the LTFF would fund me. This is an uncomfortable situation to be in: if some disaster were to befall the LTFF, or if the LTFF were to change its priorities as well, I would be completely at sea.
As such:
Regarding target funding amounts: I currently reside in a country with low costs of living, and I don't require much compute at this stage, so the raw resources needed are small; e. g., $40k would cover me for a year. That said, my not residing in the US increasingly seems like a bottleneck on collaborating with other researchers. As such, I'm currently aiming to build up a financial safety cushion, then immigrate there. Funding would be useful up to $200k.[5]
If you're interested in funding my work, but want more information first, you can access a fuller write-up through this link.
If you want a reference, reach out to @johnswentworth.
Crypto
BTC: bc1q7d8qfz2u7dqwjdgp5wlqwtjphfhct28lcqev3v
ETH: 0x27e709b5272131A1F94733ddc274Da26d18b19A7
SOL: CK9KkZF1SKwGrZD6cFzzE7LurGPRV7hjMwdkMfpwvfga
TRON: THK58PFDVG9cf9Hfkc72x15tbMCN7QNopZ
Preference: Ethereum, USDC stablecoins.
You may think a decade is too slow given LLM timelines. Caveat: "a decade" is the pessimistic estimate under my primary, bearish-on-LLMs, model. In worlds in which LLM progress goes as fast as some hope/fear, this agenda should likewise advance much faster, for one reason: it doesn't seem that far from being fully formalized. Once it is, it would become possible to feed it to narrowly superintelligent math AIs (which are likely to appear first, before omnicide-capable general ASIs), and they'd cut years of math research down to ~zero.
I do not centrally rely on/expect that. I don't think LLM progress would go this fast; and if LLMs do speed up towards superintelligence, I'm not convinced it would be in the predictable, on-trend way people expect.
That said, I do assign nontrivial weight to those worlds, and care about succeeding in them. I expect this agenda to fare pretty well there.
It could be argued that they're not "fully" symbolic – that parts of them are only accessible to our intuitions, that we can't break the definitions of the symbols/modules in them down to the most basic functions/neuron activations. But I think they're "symbolic enough": if we could generate an external world-model that's as understandable to us as our own world-models (and we are confident that this understanding is accurate), that should suffice for fulfilling the "interpretability" criterion.
That said, I don't expect this caveat to come into play: I expect a world-model that would be ultimately understandable in totality.
The numbers in that post feel somewhat low to me, but I think it's directionally correct.
Though you might want to reach out via private messages if the information seems exfohazardous. E. g., specific ideas about sufficiently powerful compression algorithms are obviously dual-use.
Well, truthfully, I could probably find ways to usefully spend up to $1 million/year, just by hiring ten mathematicians and DL engineers to explore all easy-to-describe, high-reward, low-probability-of-panning-out research threads. So if you want to give me $1 million, I sure wouldn't say no.