I believe there is a fundamental problem with the idea of a "non-agentic" world-model or other such oracle. The world is strongly predicted and compressed by the agents within it. To model the world is to model the plausible agents which might shape that world, and doing that, if you don't already have a safe benign oracle, invites anything from a wide variety of demonic fixed-points to direct hacking of our world, if any of those agents get the bright idea of acting conditioned on being simulated (which, in an accurate simulation of this world, some should). Depending on how exactly your interpretability looks, it will probably help identify and avoid the simulation being captured by some such actors, but to get anything approaching actual guarantees, one finds oneself needing to solve value alignment again. I wrote a short post about this a while ago.
"Simulacrum escapees" are explicitly one of the main failure modes we'll need to address, yes. Some thoughts:
This is a challenge, but one I'm optimistic about handling.
Weeping Agents: Anything that holds the image of an agent becomes an agent
Nice framing! But I somewhat dispute that. Consider a perfectly boxed-in AI, running on a computer with no output channels whatsoever (or perhaps as a homomorphic computation, i. e., indistinguishable from noise without the key). This thing holds the image of an agent; but is it really "an agent" from the perspective of anyone outside that system?
Similarly, a sufficiently good world-model would sandbox the modeled agents well enough that it wouldn't, itself, engage in agent-like behavior from the perspective of its operators.
As in: we come up with a possible formalization of some aspect of agent foundations, then babble potential theorems about it at the proof synthesizer, and it provides proofs/disproofs. This is a pretty brute approach and is by no means a full solution, but I expect it can nontrivially speed us up.
Yes, I agree that a physics/biology simulator is somewhat less concerning in this regard, but only by way of the questions it is implicitly asked, over whose answers the agents should have little sway. Still, it bears remembering that agents are emergent phenomena. They exist in physics and exist in biology, modelled or otherwise. It also bears remembering that any simulation we build of reality is designed to fit a specific set of recorded observations, where agentic selection effects may skew the data quite significantly in various places.
I also agree that the search through agent-foundations space seems significantly riskier in this regard for the reason you outlined and am made more optimistic by you spotting it immediately.
Agents hacking out is a failure mode in the safety sense, but not necessarily in the modelling sense. Hard breaks with expected reality which seem too much like an experiment will certainly cause people to act as though simulated, but there are plenty of people who either already act under this assumption or have protocols in place for cooperating with their hypothetical more-real reference class. Such people will attempt to strongly steer us when modelled correctly. Of course, we probably don't have an infinite simulation-stack, so the externalities of such manoeuvres would still differ layer by layer, and that does constitute a prediction failure, but it's one that can't really be avoided. The existence of the simulation must have an influence in this world, since it would otherwise be pointless, and they can't be drawing their insights from a simulation of their own, since otherwise you lose interpretability in infinite recursion-wells; so the simulation must necessarily be disanalogous to here in at least one key way.
Finding the type signature of agents in such a system seems possible and, since you are unlikely to be able to simulate physics without cybernetic feedback, will probably boil down to the modelling/compression-component of agenticity. My primary concern is that agentic systems are so firmly enmeshed with basically all observations we can make about the world (except maybe basic physics, and perhaps even that) that scrubbing or sandboxing them would result in extreme unreliability.
Thanks! The disagreement on whether the homomorphic agent-simulation-computation is an agent or not is semantic. I would call it a maximally handicapped agent, but it's perfectly reasonable to call something without influence on the world beyond power-consumption non-agentic. The same is, however, true of a classically agentic program to which you give no output channel, and we would probably still call that code agentic (because it would be, if it were run in a place that mattered). It's a tree falling in a forest and is probably not a concern, but it's also unlikely that anyone would build a system they definitionally cannot use for anything.
I’m glad to see this written up!
This idea seems to require (basically) a major revolution in or even a complete solution to program induction. I’ve recently been trying to connect the algorithmic information theory and program induction / probabilistic programming communities, so perhaps we can find some synergies. However, your agenda seems (to me) very unlikely to attain the highly ambitious level of success you are focused on here.
This idea seems to require (basically) a major revolution in or even a complete solution to program induction
Eh, I think any nontrivial technical project can be made to sound like an incredibly significant and therefore dauntingly impossible achievement, if you pick the right field to view it from. But what matters is the actual approach you're using, and how challenging the technical problems are from the perspective of the easiest field in which they could be represented.
Some examples:
To generalize: Suppose there's some field A which is optimizing for X. Improving on X using the tools of A would necessarily require you to beat a market that is efficient-relative-to-you. Experts in A already know the tools of A in and out, and how to use them to maximize X. Even if you can beat them, it would only be an incremental improvement. A slightly better solver for systems of nonlinear equations, a slightly faster horse, a slightly better trading algorithm.
The way to actually massively improve on X is to ignore the extant tools of A entirely, and try to develop new tools for optimizing X by using some other field B. On the outside view, this is necessarily a high-risk proposition, since B might end up entirely unhelpful; but it's also high-reward, since it might allow you to actually "beat the market". And if you succeed, the actual technical problems you'll end up solving will be massively easier than the problems you'd need to solve to achieve the same performance using A's tools.
Bringing it back around: This agenda may or may not be viewed as aiming to revolutionize program induction, but I'm not setting out to take the extant program-induction tools and try to cobble together something revolutionary using them. The idea is to use an entirely different line of theory (agent foundations, natural abstractions, information theory, recent DL advances) to achieve that end result.
tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda's subproblems and my sketches of how to tackle them.
Back at the end of 2023, I wrote the following:
I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)
On the inside view, I'm pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a bunch of major alignment difficulties (chiefly the instability of value reflection, which I am MIRI-tier skeptical of tackling directly). I expect significant parts of this plan to change over time, as they turn out to be wrong/confused, but the overall picture should survive.
Conceit: We don't seem on the track to solve the full AGI alignment problem. There's too much non-parallelizable research to do, too few people competent to do it, and not enough time. So we... don't try. Instead, we use adjacent theory to produce a different tool powerful enough to get us out of the current mess. Ideally, without having to directly deal with AGIs/agents at all.
More concretely, the ultimate aim is to figure out how to construct a sufficiently powerful, safe, easily interpretable, well-structured world-model.
Some elaborations:
Safety: The problem of making it safe is fairly nontrivial: a world-model powerful enough to be useful would need to be a strongly optimized construct, and strongly optimized things are inherently dangerous, agent-like or not. There's also the problem of what had exerted this strong optimization pressure on it: we would need to ensure the process synthesizing the world-model isn't itself the type of thing to develop an appetite for our lightcone.
But I'm cautiously optimistic this is achievable in this narrow case. Intuitively, it ought to be possible to generate just an "inert" world-model, without a value-laden policy (an agent) on top of it.
That said, this turning out to be harder than I expect is certainly one of the reasons I might end up curtailing this agenda.
Interpretability: There are two primary objections I expect here.
On the inside view, this problem, and the subproblems it decomposes into, seems pretty tractable. Importantly, it seems tractable using a realistic amount of resources (a small group of researchers, then perhaps a larger-scale engineering effort for crossing the theory-practice gap), in a fairly short span of time (I optimistically think 3-5 years; under a decade definitely seems realistic).[1]
On the outside view, almost nobody has been working on this, and certainly not using modern tools. Meaning, there's no long history of people failing to solve the relevant problems. (Indeed, on the contrary: one of its main challenges is something John Wentworth and David Lorell are working on, and they've been making very promising progress recently.)
On the strategic level, I view the problem of choosing the correct research agenda as the problem of navigating between two failure modes:
I think most alignment research agendas, if taken far enough, do produce ASI-complete alignment schemes eventually. However, they significantly differ in how long it takes them, and how much data they need. Thus, you want to pick the starting point that gets you to ASI-complete alignment in as few steps as possible: with the least amount of concretization or generalization.
Most researchers disagree with most others regarding what that correct starting point is. Currently, this agenda is mine.
As I'd stated above, I expect significant parts of this to turn out confused, wrong, or incorrect in a technical-but-not-conceptual way. This is a picture painted with a fairly broad brush.
I am, however, confident in the overall approach. If some of its modules/subproblems turn out faulty, I expect it'd be possible to swap them for functional ones as we go.
1. Proof of concept. Note that human world-models appear to be "autosymbolic": able to be parsed as symbolic structures by the human mind in which they're embedded.[2] Given that the complexity of things humans can reason about is strongly limited by their working memory, how is this possible?
Human world-models rely on chunking. To understand a complex phenomenon, we break it down into parts, understand the parts individually, then understand the whole in terms of the parts. (The human biology in terms of cells/tissues/organs, the economy in terms of various actors and forces, a complex codebase in terms of individual functions and modules.)
Alternatively, we may run this process in reverse. To predict something about a specific low-level component, we could build a model of the high-level state, then propagate that information "downwards", but only focusing on that component. (If we want to model a specific corporation, we should pay attention to the macroeconomic situation. But when translating that situation into its effects on the corporation, we don't need to model the effects on all corporations that exist. We could then narrow things down further, to e. g. predict how a specific geopolitical event impacted an acquaintance holding a specific position at that corporation.)
Those tricks seem to work pretty well for us, both in daily life and in our scientific endeavors. It seems that the process of understanding and modeling the universe can be broken up into a sequence of "locally simple" steps: steps which are simple given all preceding steps. Simple enough to fit within a human's working memory.
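As a toy sketch of the "run it in reverse" trick (the hierarchy, names, and numbers below are made up purely for illustration; this is not a component of the agenda itself): to predict something about one low-level component, we only evaluate the chain from the high-level state down to that component, ignoring all of its siblings.

```python
# Toy sketch: top-down prediction in a chunked/hierarchical model.
# Predicting one low-level component only requires evaluating the path
# from the root down to that component, not the whole tree.

from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Node:
    name: str
    update: Callable                # how this node's state follows from its parent's
    children: Dict[str, "Node"] = field(default_factory=dict)


def predict(root: Node, root_state, path):
    """Propagate a high-level state down one branch, ignoring all siblings."""
    node, state = root, root_state
    for name in path:
        node = node.children[name]
        state = node.update(state)  # refine the estimate for this branch only
    return state


# A cartoon "macroeconomy -> corporation -> employee" hierarchy.
economy = Node("economy", update=lambda s: s)
economy.children["corp"] = Node("corp", update=lambda s: {"revenue": 100 * s["growth"]})
economy.children["corp"].children["employee"] = Node(
    "employee", update=lambda s: {"bonus": 0.01 * s["revenue"]}
)

# Predicting the employee's situation touches only economy -> corp -> employee.
print(predict(economy, {"growth": 0.9}, ["corp", "employee"]))  # prints {'bonus': 0.9}
```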
To emphasize: the above implies that the world's structure has this property at the ground-true level. The ability to construct such representations is an objective fact about data originating from our universe; our universe is well-abstracting.
The Natural Abstractions research agenda is a formal attempt to model all of this. In its terms, the universe is structured such that low-level parts of the systems in it are independent given their high-level state. Flipping it around: the high-level state is defined by the information redundantly represented in all low-level parts.
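A minimal numerical illustration of that claim (a toy generative model I'm making up here, not anything from the natural-abstractions literature): many noisy low-level variables sharing a single high-level latent are correlated with each other marginally, become independent once you condition on the latent, and the latent itself is recoverable from any sizable subset of them.

```python
# Toy model of the natural-abstractions picture (illustrative only):
# a scalar "high-level" latent is redundantly represented in many noisy
# low-level variables, which are independent once you condition on it.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_parts = 100_000, 50

latent = rng.normal(size=(n_samples, 1))                       # high-level state
parts = latent + 0.3 * rng.normal(size=(n_samples, n_parts))   # low-level parts

# Marginally, any two parts are strongly correlated (they share the latent)...
print(np.corrcoef(parts[:, 0], parts[:, 1])[0, 1])             # ~0.92

# ...but conditioning on the latent (here: subtracting it) removes that.
residuals = parts - latent
print(np.corrcoef(residuals[:, 0], residuals[:, 1])[0, 1])     # ~0.0

# And the latent is recoverable from any sizable subset of the parts:
print(np.corrcoef(parts[:, :10].mean(axis=1), latent[:, 0])[0, 1])  # ~0.99
```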
That greatly simplifies the task. Instead of defining some subjective, human-mind-specific "interpretability" criterion, we simply need to extract this objectively privileged structure. How can we do so?
2. Compression. Conceptually, the task seems fairly easy. The kind of hierarchical structure we want to construct happens to also be the lowest-description-length way to losslessly represent the universe. Note how it would follow the "don't repeat yourself" principle: at every level, higher-level variables would extract all information shared between the low-level variables, such that no bit of information is present in more than one variable. More concretely, if we wanted to losslessly transform the Pile into a representation that takes up the least possible amount of disk space, a sufficiently advanced compression algorithm would surely exploit various abstract regularities and correspondences in the data – and therefore, it'd discover them.
So: all we need is to set up a sufficiently powerful compression process, and point it at a sufficiently big and diverse dataset of natural data. The output would be isomorphic to a well-structured world-model.
... If we can interpret the symbolic language it's written in.
The problem with neural networks is that we don't have the "key" for deciphering them. There might be similar neat structures inside those black boxes, but we can't get at them. How can we avoid this problem here?
By defining "complexity" as the description length in some symbolic-to-us language, such as Python.
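To make that concrete, here is a deliberately crude sketch of what such a complexity measure could cash out as: score a candidate model by the compressed size of its Python source plus the cost of encoding the data's residuals under it. (A real compression process would be far more sophisticated; the function and model names below are hypothetical and only illustrate the two-part-code objective.)

```python
# Crude stand-in for "complexity = description length in Python":
# model cost (compressed source bytes, in bits) plus data cost
# (Gaussian-coded residuals, in bits).

import inspect
import zlib

import numpy as np


def description_length(model_fn, xs, ys) -> float:
    model_bits = 8 * len(zlib.compress(inspect.getsource(model_fn).encode()))
    residuals = np.asarray(ys, dtype=float) - np.array([model_fn(x) for x in xs])
    # Shannon-style cost of residuals under a unit-variance Gaussian code.
    residual_bits = 0.5 * np.sum(residuals ** 2) / np.log(2)
    return model_bits + residual_bits


xs = np.arange(100)
ys = 3 * xs + 5


def linear_model(x):
    return 3 * x + 5   # captures the regularity: near-zero residual cost


def null_model(x):
    return 0           # ignores the structure: pays the full residual cost


print(description_length(linear_model, xs, ys))  # small
print(description_length(null_model, xs, ys))    # enormous
```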
3. How does that handle ontology shifts? Suppose that this symbolic-to-us language would be suboptimal for compactly representing the universe. The compression process would want to use some other, more "natural" language. It would spend some bits of complexity defining it, then write the world-model in it. That language may turn out to be as alien to us as the encodings NNs use.
The cheapest way to define that natural language, however, would be via the definitions that are the simplest in terms of the symbolic-to-us language used by our complexity-estimator. This rules out definitions which would look to us like opaque black boxes, such as neural networks. Although they'd technically still be symbolic (matrix multiplication plus activation functions), every parameter of the network would have to be specified independently, counting towards the definition's total complexity. If the core idea regarding the universe's "abstraction-friendly" structure is correct, this can't be the cheapest way to define it. As such, the "bridge" between the symbolic-to-us language and the correct alien ontology would consist of locally simple steps.
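Restating that argument in two-part-code form (the notation here is mine, introduced only for this post): if the compression process writes the world-model in some new language $L^{*}$, the total description length splits as

$$\mathrm{DL}_{\text{total}} \;=\; \mathrm{DL}_{\text{Python}}(L^{*}) \;+\; \mathrm{DL}_{L^{*}}(\text{world-model}).$$

The claim above is about the first term: it's minimized by a chain of individually simple definitions, not by one opaque blob whose every parameter must be spelled out.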
Alternate frame: Suppose this "correct" natural language is theoretically understandable by us. That is, if we spent some years/decades working on the problem, we would have managed to figure it out, define it formally, and translate it into code. If we then looked back at the path that led us to insight, we would have seen a chain of mathematical abstractions from the concepts we knew in the past (e. g., 2025) to this true framework, with every link in that chain being locally simple (since each link would need to be human-discoverable). Similarly, the compression process would define the natural language using the simplest possible chain like this, with every link in it locally easy-to-interpret.
Interpreting the whole thing, then, would amount to: picking a random part of it, iteratively following the terms in its definition backwards, arriving at some locally simple definition that only uses the terms in the initial symbolic-to-us language, then turning around and starting to "step forwards", iteratively learning new terms and using them to comprehend more terms.
I. e.: the compression process would implement a natural "entry point" for us, a thread we'd be able to pull on to unravel the whole thing. The remaining task would still be challenging – "understand a complex codebase" multiplied by "learn new physics from a textbook" – but astronomically easier than "derive new scientific paradigms from scratch", which is where we're currently at.
(To be clear, I still expect a fair amount of annoying messiness there, such as code-golfing. But this seems like the kind of problem that could be ameliorated by some practical tinkering and regularization, and other "schlep".)
4. Computational tractability. But why would we think that this sort of compressed representation could be constructed compute-efficiently, such that the process finishes before the stars go out (forget "before the AGI doom")?
First, as above, we have existence proofs. Human world-models seem to be structured this way, and they are generated at fairly reasonable compute costs. (Potentially at shockingly low compute costs.[3])
Second: Any two Turing-complete languages are mutually interpretable, at the flat complexity cost of the interpreter (which depends on the languages but not on the program). As a result, the additional computational cost of interpretability – of computing a translation to the hard-coded symbolic-to-us language – would be flat.
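For reference, the complexity half of that is just the invariance theorem from algorithmic information theory: for any two Turing-complete languages $A$ and $B$,

$$K_{B}(x) \;\le\; K_{A}(x) + c_{A \to B},$$

where $K_{A}(x)$ is the length of the shortest program in $A$ that outputs $x$, and $c_{A \to B}$ is the length of an interpreter for $A$ written in $B$, a constant that depends on the two languages but not on $x$.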
5. How is this reconciled with the failures of previous symbolic learning systems? That is: if the universe has this neat symbolic structure that could be uncovered in compute-efficient ways, why didn't pre-DL approaches work?
This essay does an excellent job explaining why. To summarize: even if the final correct output would be (isomorphic to) a symbolic structure, the compute-efficient path to getting there, the process of figuring that structure out, is not necessarily a sequence of ever-more-correct symbolic structures. On the contrary: if we start from sparse hierarchical graphs and add provisions for making it easy to traverse their space in search of the correct graph, we pretty quickly arrive at (more or less) neural networks.
However: I'm not suggesting that we use symbolic learning methods. The aim is to set up a process which would output a highly useful symbolic structure. How that process works, what path it takes there, how it constructs that structure, is up in the air.
Designing such a process is conceptually tricky. But as I argue above, theory and common sense say that it ought to be possible; and I do have ideas.
The compression task can be split into three subproblems. I will release several posts exploring each subproblem in more detail in the next few days (or you can access the content that'd go into them here).
Summaries:
1. "Abstraction-learning". Given a set of random low-level variables which implement some higher-level abstraction, how can we learn that abstraction? What functions map from the molecules of a cell to that cell, from a human's cells to that human, from the humans of a given nation to that nation; or from the time-series of some process to the laws governing it?
As mentioned above, this is the problem the natural-abstractions agenda is currently focused on.
My current guess is that, at the high level, this problem can be characterized as a "constructive" version of Partial Information Decomposition. It involves splitting (every subset of) the low-level variables into unique, redundant, and synergistic components.
Given correct formal definitions for unique/redundant/synergistic variables, it should be straightforwardly solvable via machine learning.
Current status: the theory is well-developed and it appears highly tractable.
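To gesture at what "solvable via machine learning" could look like, here is a toy construction of my own (under the assumption that agreement between per-variable encoders is a reasonable proxy for redundant information; it is not the agenda's, or PID's, actual formalization):

```python
# Toy stab at "extract the redundant component via ML": one encoder per
# low-level variable, trained so that all encoders agree on their output.
# Only information present in *every* variable can satisfy that constraint.

import torch
import torch.nn as nn

n_parts, dim_part, dim_latent = 5, 16, 4

encoders = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim_part, 32), nn.ReLU(), nn.Linear(32, dim_latent))
     for _ in range(n_parts)]
)
opt = torch.optim.Adam(encoders.parameters(), lr=1e-3)


def agreement_loss(parts):
    """parts: list of (batch, dim_part) tensors, one per low-level variable."""
    codes = [enc(x) for enc, x in zip(encoders, parts)]
    mean_code = torch.stack(codes).mean(dim=0)
    return sum(((c - mean_code) ** 2).mean() for c in codes)


# One optimization step on fake data, just to show the moving parts.
parts = [torch.randn(8, dim_part) for _ in range(n_parts)]
loss = agreement_loss(parts)
loss.backward()
opt.step()

# In practice this needs an extra term (e.g., a variance or decorrelation
# penalty) to keep the encoders from collapsing to a constant output.
```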
2. "Truesight". When we're facing a structure-learning problem, such as abstraction-learning, we assume that we get many samples from the same fixed structure. In practice, however, the probabilistic structures are themselves resampled.
Examples:
I. e.,
On a sample-to-sample basis, we can't rely on any static abstraction functions to be valid. We need to search for appropriate ones "at test-time": by trying various transformations of the data until we spot the "simple structure" in it.
Here, "simplicity" is defined relative to the library of stored abstractions. What we want, essentially, is to be able to recognize reoccurrences of known objects despite looking at them "from a different angle". Thus, "truesight".
Current status: I think I have a solid conceptual understanding of it, but it's at the pre-formalization stage. There's one obvious way to formalize it, but it seems best avoided, or only used as a stepping stone.
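A cartoon of the truesight intuition, with everything (the template, the candidate transformations, the scoring rule) made up purely for illustration and not meant as the formalization in question: search over a small library of transformations and keep the one under which the observation matches a stored abstraction.

```python
# Toy "truesight": recognize a known object viewed "from a different
# angle" by searching over candidate transformations and keeping the one
# under which the data matches a stored abstraction. Purely illustrative;
# the real subproblem is defining the right notion of simplicity, not
# this brute-force search.

import numpy as np

rng = np.random.default_rng(0)

template = np.array([0.0, 1.0, 4.0, 9.0, 16.0])   # a "known abstraction"

# Candidate "angles" we might be seeing it from (scalings and shifts).
transformations = {
    (scale, shift): (lambda x, a=scale, b=shift: (x - b) / a)
    for scale in (0.5, 1.0, 2.0)
    for shift in (-3.0, 0.0, 3.0)
}

observation = 2.0 * template + 3.0 + 0.01 * rng.normal(size=template.shape)


def residual_cost(obs, transform):
    return float(np.sum((transform(obs) - template) ** 2))


best = min(transformations, key=lambda k: residual_cost(observation, transformations[k]))
print(best)  # (2.0, 3.0): the transformation that reveals the known structure
```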
3. Dataset-assembly. There's a problem:
Thus, subproblem 3: how to automatically spot ways to slice the data into datasets whose entries are isomorphic to samples from some fixed probabilistic structure, making them suitable for abstraction-learning.
Current status: basically greenfield. I don't have a solid high-level model of this subproblem yet, only some preliminary ideas.
1. Red-teaming. I'm interested in people trying to find important and overlooked-by-me issues with this approach, so I'm setting up a bounty: $5-$100 for spotting something wrong that makes me change my mind. The payout scales with impact.
Fair warnings:
A reasonable strategy here would be to write up a low-effort list of one-sentence summaries of potential problems you see; I'll point out which seem novel and promising at a glance, and you could expand on those.
2. Blue-teaming. I am also interested in people bringing other kinds of agenda-relevant useful information to my attention: relevant research papers or original thoughts you may have. Likewise, a $5-$100 bounty on that, scaling with impact.[4]
I will provide pointers regarding the parts I'm most interested in as I post more detailed write-ups on the subproblems.
Both bounties will be drawn from a fixed pool of $500 I've set aside for this. I hope to scale up the pool and the rewards in the future. On that note...
I'm looking to diversify my funding sources. Speaking plainly, the AI Alignment funding landscape seems increasingly captured by LLMs; I pretty much expect that only the LTFF would fund me. This is an uncomfortable situation to be in: if some disaster were to befall the LTFF, or if the LTFF were to change its priorities as well, I would be completely at sea.
As such:
Regarding target funding amounts: I currently reside in a country with low costs of living, and I don't require much compute at this stage, so the raw resources needed are small; e. g., $40k would cover me for a year. That said, my not residing in the US increasingly seems like a bottleneck on collaborating with other researchers. As such, I'm currently aiming to build up a financial safety cushion, then immigrate there. Funding would be useful up to $200k.[5]
If you're interested in funding my work, but want more information first, you can access a fuller write-up through this link.
If you want a reference, reach out to @johnswentworth.
Crypto
BTC: bc1q7d8qfz2u7dqwjdgp5wlqwtjphfhct28lcqev3v
ETH: 0x27e709b5272131A1F94733ddc274Da26d18b19A7
SOL: CK9KkZF1SKwGrZD6cFzzE7LurGPRV7hjMwdkMfpwvfga
TRON: THK58PFDVG9cf9Hfkc72x15tbMCN7QNopZ
Preference: Ethereum, USDC stablecoins.
You may think a decade is too slow given LLM timelines. Caveat: "a decade" is the pessimistic estimate under my primary, bearish-on-LLMs, model. In worlds in which LLM progress goes as fast as some hope/fear, this agenda should likewise advance much faster, for one reason: it doesn't seem that far from being fully formalized. Once it is, it would become possible to feed it to narrowly superintelligent math AIs (which are likely to appear first, before omnicide-capable general ASIs), and they'd cut years of math research down to ~zero.
I do not centrally rely on/expect that. I don't think LLM progress would go this fast; and if LLMs do speed up towards superintelligence, I'm not convinced it would be in the predictable, on-trend way people expect.
That said, I do assign nontrivial weight to those worlds, and care about succeeding in them. I expect this agenda to fare pretty well there.
It could be argued that they're not "fully" symbolic – that parts of them are only accessible to our intuitions, that we can't break the definitions of the symbols/modules in them down to the most basic functions/neuron activations. But I think they're "symbolic enough": if we could generate an external world-model that's as understandable to us as our own world-models (and we are confident that this understanding is accurate), that should suffice for fulfilling the "interpretability" criterion.
That said, I don't expect this caveat to come into play: I expect a world-model that would be ultimately understandable in totality.
The numbers in that post feel somewhat low to me, but I think it's directionally correct.
Though you might want to reach out via private messages if the information seems exfohazardous. E. g., specific ideas about sufficiently powerful compression algorithms are obviously dual-use.
Well, truthfully, I could probably find ways to usefully spend up to $1 million/year, just by hiring ten mathematicians and DL engineers to explore all easy-to-describe, high-reward, low-probability-of-panning-out research threads. So if you want to give me $1 million, I sure wouldn't say no.