What I'm calling a simulator (following Janus's terminology) you call a predictor
Yup; I use the terms almost interchangeably. I tend to use "simulator" when referring to predictors used for a simulator-y use case, and "predictor" when I'm referring to how they're trained and things directly related to that.
I also like your metatoken concept: that's functionally what I'm suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.
Yup again—to be clear, all the metatoken stuff I was talking about would also fit in pretraining. Pretty much exactly the same thing. There are versions of it that might get some efficiency boosts by not requiring them to be present for the full duration of pretraining, but still similar in concept. (If we can show an equivalence between trained conditioning and representational interventions, and build representational interventions out of conditions, that could be many orders of magnitude faster.)
Alas, nope! To my knowledge it hasn't actually been tried at any notable scale; it's just one of those super simple things that would definitely work if you were willing to spend the compute to distill the behavior.
Signal boosted! This is one of those papers that seems less known than it should be. It's part of the reason why I'm optimistic about dramatic increases in the quality of "prosaic" alignment (in the sense of avoiding jailbreaks and generally behaving as expected) compared to RLHF, and I think it's part of a path that's robust enough to scale.
You can compress huge prompts into metatokens, too (just run inference with the prompt to generate the training data). And nest and remix metatokens together.
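To make that concrete, here's a minimal sketch of the distillation loop I have in mind, assuming a Hugging Face-style causal LM; the model name, prompt, tag, and helper function are all illustrative placeholders rather than anything from the paper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"                      # placeholder base model
LONG_PROMPT = "You are relentlessly kind, honest, and careful. ..."  # behavior to distill
METATOKEN = "<nice>"                     # tag that will stand in for the prompt

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Give the metatoken its own embedding so (pre)training can shape it.
tokenizer.add_special_tokens({"additional_special_tokens": [METATOKEN]})
model.resize_token_embeddings(len(tokenizer))

def build_distillation_example(query: str, max_new_tokens: int = 128) -> str:
    """Run inference *with* the huge prompt, then keep only the metatoken
    plus the query/response pair as a training example."""
    inputs = tokenizer(LONG_PROMPT + "\n" + query, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
    # The stored example drops the huge prompt and substitutes the tag.
    return f"{METATOKEN} {query}\n{response}"

# Mix examples like these into (pre)training data; afterwards the single tag
# should condition the model roughly the way the full prompt used to.
print(build_distillation_example("How should I apologize to a friend?"))
```

Nesting and remixing then just amounts to including several tags in the distilled prefix.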
It's also interesting in that it can preserve the constraints on learnable values during predictive training, unlike approaches equivalent to RL with sparse/distant rewards.
The fact that the distinctions it learns about the metatokens become better and better as more optimization pressure is applied is an interesting inversion of the usual doom-by-optimization story. Taking such a model to the extreme of optimization just makes it exceedingly good at distinguishing subtle details of what constitutes <nice> versus <authoritative_tone> versus <correct>. It's an axis of progress in alignment that generalizes as the capability does; the capability is the alignment. I'm pretty certain that a model that has very thoroughly learned what "nice" means at the human level can meaningfully generalize it to contexts where it hasn't seen it directly applied.[1]
I'm also reasonably confident in finding some other paths to extremely similar effects on internal representations. I wouldn't be surprised if we can decompose conditions into representational features to learn about what they mean at the learned feature level, then cobble together new inference-time conditions via representational intervention that would have equivalent effects to training new metatokens.
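As a gesture at the simplest version of that (a crude activation-steering-style sketch; gpt2 is just a stand-in, and the layer index and scale are arbitrary assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
LAYER, SCALE = 6, 4.0                                      # arbitrary choices

def mean_hidden(text: str) -> torch.Tensor:
    """Mean residual-stream activation at the output of block LAYER."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[LAYER + 1]
    return hidden.mean(dim=1).squeeze(0)

# Direction pointing from "unconditioned" toward "conditioned" representations;
# a stand-in for a feature-level decomposition of a trained condition.
steer = mean_hidden("Be exceptionally kind. How are you?") - mean_hidden("How are you?")

def add_direction(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_direction)
out = model.generate(**tokenizer("How are you?", return_tensors="pt"), max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```

The open question is whether an intervention built this way can be shown properly equivalent to a condition trained in as a metatoken, rather than merely pointing in a similar direction.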
After all, ChatGPT4/DALLE3 can generate an image of a vacuum cleaner that "embodies the aspirational human trait of being kind to one another." That seems like more of a reach than a hypothetical superintelligence figuring out that humans wouldn't be okay with, say, a superscience plan that would blow up 25% of the earth's crust.
I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.
Hm, I'm sufficiently surprised by this claim that I'm not sure I understand what you mean. I'll attempt a response on the assumption that I do understand; apologies if I don't:
I think of tools as agents with oddly shaped utility functions. They tend to be conditional in nature.
A common form is to be a mapping between inputs and outputs that isn't swayed by anything outside the context of that mapping (anything outside that context I'll term "external world states"). You can view a calculator as a coherent agent, but you can't usefully describe the calculator as a coherent agent with a utility function regarding world states that are external to the calculator's process.
You could use a calculator within a larger system that is describable as a maximizer over a utility function that includes unconditional terms for external world states, but that doesn't change the nature of the calculator. Draw the box around the calculator within the system? Pretty obviously a tool. Draw the box around the whole system? Not a tool.
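To gesture at the shape of the distinction (my own illustrative gloss, not a formal definition):

```latex
% Illustrative gloss: the calculator only ever optimizes a conditional objective
% over its own input/output mapping, while a "proper" agent has unconditional
% terms over external world states s.
\[
y^\ast(x) \in \arg\max_{y} \; U_{\text{calc}}(y \mid x)
\qquad\text{vs.}\qquad
\pi^\ast \in \arg\max_{\pi} \; \mathbb{E}_{s \sim P(\cdot \mid \pi)}\big[\, U_{\text{world}}(s) \,\big]
\]
% No term in U_calc refers to states outside the input-to-output mapping, so
% drawing the box around the calculator alone reveals no preferences about the world.
```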
I've been using the following two requirements to point at a maximally[1] tool-like set of agents. This composes what I've been calling goal agnosticism:
Note that this isn't the same thing as a definition for "tool." An idle rock uselessly obeys this definition; tools tend to be useful for something. This definition is meant to capture the distinction between things that feel like tools and those that feel like "proper" agents.
To phrase it another way, the intuitive degree of "toolness" is a spectrum of how much the agent exhibits unconditional preferences about external world states through instrumental behavior.
Notably, most pretrained LLMs with the usual autoregressive predictive loss and a diverse training set are heavily constrained into fitting this definition. Anything equivalent to RL agents trained with sparse/distant rewards is not. RLHF bakes a condition of peculiar shape into the model; I wouldn't be surprised if the result no longer strictly obeys the definition, but it's close enough along the spectrum that it still feels intuitive to call it a tool.
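Concretely, the constraint I'm pointing at falls out of the shape of the objectives (again just my gloss):

```latex
% Per-token predictive loss: every term is a conditional distribution-matching
% term over the next token; there is nowhere to hide an unconditional preference
% about external world states.
\[
\mathcal{L}_{\text{pred}}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}} \sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right)
\]
% Versus RL with a sparse/distant reward: whole trajectories are reinforced
% based on where they end up, which freely admits instrumental strategies
% aimed at external world states.
\[
J_{\text{RL}}(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, \sum_{t} \gamma^{t}\, r(s_t, a_t) \,\right]
\]
```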
Further, just like in the case of the calculator, you can easily build a system around a goal agnostic "tool" LLM that is not, itself, goal agnostic. Even prompting is enough to elicit a new agent-in-effect that is not necessarily goal agnostic. The ability for a goal agnostic agent to yield non-goal agnostic agents does not break the underlying agent's properties.[3]
For one critical axis in the toolishness basis, anyway.
Tricky stuff like having a bunch of terms regarding external world states that just so happen to always cancel doesn't count.
This does indeed sound kind of useless, but I promise the distinction does actually end up mattering quite a lot! That discussion goes beyond the scope of this post. The FAQ goes into more depth.
While this probably isn't the comment section for me to dump screeds about goal agnosticism, in the spirit of making my model more legible:
I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and also it seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And do think that these things are the case.
Yup! The value I assign to goal agnosticism—particularly as implemented in a subset of predictors—is in its usefulness as a foundation to build strong non-goal agnostic systems that aren't autodoomy. The transition out of goal agnosticism is not something I expect to avoid, nor something that I think should be avoided.
I think that a mish-mash of companies and individual researchers acting with little effective oversight will almost certainly fall off the path, and that even having most people adhering to the path won't be enough to stop catastrophe once someone has defected.
I'd be more worried about this if I thought the path was something that required Virtuous Sacrifice to maintain. In practice, the reason I'm as optimistic (nonmaximally pessimistic?) as I am is that I think there are pretty strong convergent pressures to stay on something close enough to the non-autodoom path.
In other words, if my model of capability progress is roughly correct, then there isn't a notably rewarding option to "defect" architecturally/technologically that yields greater autodoom.
With regard to other kinds of defection:
I also think that misuse can lead more directly to catastrophe, through e.g. terrorists using a potent goal-agnostic AI to design novel weapons of mass destruction. So in a world with increasingly potent and unregulated AI, I don't see how to have much hope for humanity.
Yup! Goal agnosticism doesn't directly solve misuse (broadly construed), which is part of why misuse is roughly 80% of my p(doom).
And I also don't see any easy way to do the necessary level of regulation and enforcement. That seems like a really hard problem. How do we prevent ALL of humanity from defecting when defection becomes cheap, easy-to-hide, and incredibly tempting?
If we muddle along deeply enough into a critical risk period slathered in capability overhangs that TurboDemon.AI v8.5 is accessible to every local death cult and we haven't yet figured out how to constrain their activity, yup, that's real bad.
Given my model of capability development, I think there are many incremental messy opportunities to act that could sufficiently secure the future over time. Given the nature of the risk and how it can proliferate, I view it as much harder to handle than nukes or biorisk, but not impossible.
Another experiment:
Extensions:
Some experimental directions I recently wrote up; might as well be public:
Less concretely, I'd also like to:
In retrospect, the example I used was poorly specified. It wouldn't surprise me if the result of the literal interpretation was "the AI refuses to play chess" rather than any kind of worldeating. The intent was to pick a sparse/distant reward that doesn't significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn't actually the most reliable accessible strategy for "never lose at chess" for that broader type of system, and I'd expect superior strategies to be found in the limit of optimization.
But the point is that in this scenario the LM doesn't want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.
My attempt at an ITT-response:
Drawing a box around a goal agnostic LM and analyzing the inputs and outputs of that box would not reveal any concerning wanting in principle. In contrast, drawing a box around a combined system—e.g. an agentic scaffold that incrementally asks a strong inner goal agnostic LM to advance the agent's process—could still be well-described by a concerning kind of wanting.
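(Concretely, the kind of combined box I have in mind is something like this toy scaffold loop, where every name is a hypothetical placeholder:)

```python
# Toy agentic scaffold: the inner LM call is goal agnostic, but the loop drawn
# around it is well-described as wanting the goal. All names are placeholders.
def run_scaffold(llm_complete, goal: str, execute_action, max_steps: int = 100):
    history = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"History: {history}\n"
            "Propose the single next action, or DONE if the goal is met."
        )
        action = llm_complete(prompt)          # goal agnostic predictor call
        if action.strip() == "DONE":
            break
        observation = execute_action(action)   # side effects on the world happen here
        history.append((action, observation))
    return history
```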
Trivially, a system that notices and removes wrenches on its own achieves its goals more easily, so there's pressure to make systems-as-agents which are better at removing wrenches. As the problems become more complicated, the system needs to be more responsible for removing wrenches to be efficient, yielding further pressure to give the system-as-agent more ability to act. Repeat this process a sufficient and unknown number of times and, potentially without ever training a neural network describable as having goals with respect to external world states, there's a system with dangerous optimization power.
(Disclaimer: I think there are strong repellers that prevent this convergent death spiral, I think there are lots of also-attractive-for-capabilities offramps along the worst path, and I think LM-like systems make these offramps particularly accessible. I don't know if I'm reproducing opposing arguments faithfully and part of the reason I'm trying is to see if someone can correct/improve on them.)
I think that'd be great!
Some of this stuff technically accelerates capabilities (or more specifically, the elicitation of existing capabilities), but I think it also belongs to a more fundamentally reliable path on the tech tree. The sooner the industry embraces it, the less time they spend in other parts of the tech tree that are more prone to misoptimization failures, and the less likely it is that someone figures out how to make those misoptimization failures way more efficient.
I suspect there's a crux about the path of capabilities development in there for a lot of people; I should probably get around to writing a post about the details at some point.