Using encoders/decoders seems the most likely. My guess is something like the following: the model will produce a neuralese trace plus an output, and then a separate (possibly fine-tuned) instance of the same model will be fed the neuralese and asked to summarize it in detail. It will produce a much longer, detailed summary (which, to be clear, we won't readily be able to verify), and that summary will be graded.
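As a sketch, the decode-and-grade loop described above might look like the following. All of the classes and method names here are hypothetical stand-ins, not a real API; the "neuralese trace" is just a list of floats for illustration.

```python
# Minimal sketch of the decode-and-grade pipeline. Policy, Decoder, and
# Grader are dummy stand-ins for the real models.

class Policy:
    def run(self, prompt):
        trace = [0.1, -0.7, 0.3]          # opaque neuralese activations (dummy)
        output = f"answer to: {prompt}"
        return trace, output

class Decoder:
    """A separate (possibly fine-tuned) instance of the same model."""
    def summarize(self, trace):
        # In reality this would be a long, detailed natural-language summary
        # that we cannot readily verify against the trace.
        return f"the model considered {len(trace)} latent steps"

class Grader:
    def grade(self, prompt, output, summary):
        # Grade the summary (e.g. for spec compliance); dummy heuristic here.
        return 1.0 if "latent" in summary else 0.0

def decode_and_grade(prompt, policy, decoder, grader):
    trace, output = policy.run(prompt)      # neuralese trace + final output
    summary = decoder.summarize(trace)      # decode neuralese into text
    score = grader.grade(prompt, output, summary)
    return output, summary, score
```

The collusion worry raised below is visible even in this sketch: nothing constrains `Decoder.summarize` to be faithful to the trace it was given.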
J Bostock: The technique you describe here seems like it's very vulnerable to the decoder model colluding with the policy (and this seems super likely if they're the same model!).
Something like this seems more promising if we:
But I'm skeptical of (2) working, because the encoding/decoding might just be really hard to do well.
The technique you describe here seems like it's very vulnerable to the decoder model colluding with the policy
Yes, I was claiming that this was likely, not that it was desirable.
I don't think neuralese models really lose that much, except for the monitoring ability (which is extremely important, don't get me wrong).
For example, I think you could do a variant of deliberative alignment, even for fully recurrent models.
What neuralese loses you from an "alignment"-tuning perspective is the ability to update model behavior based on how the CoT *looks*. But my sense is that most people already think this is a bad idea and try to avoid it. Even in Deliberative Alignment you do a little bit of this, but even then it's mostly done on-policy with things like rejection sampling or context distillation. And I have still heard people (justifiably) be worried about that aspect.
E.g.:
For example, currently we train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. It’s not obvious that we’ll be able to do training that affects model reasoning like this if models have opaque reasoning though, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace.
In the two examples you gave, I think you're not right? Why can't we do these? In neither of them do we write the reasoning traces ourselves, or look at them, modify them, and train on the modified traces.
The model generates reasoning traces, initially from the pretraining prior, and then we select which ones to train on based on the output. The only thing we can't do is, e.g., filter out reasoning traces that don't adequately follow the spec, but in my experience that is mostly an efficiency thing. The process works if you omit this step, just somewhat worse.
In your comment you bring up the question: "Is training the reasoning trace to look nice a bad idea?"
I want to defend some forms of training the reasoning trace to look nice.
For instance, I think it'd be totally reasonable to impose some structure on the reasoning such as:
I think I'm probably more into using SFT to shape reasoning than RL.
I agree that "train the AI to not mention reward hacking in its CoT" won't work if you do it the most naive way, as OAI found https://openai.com/index/chain-of-thought-monitoring/. But I actually think it's plausible that if you applied more pressure to keep the CoT faithful (which I don't have great ideas for how to do atm), this could work.
I agree there are reasons to do it. But I disagree with the specific examples you gave to be honest.
My pitch for doing training on reasoning traces is this: aligning deep learning systems is hard, because we train on outputs, and therefore can't know the process that produced those outputs. But CoT is an exception to this. The CoT seems in some sense load-bearing, which means that intervening on it might allow us to change model behavior by changing *why* models do certain things.
But I think this is very risky and has to be used with a great degree of care. I'd prefer only doing it for stuff that really matters, like alignment.
In general I think training on reasoning traces should mostly be done if you can:
But I actually think it's plausible that if you applied more pressure to keep the CoT faithful (which I don't have great ideas for how to do atm), this could work.
I mean, if we had a way to guarantee that the CoT remained faithful, even as we do random tampering with it, that would be great news. We could just train the CoT to look nice and aligned, and we'd have solved at least half of alignment that way.
I agree that we should be pretty careful with how we do this, and should mostly avoid doing it for now. But it seems potentially good to do research on process-based supervision for reasoning.
williawa: I feel like a large part of the pitch for deliberative alignment is that we get to shape the model's reasoning about the spec.
For example, currently we train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. It’s not obvious that we’ll be able to do training that affects model reasoning like this if models have opaque reasoning though, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace.
I was wrong about math problems --- oops! We typically just supervise on the outputs for math. Thanks for catching this.
For the "reasoning about the spec" thing, I'm confused why the filtering step isn't necessary.
My interpretation of DA is
Both parts can be done without looking at the reasoning traces. The filtering stage is more for efficiency. I've done deliberative alignment experiments, and when you do step 1 without rejection sampling, you get CoTs that mention relevant parts of the spec 80-90% of the time.
Then if you train on that, you get a model that mentions it 80-90% of the relevant times. This is mostly fine; it's just a bit less robust. It usually brings up relevant parts of the spec in obvious instances. But e.g., if you do the deliberative alignment 1-turn only and generalize to multi-turn, it might only mention the spec 50% of the time it's relevant, versus 90% of the time if you do the rejection sampling.
The point being that it's not a fundamentally required part of the technique; it's just used for robustness and reliability.
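The two-step recipe with its optional filtering step can be sketched as follows. Everything here is a hypothetical stand-in: `generate` fakes sampling a (CoT, output) pair with the ~85% spec-mention rate quoted above, and `mentions_spec` / `output_ok` stand in for a spec-mention check and an output-only grader.

```python
import random

def generate(prompt):
    # Stand-in for sampling a (reasoning, output) pair from the model;
    # ~85% of sampled CoTs reference the spec, per the numbers above.
    cot = "cites the spec" if random.random() < 0.85 else "ignores the spec"
    return cot, "some output"

def mentions_spec(cot):
    # Stand-in for the rejection-sampling filter.
    return "spec" in cot and "ignores" not in cot

def output_ok(output):
    # Stand-in for grading the *output only*, never the reasoning.
    return True

def collect_sft_data(prompts, filter_cot=False):
    data = []
    for p in prompts:
        cot, out = generate(p)
        # The optional filter only *selects* traces; nobody writes or
        # edits the reasoning by hand.
        if filter_cot and not mentions_spec(cot):
            continue
        if output_ok(out):
            data.append((p, cot, out))
    return data
```

With `filter_cot=False` the technique still works, just with the lower spec-mention rate; the filter only buys robustness.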
To make it more clear, another interpretation is that deliberative alignment / constitutional AI is just like RLHF, except you give the model some conceptual clothes-hangers so that it can better understand why it should do a given thing.
From that perspective, it's obvious why deliberative alignment would work even without looking at the reasoning trace.
However, our reasoning encoders will probably be extremely lossy, so this will probably both be very off-policy and cause lots of performance degradation.
The most obvious reason for using neuralese would be to make the information bottleneck from the end back to the beginning of the transformer less narrow (and stochastic). But it should still be possible to map pretty accurately between neuralese and text, and back, if you use enough text. So I suspect encoders/decoders that are not very lossy will still be possible; they'll just need to generate/ingest a lot more than one token of text per neuralese activation: perhaps one or more sentences per activation, for example. If so, you could still do SFT fairly accurately: you'd just need to translate a lot of text into neuralese.
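A sketch of what SFT-via-translation might look like, assuming a hypothetical `text_to_neuralese` encoder. Here the encoder is a dummy that maps each sentence of a written trace to one fixed-size vector, matching the "one or more sentences per activation" picture; a real encoder would be a learned model.

```python
# Sketch of constructing an SFT example for a neuralese model by first
# translating a written reasoning trace into activations.
# `text_to_neuralese` is a hypothetical encoder, stubbed out here.

def text_to_neuralese(trace_text, dim=4):
    # One activation per sentence: the encoder ingests a sentence or more
    # of text per neuralese activation, as argued above.
    sentences = [s for s in trace_text.split(".") if s.strip()]
    return [[float(len(s))] * dim for s in sentences]  # dummy vectors

def make_sft_example(prompt, written_trace, target_output):
    # The training target pairs the prompt with the encoded (neuralese)
    # reasoning and the desired final output.
    return {
        "prompt": prompt,
        "neuralese_trace": text_to_neuralese(written_trace),
        "output": target_output,
    }
```

The lossiness worry lives entirely inside `text_to_neuralese`: if the encoder is far off-policy relative to the model's own activations, training on these examples could degrade performance.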
Roger: I'm much more bearish on the encode/decode being possible. Sometimes I think in terms of images or video! Other times I have intuitions that I can't put into words when pressed. Also, there might just not be human interpretable words for explaining what's going on in the AI's computation.
Well, humans are clearly multi-modal, and in the case of a multi-modal LLM, some tokens are images (or parts of one). So presumably in neuralese in a multi-modal LLM, activations would contain a combination of textual and image components (and any other modes the LLM had, which might explain your thoughts that are neither: did they have an emotional component that could be expressed as theme music, or a dynamic that could be expressed in dance?). But it could still be encoded/decoded to that combination. The translation isn't going to be lossless, but with enough text and images and whatever it could be fairly low loss. We have activation oracles now; they decode activations to multi-token strings, and they seem to do a passable job of it. Encoding again is just standard LLM behavior.
We can estimate the number of tokens: perplexity tells you roughly how many bits there are per token, and the dimensionality of the activation space times the minimum number of bits for passable quantization gives you a rough estimate of the amount of data in a single neuralese activation. It's a lot: you probably need somewhere around O(1000) tokens to accurately express one. Maybe a paragraph and a small image, rather than a sentence. That's not an information bottleneck: it's the actual complexity of the activation stream at each layer inside the model.
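The back-of-the-envelope estimate can be made explicit. All of the constants here (model perplexity, activation width, quantization) are illustrative assumptions, not measurements; the point is only that the ratio lands around O(1000) tokens per activation.

```python
import math

# Illustrative assumptions, not measurements:
perplexity_per_token = 16                          # assumed perplexity on text
bits_per_token = math.log2(perplexity_per_token)   # 4 bits/token

d_model = 4096                                     # assumed activation width
bits_per_dim = 3                                   # minimal passable quantization
bits_per_activation = d_model * bits_per_dim       # 12288 bits

tokens_per_activation = bits_per_activation / bits_per_token
print(round(tokens_per_activation))                # → 3072, i.e. O(1000) tokens
```

The estimate is robust to the exact constants: plausible values for perplexity (5-20) and quantization (2-8 bits) keep the answer in the thousands.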
Is your proposal that we have a different (possibly weaker / "trusted") model write a reasoning trace which is part English and part image tokens and stuff, and then we "SFT" on that? I feel like the issue is that a model of a different family might not know the right concept space to write to.
For the translation to/from neuralese, are you familiar with “LatentQA: Teaching LLMs to Decode Activations Into Natural Language” and “Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers”, and that line of research? That's what I'm envisaging, plus the reverse (which I suspect is easier).
As for where you get the trace from, that's as varied as usual for SFT. We could even try writing them by hand, once we had the hang of what the model’s CoT looks like. In practice, you'd probably get it from another reasoning model, preferably one as similar as possible, just as is usual for CoT SFT now.
The Platonic Representation Hypothesis makes me hopeful that these traces might generalize passably across model families, as current CoT traces do. Thinkish makes me less optimistic.
Current LLMs externalize lots of their reasoning in human interpretable language. This reasoning is sometimes unfaithful, sometimes strange and concerning, and LLMs can do somewhat impressive reasoning without using CoT, but my overall impression is that CoT currently is a reasonably complete and accurate representation of LLM reasoning. However, reasoning in interpretable language might turn out to be uncompetitive—if so, it seems probable that opaque reasoning will be adopted in frontier AI labs.
If future AI models have opaque reasoning, this will probably change what training we can apply to these AIs. For example, currently we train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. It’s not obvious that we’ll be able to do training that affects model reasoning like this if models have opaque reasoning though, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace. Prior work on training-based control has also discussed using SFT as a technique for:
However, if we couldn’t generate reasoning (because the model’s reasoning is opaque) to go along with the output labels that we want to train on, then these techniques might not work anymore.
Opaque reasoning will only be adopted if there is still some way of efficiently training powerful models with opaque reasoning. So, the main concern that we have is that opaque reasoning will be differentially bad for training methods targeted at achieving control compared to general capabilities.
This post will speculate on what SFT training-based control will look like for models with opaque reasoning, and then comment on how this should adjust our research priorities in training-based control.
I strongly hope that this post will never become useful because future models will not have opaque reasoning. But reality doesn’t care about this hope of mine much. Hence this post.
Thanks to Sebastian Prasanna for discussion about opaque reasoning.
Thanks to Sebastian Prasanna, James Lucassen, and Eric Gan for feedback on this post.
What might opaque reasoning look like?
To understand how effective SFT will be against hypothetical future models with lots of opaque reasoning (i.e., much more than the current forward pass of a transformer), it’s helpful to have a somewhat concrete picture of what these opaque reasoning models might look like. (Although I think it’d probably be bad to push the frontier of neuralese models by thinking too much about what they might look like; "probably" only because of overhang concerns.)
Here are some non-exhaustive possibilities:
I think that opaque reasoning seems most likely to arise during RL, and think that initializing models from a pre-training prior seems likely.
What are our options for doing SFT on models with opaque reasoning, and how well will they work?
In this section, I’ll list several techniques for doing SFT on models with opaque reasoning, and discuss how effective I expect them to be. It’s worth noting that a major bottleneck to all SFT-based methods is that getting labels in hard domains will likely be hard, or even impossible. Here’s the list:
What should researchers working on training-based control do now?
I’ll conclude by discussing how the above analysis should change training-based control research.
Some training based control research should be targeted at non-reasoning models. There are lots of basic phenomena that we don’t understand in non-reasoning models. Unless you are mainly interested in studying training with reasoning, I think it’s best to just avoid reasoning models. Equivalently, you can use reasoning models, but just not treat the reasoning specially. That is, you allow the blue team to train on the reasoning. This could also be seen as a proxy for studying the case where we have good translation tools, or where SFT kind of works.
However, the current situation is that non-reasoning malign inits are basically trivially defeated by silly blue-team training measures. That is, we have failed to create good non-reasoning malign inits for training-based control. Based on concurrent research that my team is doing, I believe this to be a pretty hard problem. Currently, reasoning models seem like the most promising way for the red team to potentially create a malign init that is resistant to training. Playing a red-blue game on reasoning models (where we treat the reasoning as opaque) thus seems like a high priority research direction for diffuse control (which my team plans to investigate in the near future).
How should we study training-based control on reasoning models? Here are some options:
We hope that future work studying training-based control on models where we treat the reasoning as opaque can shed light on what the situation will be like if future models have much more opaque reasoning than they do currently.