Using encoders/decoders seems the most likely. My guess is something like the following: the model will produce a neuralese trace plus an output, and then a separate (possibly fine-tuned) instance of the same model will be fed the neuralese and asked to summarize it in detail. It will produce a much longer, detailed summary (which, to be clear, we won't readily be able to verify), and that summary will be graded.
However, our reasoning encoders will probably be extremely lossy, so this will likely both be very off-policy and cause lots of performance degradation.
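To make the data flow of that grading scheme concrete, here is a minimal sketch. `run_policy`, `decode_trace`, and `grade_summary` are hypothetical stubs standing in for the real model calls, not an actual implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NeuraleseTrace:
    """Opaque per-step activations produced in place of CoT tokens."""
    activations: List[List[float]]  # shape: [num_steps, hidden_dim]


def run_policy(prompt: str) -> Tuple[NeuraleseTrace, str]:
    """Hypothetical stub: the opaque-reasoning model returns (neuralese, final answer)."""
    return NeuraleseTrace(activations=[[0.0] * 8]), "stub answer"


def decode_trace(trace: NeuraleseTrace) -> str:
    """Hypothetical stub: a (possibly fine-tuned) copy of the same model is fed
    the neuralese and asked for a long natural-language summary of the reasoning."""
    return "stub summary, possibly many sentences per activation"


def grade_summary(summary: str) -> float:
    """Hypothetical stub: a grader scores the summary, since the neuralese itself
    cannot be read directly (and the summary can't readily be verified)."""
    return 0.0


def decode_and_grade(prompt: str) -> float:
    trace, _answer = run_policy(prompt)
    summary = decode_trace(trace)   # much longer than the raw trace
    return grade_summary(summary)   # this grade is what feeds back into training
```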
The most obvious reason for using neuralese would be to make the information bottleneck from the end back to the beginning of the transformer less narrow (and stochastic). But it should still be possible to map pretty accurately between neuralese and text, and back, if you use enough text. So I suspect encoders/decoders that are not very lossy will still be possible; they'll just need to generate/ingest a lot more than one token of text per neuralese activation: perhaps one or more sentences per activation, for example. If so, you could still do SFT fairly accurately: you'd just need to translate a lot of text into neuralese.
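As a rough illustration of that last point, here's a sketch of what building such translation-based SFT data might look like; `encode_sentence` is a hypothetical text-to-neuralese encoder (one activation per sentence of written reasoning), not a real API:

```python
from typing import Dict, List


def encode_sentence(sentence: str) -> List[float]:
    """Hypothetical stub: text -> neuralese encoder, one activation per sentence."""
    return [0.0] * 8


def build_sft_example(prompt: str, written_reasoning: List[str], answer: str) -> Dict:
    # Translate the written reasoning into a sequence of neuralese activations
    # that the model can be trained to emit at its hidden reasoning steps.
    neuralese_targets = [encode_sentence(s) for s in written_reasoning]
    return {
        "prompt": prompt,
        "neuralese_targets": neuralese_targets,  # supervision for the opaque steps
        "answer": answer,                        # supervision for the visible output
    }
```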
I don't think neuralese models really lose that much, except for the monitoring ability (which is extremely important, don't get me wrong).
For example, I think you could do a variant of deliberative alignment, even for fully recurrent models.
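One way such a variant could work without ever reading the reasoning is context distillation of the spec: sample answers with the spec in context, then fine-tune the model to produce the same behavior with the spec removed. A minimal sketch, where `sample_with_spec` and the downstream fine-tuning call are hypothetical:

```python
from typing import Dict, List


def sample_with_spec(spec: str, prompt: str) -> str:
    """Hypothetical stub: sample an answer with the spec prepended to the context.
    The model may reason in neuralese internally; we only keep the final answer."""
    return "stub answer"


def build_distillation_data(spec: str, prompts: List[str]) -> List[Dict]:
    data = []
    for prompt in prompts:
        answer = sample_with_spec(spec, prompt)
        # Fine-tune on (prompt -> answer) *without* the spec in context, so the
        # behavior the spec induced gets baked into the weights. At no point do
        # we need to read or edit the reasoning itself.
        data.append({"prompt": prompt, "target": answer})
    return data
```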
What neuralese loses you from an "alignment"-tuning perspective is the ability to update model behavior based on how the CoT *looks*. But my sense is most people already think this is a bad idea and try to avoid it. Even in Deliberative Alignment, you do a little bit of this, but even then it's mostly on-policy, with stuff like rejection sampling or context distillation. And I have still heard people (justifiably) be worried about that aspect.
E.g.:
For example, currently we train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. It’s not obvious that we’ll be able to do training that affects model reasoning like this if models have opaque reasoning though, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace.
In the two examples you gave, I don't think you're right. Why can't we do these? In neither of them do we write the reasoning traces ourselves, or look at them -> modify them -> train on the modified traces.
The model generates reasoning traces, initially from the pretraining prior, and then we select which ones to train on based on the output. The only thing we can't do is, e.g., filter out reasoning traces that don't adequately follow the spec, but my experience is that this is mostly an efficiency thing. The process still works if you omit this step, just somewhat worse.
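Concretely, that selection loop looks something like the following sketch (`generate` and `output_is_good` are hypothetical stubs); the optional "does the reasoning follow the spec?" filter is exactly the step that opaque reasoning takes away:

```python
from typing import Dict, List, Tuple


def generate(prompt: str) -> Tuple[object, str]:
    """Hypothetical stub: returns (opaque_reasoning, final_output)."""
    return object(), "stub output"


def output_is_good(prompt: str, output: str) -> bool:
    """Hypothetical stub: a grader that looks only at the final output."""
    return True


def collect_training_data(prompts: List[str], samples_per_prompt: int = 4) -> List[Dict]:
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            reasoning, output = generate(prompt)
            if output_is_good(prompt, output):  # selection uses the output only
                kept.append({
                    "prompt": prompt,
                    "reasoning": reasoning,  # trained on, but never inspected
                    "output": output,
                })
    return kept
```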
Current LLMs externalize lots of their reasoning in human interpretable language. This reasoning is sometimes unfaithful, sometimes strange and concerning, and LLMs can do somewhat impressive reasoning without using CoT, but my overall impression is that CoT currently is a reasonably complete and accurate representation of LLM reasoning. However, reasoning in interpretable language might turn out to be uncompetitive—if so, it seems probable that opaque reasoning will be adopted in frontier AI labs.
If future AI models have opaque reasoning, this will probably change what training we can apply to these AIs. For example, currently we train models to reason in a good way about math problems, or to reason in a desired way about the spec that we hope they’ll follow. It’s not obvious that we’ll be able to do training that affects model reasoning like this if models have opaque reasoning though, because we can’t just write the reasoning ourselves and do SFT on the reasoning trace. Prior work on training-based control has also discussed using SFT as a technique for:
However, if we couldn’t generate reasoning (because the model’s reasoning is opaque) to go along with the output labels that we want to train on, then these techniques might not work anymore.
Opaque reasoning will only be adopted if there is still some way of efficiently training powerful models with opaque reasoning. So the main concern we have is that opaque reasoning will be differentially bad for training methods targeted at achieving control, compared to training for general capabilities.
This post will speculate on what SFT training-based control will look like for models with opaque reasoning, and then comment on how this should adjust our research priorities in training-based control.
I strongly hope that this post will never become useful because future models will not have opaque reasoning. But reality doesn’t care about this hope of mine much. Hence this post.
Thanks to Sebastian Prasanna for discussion about opaque reasoning.
Thanks to Sebastian Prasanna, James Lucassen, and Eric Gan for feedback on this post.
What might opaque reasoning look like?
To understand how effective SFT will be against hypothetical future models with lots of opaque reasoning (i.e., much more than the current forward pass of a transformer), it's helpful to have a somewhat concrete picture of what these opaque reasoning models might look like. (Although I think it'd probably be bad to push the frontier of neuralese models by thinking too much about what they might look like; only "probably" because of overhang concerns.)
Here are some non-exhaustive possibilities:
I think opaque reasoning seems most likely to arise during RL, and that such models will likely still be initialized from a pretraining prior.
What are our options for doing SFT on models with opaque reasoning, and how well will they work?
In this section, I'll list several techniques for doing SFT on models with opaque reasoning, and discuss how effective I expect them to be. It's worth noting that a major bottleneck to all SFT-based methods is that getting labels in hard domains will likely be difficult, or even impossible. Here's the list:
What should researchers working on training-based control do now?
I’ll conclude by discussing how the above analysis should change training-based control research.
Some training-based control research should be targeted at non-reasoning models. There are lots of basic phenomena that we don't understand in non-reasoning models. Unless you are mainly interested in studying training with reasoning, I think it's best to just avoid reasoning models. Equivalently, you can use reasoning models, but just not treat the reasoning specially. That is, you allow the blue team to train on the reasoning. This could also be seen as a proxy for studying the case where we have good translation tools, or where SFT kind of works.
However, the current situation is that non-reasoning malign inits are basically trivially defeated by silly blue-team training measures. That is, we have failed to create good non-reasoning malign inits for training-based control. Based on concurrent research that my team is doing, I believe this to be a pretty hard problem. Currently, reasoning models seem like the most promising way for the red team to potentially create a malign init that is resistant to training. Playing a red-blue game on reasoning models (where we treat the reasoning as opaque) thus seems like a high priority research direction for diffuse control (which my team plans to investigate in the near future).
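In experiments, the difference between these two regimes can be as small as whether reasoning tokens are included in the blue team's SFT loss. A minimal sketch, assuming a hypothetical flat layout of prompt, reasoning, and answer tokens:

```python
from typing import List


def make_loss_mask(prompt_len: int, reasoning_len: int, answer_len: int,
                   train_on_reasoning: bool) -> List[int]:
    """1 = token contributes to the blue team's SFT loss, 0 = ignored."""
    prompt_mask = [0] * prompt_len                                 # never train on the prompt
    reasoning_mask = [1 if train_on_reasoning else 0] * reasoning_len
    answer_mask = [1] * answer_len
    return prompt_mask + reasoning_mask + answer_mask
```

Setting `train_on_reasoning=False` corresponds to treating the reasoning as opaque; setting it to `True` corresponds to the non-special treatment described above.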
How should we study training-based control on reasoning models? Here are some options:
We hope that future work studying training-based control on models where we treat the reasoning as opaque can shed light on what the situation will be like if future models have much more opaque reasoning than they do currently.