I think there is some significance to what you're saying. But there are also several things you say that I think don't make sense.
"Except that if the chain of thought is neuralese, then we actually can differentiate it!"
Firstly, this isn't entirely true. A big part of RLVR is that the model can learn to use more thinking to solve harder problems. And this is not differentiable, because the decision to think for "one more step" is inherently discrete. Whether that extra step means outputting an extra token, or doing an extra recurrent forward pass in an RNN-like model.
Second, token-based CoT is in a sense still differentiable. It's true that it cannot be backpropagated through, but consider e.g. the GRPO objective (with the clipping and KL-divergence terms removed, and a single prompt, for clarity):
Now, you basically sample trajectories for a prompt, compute rewards, compute advantages, compute likelihood ratios, put that in here, do derivatives, and you get an unbiased estimate for
But
So if we take the reward to be the probability assigned to your SFT completion, we are optimizing the same objective, and are in a sense differentiating the chain of thought. The main difference is sample efficiency. That is a big difference, to be clear, and it makes some techniques feasible that otherwise wouldn't be. But it's not a clear-cut categorical difference.
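A sketch of the identity this argument leans on, in assumed notation that is not taken from the comment: $x$ is the prompt, $z$ a sampled chain of thought, $y^*$ the SFT completion, and $\pi_\theta$ the model's distribution over chains.

```latex
\begin{align*}
p_\theta(y^* \mid x)
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[\, p_\theta(y^* \mid x, z) \,\big] \\
\nabla_\theta\, p_\theta(y^* \mid x)
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\Big[\,
       \underbrace{p_\theta(y^* \mid x, z)}_{\text{reward } R(z)}\,
       \nabla_\theta \log \pi_\theta(z \mid x)
       \;+\; \nabla_\theta\, p_\theta(y^* \mid x, z) \,\Big]
\end{align*}
```

The first term is the usual score-function (policy gradient) term with reward $R(z)$. The second is an ordinary gradient through the answer tokens for a fixed sampled chain. Dividing both sides by $p_\theta(y^* \mid x)$ gives the gradient of the log-likelihood, i.e. of the negative SFT loss, so sampling chains and averaging estimates the same quantity that backpropagating through a differentiable chain would compute directly.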
Lastly, and this is somewhat controversial I think: the worry with all gradient-based alignment approaches is that they give you a way to make the output look the way you want on the training distribution, without giving you any guarantees about the underlying generating process being what you want.
CoT-based alignment techniques give you a limited way around this. The CoT is part of the generating process. Most of the computation is not there, but it's an important bottleneck.
CoT-based alignment, like OpenAI's Deliberative Alignment, does a tiny bit of process supervision on the CoT. This intervenes on the way the model produces the output, which is good: it has a better chance of creating non-scheming agents.
It of course risks creating obfuscated or deceptive chain of thought. You're simultaneously training for generating processes that look good and that are good. But that is better than neuralese where you have no insight into the generating process to burn in the first place.
Some papers do fixed length reasoning, which is underpowered for some things and wasteful for others, but it is an option. Also, just doing RL to figure out the length of the reasoning seems way better than trying to do it for every single token of the reasoning process itself.
Anyway, yeah, you can "differentiate" just about anything with policy gradient, and the only issue is sparseness of supervision. So if you want your reward to be
"Firstly, this isn't entirely true. A big part of of RLVR is that the model can learn to use more thinking to solve harder problems. And this is not differentiable, because the decision to think for "one more step" is inherently discrete. Whether that extra step means outputting an extra token, or doing an extra recurrent forward pass in an RNN like model."
I don't see why this is inherently undifferentiable. Just let the model choose a probability p of continuing for a fixed extra number of tokens, then your function is:
You are right. I meant that you can’t backprop through it even with an RNN.
You can differentiate it. That’s what you do with LLMs and RL similar to the policy gradient I described above.
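A minimal sketch of the point in this exchange, under toy assumptions: a single discrete "continue or stop" decision taken with probability `p`, and made-up rewards `r_continue` and `r_stop` for the two branches. The expected reward is smooth in `p` even though each sampled decision is discrete, and the score-function (REINFORCE) estimator recovers its derivative without backpropagating through the sample.

```python
import random

def exact_grad(r_continue, r_stop):
    """d/dp of the expected reward p * r_continue + (1 - p) * r_stop."""
    return r_continue - r_stop

def score_function_grad(p, r_continue, r_stop, n_samples, rng):
    """REINFORCE estimate of d/dp E[R]: the average of R * d/dp log P(action)."""
    total = 0.0
    for _ in range(n_samples):
        if rng.random() < p:
            # Sampled action: continue. d/dp log p = 1 / p
            total += r_continue * (1.0 / p)
        else:
            # Sampled action: stop. d/dp log(1 - p) = -1 / (1 - p)
            total += r_stop * (-1.0 / (1.0 - p))
    return total / n_samples

rng = random.Random(0)
estimate = score_function_grad(0.3, r_continue=1.0, r_stop=0.2,
                               n_samples=200_000, rng=rng)
exact = exact_grad(1.0, 0.2)
```

The estimator is unbiased but noisy, which is the sample-efficiency cost discussed elsewhere in the thread.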
So a nice recipe for learning an inexactly graded problem is:
- The dataset is a bunch of examples of good answers.
- The loss function is the exact same as regular SFT (namely, cross-entropy).
- The only difference is that models do many steps of neuralese reasoning before answering, and we backpropagate through the entire reasoning chain.
To be clear, backpropagation through time is not a recent idea at all. It goes all the way back to RNNs. Or see here for a modern version. But because of its ability to train problems with inexact grading, I claim this recipe should be getting more attention as a path to better-aligned models than we currently give it.
One issue with this approach is that it also inherits all of the vanishing/exploding gradient pathologies of BPTT which plagued RNNs.
There are things you can do about this. The most obvious one is to normalize the gradients at each stage. Also, LSTMs were created to help with this issue.
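One reading of "normalize the gradients at each stage", sketched on a toy linear recurrence `h_t = W h_{t-1}` (the 2x2 matrices and the per-step unit-norm rescaling are illustrative assumptions, not a recommendation from the post). Without normalization, the backpropagated gradient grows or shrinks geometrically with the number of steps; rescaling it at every step keeps its magnitude fixed and preserves only its direction.

```python
import math

def matvec_transpose(W, g):
    """One backward step through h_t = W h_{t-1}: multiply W^T by g (2x2 case)."""
    return [W[0][0] * g[0] + W[1][0] * g[1],
            W[0][1] * g[0] + W[1][1] * g[1]]

def norm(g):
    return math.sqrt(sum(x * x for x in g))

def backprop_through_time(W, grad_last, steps, normalize=False):
    """Propagate dL/dh_T back to dL/dh_0 through `steps` linear recurrences."""
    g = list(grad_last)
    for _ in range(steps):
        g = matvec_transpose(W, g)
        if normalize:
            n = norm(g)
            if n > 0.0:
                g = [x / n for x in g]  # keep the direction, reset the magnitude to 1
    return g

W_big = [[1.5, 0.2], [0.0, 1.3]]    # hypothetical weights that amplify gradients
W_small = [[0.5, 0.1], [0.0, 0.4]]  # hypothetical weights that shrink gradients
g_T = [1.0, 1.0]

exploding = norm(backprop_through_time(W_big, g_T, steps=60))
vanishing = norm(backprop_through_time(W_small, g_T, steps=60))
stable_big = norm(backprop_through_time(W_big, g_T, steps=60, normalize=True))
stable_small = norm(backprop_through_time(W_small, g_T, steps=60, normalize=True))
```

The trade-off is that per-step normalization discards magnitude information, so the result is no longer the true gradient of the loss.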
If these fixes are not enough to make neuralese work, you can take this post as claiming that neuralese would be good if it worked. :)
The post leans on a premise it never states: that training the same imitation objective via BPTT through a neuralese chain yields a more aligned model than reaching that same objective some other way.
The post establishes that neuralese chains can be optimized with cross-entropy + backprop where token chains can't be differentiated through. But "now trainable by gradients" and "now more aligned" are different claims, and the post slides from one to the other without argument.
It doesn't go through for free, because BPTT isn't the only way to drive that objective down. You can target the same loss with policy gradient, using reward = (log-)likelihood of the answer. Or you can use ES, which optimizes a Gaussian-smoothed version of the loss by perturbing parameters: it follows a different gradient, not even an estimate of the SFT one, and still decreases the same loss. In general, any procedure that empirically pushes the model toward predicting the answer is a candidate. Being a faithful estimator of ∇log P(answer) isn't required; it's just one option.
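To make the ES claim concrete, here is a minimal sketch on a one-parameter toy loss (the quadratic `loss`, the step size, and the antithetic sampling scheme are all illustrative assumptions). The update never calls a gradient of the loss; it only evaluates the loss at perturbed parameters, which estimates the gradient of a Gaussian-smoothed version of it.

```python
import random

def loss(theta):
    """Toy stand-in for the SFT loss, minimized at theta = 2."""
    return (theta - 2.0) ** 2

def es_gradient(theta, sigma, n_pairs, rng):
    """Antithetic ES estimate of the gradient of the Gaussian-smoothed loss."""
    total = 0.0
    for _ in range(n_pairs):
        eps = rng.gauss(0.0, 1.0)
        # Compare the loss at mirrored perturbations theta +/- sigma * eps.
        total += (loss(theta + sigma * eps) - loss(theta - sigma * eps)) * eps
    return total / (2.0 * sigma * n_pairs)

rng = random.Random(0)
theta = -1.0
initial_loss = loss(theta)
for _ in range(200):
    theta -= 0.05 * es_gradient(theta, sigma=0.1, n_pairs=20, rng=rng)
final_loss = loss(theta)
```

The loss goes down even though no step used the exact derivative, which is the sense in which different optimizers can share an objective while differing in how they reach it.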
So these methods are objective-equivalent: same loss, same fixed point, differing only in how they get there. Which means the alignment ranking the post wants ("the BPTT route is safer") can't come from the objective, it has to come from a claim about the optimizer's inductive bias, that following direct gradients installs a more benign solution than sampling-based search does.
That might be true or false. The point isn't that direct gradients give a worse prior; it's that the post's whole argument rests on this claim, which is unclear and is never stated as a premise.
Um, okay, sure: I hereby declare as a premise that using policy gradient or evolution strategies to increase
Note that I'm not claiming that the results of those methods would necessarily be less aligned. I'd have to think more about that to make a prediction. But if they are not sample efficient, then in practice people are going to train using either reward models, or with neuralese. And I did in fact argue why training a reward model would have worse alignment properties. Eg: "... the subjectivity of grading makes the grader subject to adversarial optimization. Cheater strategies like prompt-injecting the grader to get a higher score are incentivized."
I don't think BPTT is more sample efficient than GRPO or ES. It is far more compute efficient, but not necessarily more sample efficient, because sample efficiency depends more on the optimizer's "inductive bias" (for instance, how prone it is to produce an overfitting solution) than on its compute efficiency. In that sense I don't think BPTT is better; it is probably worse than, e.g., a random search. Also, there are papers that suggest that CoT RL is very sample efficient.
But still, even if it's easier to optimize the model for a certain reward (e.g. assuming limited compute), that doesn't mean it's good for alignment. If you're suggesting bypassing the reward model because the reward model is easy to fool, and directly optimizing human scores instead, then you're learning to cheat the human, which is worse. Maybe not for learning to write jokes, but for other types of tasks it could be dangerous.
Maybe what you're saying is that since neuralese models are easy to train with BPTT on predictive tasks, you can build a safer oracle AI (e.g. you don't get a human scoring the jokes, you just minimize cross-entropy on a set of jokes using a model with more computation available (CoT/neuralese)). But then again, you can do this with RL or ES too.
But the deeper problem, I think, might be conflating reward with utility. Even if you're optimistic about RLVR/RLHF, the training distribution of tasks and rewards is very different from the actual tasks the model will see during deployment and from what we want from it. What we do is train the model in this setup with some method and expect it to generalize to the deployment distribution. The only thing an optimizer can offer in that sense is a "more benign prior". As a toy example, maybe BPTT would simply learn to write a joke when trained to minimize CE on jokes, while RL would learn that it's an evil model that will take over humanity, and thus write really good jokes to get power. But I doubt whether that's true rather than the opposite, and especially whether, if it were true, it would be worth losing monitorability for.
Maybe your point could be that, since you could train an "Oracle AI" very easily with BPTT, you would get safer AIs that can only do predictive rather than agentic tasks (e.g. create a joke in the distribution of good existing jokes, predict the future ...). But the problem with oracle AIs is that they're not very useful as-is, especially for companies and users, who would likely apply some RL on top to make them agents. And again, the only advantage would come from BPTT having instilled more "benign" strategies into the model.
Hmm, maybe we can do a sample efficiency bet? Do you think that using policy gradient to train a transformer to maximize
"Trying to optimize the chain of thought means we need some kind of automated grader."
Trying to optimize the CoT is the very thing which safety-concerned AI labs do NOT want to do. Suppose that a perfect training environment has no ways for the reward system to see the CoT, only the output. Then cheating strategies like using the CoT to prompt-inject the grader would become useless, causing any hacks to be reflected IN THE OUTPUT.
Additionally, I don't think that our lack of ability to "differentiate through the CoT" is relevant for alignment. It is relevant for the model's ability to figure out what went wrong and to avoid repeating the mistakes.
Any time a lab does RLVR with chain of thought reasoning, they are optimizing the chain of thought, because some chains are more likely to result in a correct answer than others. But obviously letting your reward model see the chain of thought is bad, and is an unforced error in the case of RLVR.
As for the other point, the alternative to differentiating the chain is RL. The point is to avoid RL, and in particular reward models. Are you asking why avoiding RL / reward models is good, or something else?
EDIT: Just to clarify, letting the reward model see the chain of thought is bad whether the task is exactly gradable or not.
How are we to tell apart RL from differentiating the CoT? I thought that both processes are teaching the model to have an output such that it is graded well by some verification process (Lean verification? Having a reward model estimate the output's correctness? Or letting an LLM read through it and grade it?)
Ability to differentiate the CoT creates the additional option of just training with cross entropy loss against known-good answers. This does not require a reward model. As williawa points out, one can kind of imitate this without differentiable CoT by using policy gradient and treating the negative of the cross entropy as a reward, but this is quite a sparse reward, and probably less sample efficient.
The best language models are still getting smarter and more capable. To an increasing degree, this is because they are trained by Reinforcement Learning with Verifiable Rewards. Chain of thought reasoning allows models to evade the finite depth restriction on information flow by passing (relatively little) information back into the first layers of the model through the token stream. Although pretraining was already enough to produce decently-good chains of thought by pure imitation, RLVR allows further optimization of the chain of thought by rewarding those chains of thought that result in a verifiably correct answer. Because tasks that are far too difficult for humans can still be possible to grade, RLVR provides a way of bootstrapping to capabilities far beyond the human level.
Exact and Inexact Grading
What kinds of things can be turned into verifiable rewards? Most obviously, coding tasks where there are unit tests (& maybe performance tests) to determine if a solution was correct. Or in a similar vein, creating mathematical proofs in a formal proof language like Lean or Agda. We can use almost any RL environment that is easy enough for the model to interact with. For example, the model plays a text adventure game where there are rewards for getting certain items or reaching certain rooms.
We'll call this class of problem "exactly graded", because the reward is possible to evaluate without error. Note that the problem statement given to the model as context need not be exact at all. We can ask the model to write its code with only an informal problem description to guide it, but still grade its solutions by running unit tests.
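As a minimal sketch of an exactly graded reward, the grader below runs a submitted solution against unit tests and returns 1.0 or 0.0. The names `grade_solution` and `solve`, and the all-or-nothing scoring, are hypothetical choices for illustration; a real training setup would also sandbox the code, which this sketch does not.

```python
def grade_solution(source, test_cases, func_name="solve"):
    """Return 1.0 if the submitted code passes every test case, else 0.0."""
    namespace = {}
    try:
        exec(source, namespace)  # runs the model's code unsandboxed: illustration only
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        return 0.0  # crashes and missing functions earn no reward
    return 1.0

# The informal task given to the model might be "add two numbers";
# the grading itself is exact regardless of how the task was worded.
tests = [((2, 3), 5), ((-1, 1), 0)]
good = "def solve(a, b):\n    return a + b\n"
bad = "def solve(a, b):\n    return a * b\n"
reward_good = grade_solution(good, tests)
reward_bad = grade_solution(bad, tests)
```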
Another class of problem is where the reward is subjective, or just difficult to compute. An extreme example of this might be asking the model to invent a joke. While this is the kind of question that could benefit from chain of thought reasoning (coming up with a really funny joke can require a lot of thought and iterating on ideas), it's very difficult to automatically evaluate whether a joke is funny. In addition, the training corpus contains few examples where people verbalize their process for coming up with jokes, so purely relying on pretraining to give us reasoning traces to imitate won't get us very far. We'll call this class of problem "inexactly graded".
One option here is to get humans to provide the grades. Too many rollouts are generated during RL for humans to label every rollout, so we usually train a reward model as an intermediary. Human graders assign rewards to a diverse set of rollouts (mixed with manually-created correct answers, perhaps). Then we train a reward model on this dataset of rewards. The loss function here is simple mean-squared error. Then we do our actual RL optimization, where the reward model assigns rewards to each of the large number of rollouts produced by our model during training.
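The reward-model step can be sketched as plain regression. The example below is a toy under stated assumptions: each rollout is reduced to a hypothetical two-number feature vector, the reward model is linear, and the "human scores" are synthetic. A real reward model would be a neural network over the rollout text, but the mean-squared-error objective is the same.

```python
def train_reward_model(features, human_scores, lr=0.1, epochs=2000):
    """Fit a linear reward model r(x) = w.x + b by gradient descent on MSE."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, human_scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(dim):
                grad_w[j] += 2.0 * err * x[j] / n  # d(MSE)/dw_j
            grad_b += 2.0 * err / n                # d(MSE)/db
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_reward(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Synthetic labels generated by 0.5 * x0 - 0.25 * x1 + 0.1 (made up for this sketch).
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
scores = [0.1, 0.6, -0.15, 0.35, 0.3]
w, b = train_reward_model(features, scores)
```

During RL, `predict_reward` would stand in for the human graders on every rollout, which is exactly where the learner gets room to exploit the model's errors.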
One problem with training a reward model as an intermediary is that it opens the door to the learner finding ways to trick the reward model into assigning a higher reward than it should. This can happen in subtle ways, making it hard for AI researchers to reliably notice and fix these issues. Also, even the smaller amount of human labelling required to train a reward model can still be expensive.
So,
Exactly graded problems:
- They are relatively difficult to cheat.
- However a wide variety of problems, including most alignment-flavoured problems, are not exactly graded.
- In general training on these can make your model smarter, but will not make it more friendly.
Inexactly graded problems:
- Trying to optimize the chain of thought means we need some kind of automated grader.
- But the subjectivity of grading makes the grader subject to adversarial optimization. Cheater strategies like prompt-injecting the grader to get a higher score are incentivized.
- Need to manually supply labels, but the supervision they provide is sparse: just a bunch of scalars.
So for inexactly graded problems, figuring out how to train chain-of-thought reasoning is hard. What would be really nice is if we could just make training chain-of-thought as easy as supervised fine-tuning. Instead of obtaining a diverse dataset of various kinds of answers (both good and bad), carefully grading those answers, and then training a reward model, supervised fine-tuning just asks for a dataset of answers that are known to be good. (Yes, this does kind of destroy the appealing [1] "train far beyond human intelligence" part of doing RL. Though this strikes me as less important for inexactly graded problems anyway. And it's not like RL is a mandatory prerequisite for exceeding human ability; see here.)
The problem with just treating this as a supervised fine-tuning problem is that, while we want the model to reason before answering because it improves performance, we can't differentiate through the chain of thought. The random selection of tokens in the chain of thought, besides destroying a lot of information contained in the activations, is non-differentiable.
Except that if the chain of thought is neuralese, then we actually can differentiate it!
Maybe Neuralese is Safer
Usually Neuralese is considered a bad development on the axis of alignment-related concerns, because raw activation vectors are much less interpretable than an approximately-English reasoning trace. This is true. But there are other important alignment properties besides interpretability.
The nice thing about a pure neuralese reasoning trace is that it is completely differentiable. In standard supervised fine-tuning, we differentiate through our model in order to increase the probability it assigns to each training example. If the model now generates a neuralese reasoning trace, then we can just do the exact same thing, except that we must differentiate through many forward passes of the model, joined into a chain by their neuralese vectors. The googleable term for this is backpropagation through time.
So a nice recipe for learning an inexactly graded problem is:
- The dataset is a bunch of examples of good answers.
- The loss function is the exact same as regular SFT (namely, cross-entropy).
- The only difference is that models do many steps of neuralese reasoning before answering, and we backpropagate through the entire reasoning chain.
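The recipe above can be sketched at toy scale. Everything here is an illustrative assumption rather than a real architecture: the "neuralese" state is a single scalar `h`, each reasoning step is `h = tanh(w*h + u*x)`, and the answer is a binary label read out through a sigmoid. The point is only that the cross-entropy gradient flows back through every latent step, with no sampling and no reward model.

```python
import math

def forward(params, x, steps):
    """Run `steps` latent reasoning updates, then read out P(answer = 1)."""
    w, u, v = params
    hs = [0.0]                                    # h_0: initial latent state
    for _ in range(steps):
        hs.append(math.tanh(w * hs[-1] + u * x))  # continuous step, nothing is sampled
    p = 1.0 / (1.0 + math.exp(-v * hs[-1]))
    return hs, p

def cross_entropy(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bptt_grads(params, x, y, steps):
    """Gradient of the cross-entropy loss w.r.t. (w, u, v), through every step."""
    w, u, v = params
    hs, p = forward(params, x, steps)
    d_logit = p - y                    # derivative of sigmoid + cross-entropy
    dv = d_logit * hs[-1]
    dh = d_logit * v                   # gradient arriving at the final latent state
    dw = du = 0.0
    for t in range(steps, 0, -1):
        da = dh * (1.0 - hs[t] ** 2)   # back through tanh
        dw += da * hs[t - 1]
        du += da * x
        dh = da * w                    # hand the gradient to the previous latent state
    return dw, du, dv

def average_loss(params, data, steps=5):
    return sum(cross_entropy(forward(params, x, steps)[1], y) for x, y in data) / len(data)

def train(data, steps=5, lr=0.1, epochs=200):
    params = [0.5, 0.5, 0.5]
    for _ in range(epochs):
        for x, y in data:
            grads = bptt_grads(params, x, y, steps)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params

# "Known-good answers": the label is 1 when the input is positive.
data = [(1.0, 1), (-1.0, 0), (0.5, 1), (-0.5, 0)]
initial_loss = average_loss([0.5, 0.5, 0.5], data)
params = train(data)
final_loss = average_loss(params, data)
```

The training loop is ordinary SFT; the only change from a model that answers immediately is the loop over latent steps inside `forward` and its mirror image inside `bptt_grads`.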
To be clear, backpropagation through time is not a recent idea at all. It goes all the way back to RNNs. Or see here for a modern version. But because of its ability to train problems with inexact grading, I claim this recipe should be getting more attention as a path to better-aligned models than we currently give it.
What about RL for agents?
Agents not only reason in chains, but also repeatedly use tools and interact with their environment to achieve their goals. Training models to be good agents will surely require RL. If we want to train agents to perform inexactly graded tasks, it doesn't seem we can avoid the need to train reward models. But I still think it will be helpful if we can simply backpropagate through the parts of these rollouts that are pure reasoning, instead of treating each reasoning token as an RL action. Reduce the effective trajectory length by reducing the number of things that count as actions, basically. Because the latent reasoning trace is not graded and so is hidden from the reward model, I also expect this reduces the surface area for hacking the reward model.
Are there ways to get SFT-like training of reasoning traces besides Neuralese?
Perhaps there is a way to get the best of both worlds: The interpretability of token-based reasoning and the simple example-based training of SFT or neuralese-SFT?
I think there are some things that can be done here, but they strike me as less simple and less likely to work than the neuralese path. I have already made my most important points, so you can skip this section, as it will be fairly unpolished.
Let be the context, including the question posed to our model. Let be the answer produced by the model, while denotes a ground-truth answer from the training data. (Note that it's not necessarily wrong if any given is very different from . We just want them to be sampled from the same distribution conditional on .) Let be the reasoning trace the model used in coming up with its answer. Let denote the parameters of our learner. During inference, we sample in the following order:
During supervised fine tuning, we have access to question-answer pairs but no reasoning trace linking the two. We'd like a training process that improves the distribution somewhat, where "improve" means that it makes the distribution closer to the empirical distribution .
The basic idea here is to train a helper model that tells us which reasoning traces make a given answer more likely to be produced.
Consider the following two-stage process:
Initialize . Train a model . (The simplest setup is just to concatenate with and generate forwards from there.) Training data for this process comes from computing rollouts using .
Initialize . We'll now train a model . Given a question-answer pair , we sample many reasoning traces . We optimize with policy gradient, where the reward for a given trace is:
This can be broken into a sum of token-level terms. We can add a KL penalty to prevent ourselves from diverging too quickly from a reasonable distribution here.
Besides being complicated, this also re-introduces an intermediate model with parameters that can potentially be gamed. But it is probably still worth trying, especially if one places a high importance on interpretability of chains of thought.
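The setup in this section can be written out roughly as follows. All symbols are assumptions chosen for illustration: $x$ for the context, $z$ for the reasoning trace, $y$ for the model's answer, $y^*$ for the ground-truth answer, $\theta$ for the learner's parameters and $\phi$ for the helper's. The reward shown is one candidate consistent with the description (a log-ratio that splits into token-level terms), not necessarily the exact formula intended.

```latex
\begin{align*}
\text{inference:}\quad
  & z \sim p_\theta(z \mid x), \qquad y \sim p_\theta(y \mid x, z) \\
\text{helper:}\quad
  & q_\phi(z \mid x, y) \ \text{ trained on rollouts } (x, z, y) \text{ from } p_\theta \\
\text{candidate reward:}\quad
  & r(z_i) = \log q_\phi(z_i \mid x, y^*) - \log p_\theta(z_i \mid x) \\
  & \phantom{r(z_i)} = \sum_t \Big[ \log q_\phi(z_{i,t} \mid x, y^*, z_{i,<t})
                                   - \log p_\theta(z_{i,t} \mid x, z_{i,<t}) \Big]
\end{align*}
```

If the helper matched the true posterior over traces, Bayes' rule would give $r(z) = \log p_\theta(y^* \mid x, z) - \log p_\theta(y^* \mid x)$, so under this candidate a trace is rewarded exactly to the extent that it makes the supplied answer more likely than average.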
I asked ChatGPT about the history of this kind of technique and it found the following papers:
https://arxiv.org/pdf/2312.02179 The authors use Markov Chain Monte Carlo to sample chains of thought conditional on instead of training a helper model. They directly train their learner to imitate the sampled in this way instead of doing RL.
https://arxiv.org/pdf/2601.09260 This is a really smart paper. Here the latent is a continuous-valued latent vector, rather than a token-stream chain of thought, and we have a decoder . The thinking process here is that is repeatedly updated according to a learned velocity field. The velocity field is optimized to point in the direction that most increases the probability of the actual answer , relative to its probability at the current value of . This is latent reasoning (basically neuralese!), not natural-language reasoning, and so does not achieve the goal of easy interpretability. But on the plus side: It is a much cleaner idea than the token-based scheme described above.
https://arxiv.org/pdf/2602.14469 Yep. Because we condition on the true answer when generating to train on, post-hoc justification is incentivized, and is bad. IDK about their proposed solution though.
The method I describe above trains by perturbing on-policy reasoning chains towards those that are more likely to generate the supplied answer, rather than updating on some confabulated by cross-entropy. So I predict it would work better than the 2023 paper because of that. It looks worse than flow reasoning, but flow reasoning is uninterpretable, and it's unclear to me which of the two uninterpretable techniques (regular neuralese and flow reasoning) has the advantage.
End
Neuralese will overall be a good thing for alignment.
[1] From the perspective of labs with little concern about existential risk from AI.