Epistemic status: I think the main point of this post is probably (~80%) false, and there are probably more counterpoints I haven't thought of. I wrote the rest of the post as if my claims are true for ease of reading. I would appreciate it if you told me where my arguments are wrong!
Latent reasoning models (LRMs, popularized by Meta's Coconut paper and improved on substantially by CODI) do their CoT thinking in the model's latent space: instead of applying the LM head that maps d_model-vectors to a distribution over tokens, they feed the final-layer hidden state directly back into the model as the input embedding at the next token position. There are no large-scale LRMs currently; the best public LRMs are GPT-2 scale and specialized for narrow tasks. LRMs might be better for interpretability and safety than models that reason in text CoTs, at least for extremely large transformative AI systems.
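To make the mechanism concrete, here's a minimal toy sketch of Coconut-style decoding. Everything here (the model, sizes, and step counts) is my own illustrative assumption, not the papers' actual code: during latent steps the final hidden state is appended directly as the next input embedding, and the LM head is only used once the model switches to producing visible output.

```python
import torch
import torch.nn as nn

# Toy sketch of Coconut-style latent reasoning. Model, sizes, and step
# counts are illustrative assumptions, not the paper's implementation.

class ToyLRM(nn.Module):
    def __init__(self, vocab_size=100, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # skipped during thinking

    def forward_embeds(self, embeds):
        t = embeds.size(1)
        # causal mask: each position attends only to itself and earlier positions
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.blocks(embeds, mask=mask)

@torch.no_grad()
def generate(model, prompt_ids, n_latent_steps=4, n_output_steps=8):
    embeds = model.embed(prompt_ids)  # (1, seq_len, d_model)
    # Latent "CoT": append the last hidden state directly as the next
    # input embedding, bypassing the LM head entirely.
    for _ in range(n_latent_steps):
        h = model.forward_embeds(embeds)
        embeds = torch.cat([embeds, h[:, -1:, :]], dim=1)
    # Visible answer: ordinary decoding through the LM head.
    out_ids = []
    for _ in range(n_output_steps):
        h = model.forward_embeds(embeds)
        next_id = model.lm_head(h[:, -1, :]).argmax(dim=-1)  # (1,)
        out_ids.append(next_id.item())
        embeds = torch.cat([embeds, model.embed(next_id)[:, None, :]], dim=1)
    return out_ids

print(generate(ToyLRM(), torch.tensor([[1, 2, 3]])))
```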
Why they might be good for safety
tl;dr: direct CoT interp will eventually fail anyway; LRM activations are easier to interpret because they compress whole thoughts into single tokens.
Latent reasoning models will be better for safety as we scale up to larger models. They might also be more capable (and hence more dangerous simply by being more intelligent), but here I'm only claiming that an LRM is safer than a normal model of equal capability.
CoT faithfulness and monitorability are fragile properties of current models that probably won't last as we scale up to the transformative AI that actually matters. I don't expect a lot of (intentional) training on CoTs, but optimizing the output can indirectly change how the model generates CoTs (because the same weights are used for thinking and output), so optimizing the model to talk in a certain way indirectly affects its thinking too. Relying on reading CoTs to catch scheming is already not very effective, and transformative AI models will probably come to think in uninterpretable neuralese anyway as they get heavily optimized.
Latent reasoning models are probably less affected by optimization pressure on output text spilling over into the reasoning, because the thinking tokens are already very different from output tokens: they have to encode complex ideas in a single embedding and don't look like output at all. LRM thinking is also designed to be compressed: it replaces whole sentences of a text CoT with single latent thought tokens that capture the entire thought. This means we're trading several-token sentences of neuralese for a single-token embedding that encompasses one complete thought.
I think that if we get good at interpretability, extracting what the model is thinking from a single token that compresses a whole thought will be easier than trying to understand several tokens of neuralese that all attend to each other. AOs for scaled LRMs (which might be trainable using the same setup used to train the LRM itself, or maybe via Anthropic's secret unpublished technique) might be pretty effective at textualizing latent thoughts, and could give more information about them than a plain-text CoT would.
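As a much cruder baseline than an AO, you could already try logit-lens-style probing on latent thought tokens: project each latent vector through the frozen LM head and read off the nearest tokens. A minimal sketch, reusing the toy model above (my own illustration; real LRM latents may not project meaningfully this way):

```python
import torch

@torch.no_grad()
def nearest_tokens(model, latent, k=5):
    # latent: (d_model,) hidden state that was fed back as a thought token.
    # Projecting it through the LM head gives a rough "which tokens is this
    # thought closest to" readout -- a logit-lens-style approximation.
    logits = model.lm_head(latent)
    return torch.topk(logits, k).indices.tolist()
```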
Aren't polysemantic thoughts bad?
A big assumption I make here is that it's better to compress thoughts into fewer tokens. This will probably result in polysemantic thinking tokens, which encode many ideas in one embedding. That might be a good thing! It may be better to compress a thought into one token that can be interpreted in isolation than to interpret a thought that's spread out over several text tokens which all influence and attend to each other. This is the weakest point of my argument, though; I'm pretty uncertain whether polysemantic thinking tokens are good or not. In the world where LRMs are scaled, it would definitely be important to build tools to interpret their complex polysemantic activations (we have SAEs, but those have problems), although I think this is probably important either way?
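For reference, the standard SAE recipe that would be the starting point for decomposing polysemantic thought tokens looks roughly like this (a generic sketch; the dictionary size and L1 coefficient are placeholder assumptions):

```python
import torch
import torch.nn as nn

# Generic sparse autoencoder sketch for decomposing a polysemantic thought
# token into sparse, hopefully-monosemantic features.

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=64, n_features=512):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

def sae_loss(x, recon, feats, l1_coeff=1e-3):
    # reconstruction error plus an L1 penalty to encourage sparsity
    return ((x - recon) ** 2).mean() + l1_coeff * feats.abs().mean()
```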
Latent reasoning is kinda weird
So latent reasoning models are actually pretty weird? LRMs have the same architecture as non-LRMs, so they have to use the same weights both for generating output and for storing their interim thoughts, which seem like pretty different tasks to me. Non-LRM CoT-trained models have this problem too, to a lesser extent: they use the same weights for writing CoT tokens and output tokens, despite those two things having pretty different distributions.
Architecture-level thinking separation might be good
One way to prevent optimization on the output from affecting the thinking is to use different weights for thinking and output, which is kinda vaguely what Shoggoth+Face+Paraphraser proposes (without the paraphraser). The thinking tokens of LRMs live in a different part of the latent space than output tokens, which already helps separate thinking from output. But we might be able to do even better by giving the model different weights for thinking and output (not necessarily entirely separate models; it might be enough to have different weights for only the last n layers), especially for LRMs, where the thought embeddings are already not expected to be in the same distribution as output.
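A hypothetical sketch of the "different weights for the last n layers" idea: a shared trunk, then separate final blocks for thinking vs. output. The names and sizes here are my own illustration, not from any paper (causal masking omitted for brevity):

```python
import torch.nn as nn

class SplitHeadLRM(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_shared=4, n_split=2, vocab_size=100):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared = nn.ModuleList(make() for _ in range(n_shared))        # trunk
        self.think_layers = nn.ModuleList(make() for _ in range(n_split))   # thinking-only weights
        self.output_layers = nn.ModuleList(make() for _ in range(n_split))  # output-only weights
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, embeds, mode="think"):
        h = embeds
        for layer in self.shared:
            h = layer(h)
        branch = self.think_layers if mode == "think" else self.output_layers
        for layer in branch:
            h = layer(h)
        # thinking mode returns raw latents to feed back in;
        # output mode goes through the LM head as usual
        return h if mode == "think" else self.lm_head(h)
```

The hope would be that output-side optimization pressure mostly updates the output branch, leaving the thinking branch's representations comparatively untouched.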
Other thoughts
LoopLMs are conceptually similar to latent reasoning models, but instead of feeding hidden states forward across token positions, they run the same reasoning layers multiple times within a single forward pass. I'm not familiar enough with them to know if this is good.
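For contrast with the latent-feedback loop sketched earlier, the looped-LM idea at sketch level (illustrative only):

```python
# Looped-LM contrast: the same block is applied repeatedly within one
# forward pass, rather than new latent tokens being appended across positions.
def looped_forward(block, h, n_loops=4):
    for _ in range(n_loops):
        h = block(h)
    return h
```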
Current models already do a lot of latent thinking and aren't really forced to make the CoT look anything like their "real" thoughts. I'm not sure what to think of this.
I have a bit of an intuition that LRMs are less prone to emergent misalignment problems; I haven't thought this through much, though.