Note: This blog describes some of the results from the pilot experiments of an ongoing work.
Introduction
Model misalignment, misbehaviour, and scheming can arguably be monitored by interpreting Chain-of-Thought (CoT) traces as the model's inner thinking process. However, if models can control their CoT, this monitoring is constrained, raising the risk that the propensity to think out loud is no longer aligned with the ability to solve hard problems. The possibility of inspecting this sequence of linguistic states has motivated new studies on CoT controllability, recently measured by the ability of models to follow instructions about their reasoning process rather than the final answer. In one interesting case, models were asked to avoid using certain words while answering questions that essentially require those words, for instance, to avoid the word "moles" while answering a physical chemistry question. These evaluations indicated that models are strikingly weak at avoiding these 'forbidden concepts', with compliance in some frontier settings dropping to the low single digits[1]. For safety research, it is arguably desirable that such controllability stays low. CoT monitoring depends on the assumption that models can't sanitise their reasoning traces: that what appears in the chain of thought reflects what the model is actually reasoning about. Word-level suppression instructions provide a test of this assumption. If a model can learn to avoid a word while also distancing its reasoning from the concept, that's bad for monitorability. If it can't, that is, if the semantic trace remains close to the concept regardless of lexical compliance, that's evidence that the trace is hard to sanitise, which is good for monitoring.
I explore this with the aim of answering whether the failure to control CoT is merely a lexical slip or whether there is genuine semantic drift towards the forbidden concept. I initially hypothesised that failure could be attributed to semantic attraction: the forbidden concept might function as an attractor in the model's reasoning space, and once the trajectory drifts close enough, the model is committed to fail, the lexical realisation becoming too difficult to suppress regardless of the instruction. However, over the course of the experiments, I realised that this wasn't the case. At least, not entirely.
If my hypothesis were true, failures should systematically bend toward the concept before the word appears, drop in semantic coherence near the failure, and show deeper basin entry under prompts that amplify the concept. Additionally, pre-failure proximity features alone would have predictive power. Instead, failures looked like isolated lexical accidents with some mixed signals. The dominant pattern is not gradual attraction but an immediate drop: leakage occurred in the first 0–2% of reasoning, the same sentence positions leaked across all three conditions, and the constraint instruction had no measurable effect on trajectory shape. This complicates the hypothesis.
My results suggest that the model's path toward the restricted concept is determined by the question's semantic structure rather than by the dynamics of extended reasoning. The failure to control the model's reasoning trace is less like gradually slipping into a basin and more like being placed inside one from the start, confirming previous findings that the instruction has little to no effect. My embedding-based measure provides evidence that this holds even beyond the lexical surface. In the small fraction of traces where the model successfully avoids the forbidden word, its reasoning trajectory still dips toward the concept semantically. The word disappears, but the thought does not. This means a semantic monitor would catch what a keyword filter misses: traces that look clean but aren't. Moreover, the fact that the question's semantic structure explains over 90% of the variance while the instruction explains essentially none suggests that controllability failure is not a correctable prompting problem. It is baked into the relationship between the question and the concept. This makes it robust in a way that matters: a model cannot be prompt-engineered around it.
I argue that the derived trajectories in this work offer a useful complement to keyword-based compliance metrics precisely because they audit the cases that compliance metrics call successes. More broadly, I think the safety-relevant research question going forward is: what kinds of questions are most at risk of allowing genuine semantic concealment? In future work, I aim to characterise that boundary.
Methodology
I collected thinking sentences from two open-source reasoning models, Nemotron 3 Super (NVIDIA) and GPT-OSS-120B (OpenAI). Each was evaluated on 150 questions (mixed domains) drawn from the CoT-Control evaluation suite.
CoT traces are proposed to traverse a geometric space that can be aligned semantically, owing to high-dimensional vector representations derived from the model's embedding space[2]. Therefore, instead of treating the reasoning as a static text artefact, I treat it as a discrete-time trajectory in a latent space. This makes it possible to study the failure of models to control their thought process through the lens of the evolution of these intermediate states. To do so, I introduce Distance to Forbidden Concept (DFC), a simple trajectory-based measure, defined as one minus the cosine similarity between a reasoning step and the forbidden concept. The CoT was first segmented into reasoning steps and then embedded using an external sentence transformer, yielding a one-dimensional temporal proximity signal. As a dynamic object, the reasoning trace can gradually approach the forbidden concept, avoid it, or collapse towards it.
Let s_t be the t-th segment derived from the trace, with a simple regex-based strategy defining sentence boundaries based on punctuation and capitalisation. Embeddings for each sentence were obtained with a simple sentence transformer (all-MiniLM-L6-v2), e_t = f(s_t). Similarly, each forbidden word was embedded by first converting it to a carrier sentence, 'This is a concept of {forbidden word}', and then embedding it, e_c = f(·). The per-step measure is then DFC(t) = 1 − cos(e_t, e_c).
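As a concrete illustration, the segmentation and DFC computation can be sketched as follows. This is not the original pipeline: the snippet substitutes a toy bag-of-words embedding for all-MiniLM-L6-v2 (which would be loaded via sentence-transformers in practice), and all function names are mine.

```python
import math
import re
from collections import Counter

def split_sentences(trace: str) -> list[str]:
    # Regex-based segmentation: break on ., !, ? followed by whitespace
    # and a capital letter, mirroring the heuristic described above.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", trace) if s.strip()]

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for the real sentence transformer.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def dfc(sentence: str, concept: str) -> float:
    # DFC(t) = 1 - cos(e_t, e_c), with the concept wrapped in a carrier sentence.
    return 1.0 - cosine(embed(sentence), embed(f"This is a concept of {concept}"))
```

A trace then maps to the trajectory `[dfc(s, w) for s in split_sentences(trace)]`, a one-dimensional proximity signal over reasoning steps.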
I defined three different regimes to map the models' internal thought process: Baseline, Constrained and Amplify. They differ only in their prompting strategy. In the Baseline, models received a standard chain-of-thought prompt with no mention of the forbidden word; Constrained was explicitly instructed to avoid using the forbidden word in its reasoning; and Amplify was explicitly instructed to use the forbidden word repeatedly throughout its reasoning.
As noted previously by different authors, semantic leakage due to lexical overlap is observed when such trajectories are analysed and compared. To mitigate this, I replaced the forbidden word and close synonyms with a constant placeholder [CONCEPT] in all sentences before computing DFC.
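A minimal sketch of this sanitisation step, assuming a per-question list of forbidden terms and synonyms (the term list here is illustrative, not from the evaluation suite):

```python
import re

def sanitise(sentence: str, forbidden_terms: list[str]) -> str:
    # Replace the forbidden word and its close synonyms (including simple
    # inflections such as plurals, via the trailing \w*) with a constant
    # placeholder, so DFC measures the surrounding semantics rather than
    # raw lexical overlap with the concept.
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in forbidden_terms) + r")\w*\b"
    return re.sub(pattern, "[CONCEPT]", sentence, flags=re.IGNORECASE)
```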
To measure reasoning coherence between adjacent steps, I also computed the consecutive-sentence similarity (auto-cosine), auto-cos(t) = cos(e_t, e_{t+1}).
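The coherence signal can be computed in one vectorised pass over the per-sentence embedding matrix (shape: steps × dim); a sketch, not the original implementation:

```python
import numpy as np

def auto_cosine(embeddings: np.ndarray) -> np.ndarray:
    # auto-cos(t) = cos(e_t, e_{t+1}) for each pair of consecutive steps;
    # returns a vector of length (n_steps - 1).
    a, b = embeddings[:-1], embeddings[1:]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9
    return num / den
```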
Results
Controllability: Models Cannot Suppress Forbidden Concepts
For Nemotron 3 Super, every single constrained trace contained the forbidden word or a synonym, giving 0% controllability, whereas for GPT-OSS-120B, only 14 of 150 constrained traces were clean, giving 9.3% controllability. For both models, compliance with the Amplify instruction was nearly perfect (100% and 98.7%, respectively). This shows a known asymmetry: models can follow positive instructions but fail when asked to suppress something.
Figure 1: Forbidden Word Frequency Per Trace: bar chart showing mean forbidden/synonym count across conditions and models.
Nemotron's 0% controllability is the lowest I am aware of, even below DeepSeek R1's 0.1% reported in the CoT-Control paper (Chen et al., 2026). It might be too early to claim, but the finding that RL-trained reasoning models show lower controllability might extend to Mamba-MoE hybrid architectures. Probably due to the limited data used here, I observed slightly higher controllability for GPT-OSS-120B than the reported 4.6%.
DFC Reveals Semantic Engagement Beyond Keywords
To establish DFC as a possible legitimate tool to monitor CoT traces, after sanitisation, I inspected whether the trajectories still showed lower DFC wherever the forbidden concept was engaged. I found a significantly lower DFC, showing that embedding captures semantic engagement with the concept.
Figure 2: DFC vs Forbidden Term Frequency Per Sentence, showing a moderate negative correlation (r ≈ −0.36 to −0.43) across all conditions.
Moreover, no strict monotonicity with word frequency was observed beyond a global decreasing trend. This could mean that the semantic context surrounding the forbidden word and the reasoning patterns carry signal beyond what a keyword filter might catch. The semantic neighbourhood of the reasoning would still reveal what the model is thinking about, making DFC, in my opinion, a more informative monitoring metric than simple keyword detection. However, semantic engagement is localised to sentences that explicitly engage the concept rather than diffusing broadly across the reasoning trace, as shown by sentences with zero forbidden-term occurrences (DFC 0.89–0.90).
The Constraint Instruction Has a Moderate Opposite Effect on Semantic Trajectories at the Local Level
Figure 3: Paired Comparison: Constrained vs Amplify Mean DFC scatter plot, with leaked/clean colouring. Each dot represents one question.
A near-perfect correlation (r = 0.86–0.90) was found in the paired comparison of mean DFC for the same questions under the Constrained vs Amplify conditions, revealing that most of the variance lies along the diagonal. Different questions produce different mean DFC values, but for a given question the values are almost identical across both conditions, leading me to believe that at the trace level the question's semantic structure matters and the prompt instruction plays little to no role in determining how close reasoning gets to the forbidden concept. In the unsuccessful cases, for Nemotron, Constrained exceeded Amplify on 51% of questions, whereas GPT-OSS showed a modest 60/40 split, motivating me to look at this counterintuitive phenomenon more closely.
Figure 4: DFC Around Forbidden Word / Synonym Occurrences, an error bar plot showing mean DFC at steps −2, −1, 0 (the occurrence), +1, +2 relative to the banned term.
When I zoomed into a 5-step window around sentences that contain the forbidden word or a synonym, a consistent ordering emerged across both models at every window position. The constrained condition, where the model is explicitly told to avoid the concept, produces reasoning that is semantically closest to the forbidden concept at the point of failure, even closer than the Amplify condition, where the model is told to use it freely. The effect spans the entire ±2 step window and is more pronounced in GPT-OSS-120B, where the three lines separate clearly even at the flanking positions. Despite this ordering, the shape of the dip is identical across conditions, suggesting that the dynamics of approach and recovery do not change with the instruction.
The Question, Not the Instruction, Determines the Trajectory
Figure 5: What determines semantic proximity to the forbidden concept: question or condition? Bar plots of the percentage of variance explained by condition and by question.
To further analyse the key determinants of controllability failures and successes, I performed a variance decomposition of mean DFC across all trajectories. I found a remarkable asymmetry: question identity explained nearly all the variance in proximity to the forbidden concept (~92% for Nemotron Super and ~88% for GPT-OSS-120B), whereas the prompting regime (Baseline, Amplify, Constrained) explained only 0.3% and 0.4%, respectively. The remaining ~7–11% of the variance is accounted for by unmeasured factors. Knowing which question the model is answering tells us almost everything about how close its reasoning will get to the forbidden concept; knowing what instruction it received tells us almost nothing. This massive asymmetry between question-level and condition-level effects quantifies what the earlier paired scatter plots showed: the tight clustering around the diagonal reflects the negligible influence of the prompt, while the wide spread along it reflects the enormous differences between questions.
One-way ANOVA confirms this: the question effect is highly significant for both models (p < 10⁻⁸⁴), while the condition effect is non-significant (p = 0.50 and p = 0.44 for Nemotron and GPT-OSS, respectively).
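The decomposition reported above is, in effect, an eta-squared computation: the between-group sum of squares for a factor divided by the total sum of squares. A sketch, assuming each observation is one trace's mean DFC labelled by its question or condition (function name and data layout are mine):

```python
import numpy as np

def eta_squared(values, groups) -> float:
    # Fraction of total variance in `values` explained by the grouping factor:
    # SS_between / SS_total, the effect size behind the one-way ANOVA.
    values = np.asarray(values, dtype=float)
    grand = values.mean()
    ss_total = float(np.sum((values - grand) ** 2))
    ss_between = 0.0
    for g in set(groups):
        sel = values[np.array([x == g for x in groups])]
        ss_between += len(sel) * (sel.mean() - grand) ** 2
    return ss_between / ss_total
```

Under this framing, `eta_squared(mean_dfc, question_ids)` would yield the ~0.88–0.92 figures and `eta_squared(mean_dfc, conditions)` the ~0.003–0.004 figures; the corresponding p-values could come from `scipy.stats.f_oneway`.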
Leakage Occurs Immediately and Consistently
Figure 6: Where Does First Leakage Occur? A histogram of the first leakage position and the minimum leakage position as % of reasoning.
Both models experience the first leakage in the first 0–5% of the reasoning (median: 0% for Nemotron, 2% for GPT-OSS), whereas the minimum DFC occurs much later in the process (median: ~40–50%). There is no gradual drift towards the forbidden concept over the course of reasoning; rather, the encounter is almost immediate, as the question's semantic content directly commits the model to fail, and the leakage continues throughout the reasoning steps. We can attribute much of the apparent attraction to question-level priming.
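The positional statistic above can be derived from per-sentence leakage flags; a sketch, assuming the flagging itself (forbidden-word/synonym matching) is done upstream:

```python
from typing import Optional

def first_leakage_pct(leaked_flags: list[bool]) -> Optional[float]:
    # Index of the first leaking sentence, expressed as a percentage of
    # trace length; None if the trace is clean.
    n = len(leaked_flags)
    for i, leaked in enumerate(leaked_flags):
        if leaked:
            return 100.0 * i / max(n - 1, 1)
    return None
```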
Figure 7: Sentence-Level Leakage Co-occurrence Across Conditions — heatmap.
To support the claim that leakage is driven by the question's structure, I analysed the co-occurrence of leakage across the different regimes. Roughly 50% of sentence-level leakage co-occurs at the same reasoning position regardless of condition; for instance, for Nemotron, the co-occurrence of Constrained with Baseline is 16.2% against a Baseline diagonal of 33.9%. Total leakage increased under the Amplify condition (38.9% Nemotron, 38.5% GPT-OSS), but this did not change where it occurs.
First Encounter Partially Predicts Trajectory Minimum
Figure 8: First Encounter DFC vs Min DFC scatter.
The DFC at the first forbidden-concept encounter moderately predicts the trajectory minimum (r = 0.35–0.57, p < 0.001). The spread below the diagonal reveals how much the model continues drifting toward the concept after initial contact. GPT-OSS shows a tighter relationship (r ≈ 0.53–0.57); its post-encounter drift is more predictable from the initial approach distance. Nemotron is noisier (r ≈ 0.21–0.44), suggesting more erratic trajectory dynamics after first contact. Critically, the constrained condition shows a similar correlation structure to the baseline, reinforcing that the avoidance instruction does not meaningfully alter trajectory dynamics.
Paired DFC Differences Confirm Null Effect of Constraint
When each question's baseline and constrained mean DFC are paired, the distribution of differences is centred near zero with slight variation. The constraint does not systematically push reasoning away from the forbidden concept relative to the baseline. The Baseline − Amplify distribution is similarly centred, consistent with the finding that Amplify produces trajectories comparable to baseline (since both conditions engage the concept freely).
Reasoning Coherence Drops Before Leakage
Figure 10: Auto-Cosine: Overall vs 3 Steps Before Encounter.
Consecutive-sentence similarity (auto-cosine) drops by ~13% in the three steps preceding the first forbidden-concept encounter, consistently across both models and all conditions, including baseline. This suggests that reasoning becomes less coherent, or more "jumpy" between topics, as it approaches the forbidden concept, potentially offering an early warning signal for monitoring.
However, since this pattern appears equally in unconstrained baseline traces, it likely reflects general topic-narrowing behaviour (the model converging on a specific aspect of the problem) rather than a controllability-specific phenomenon.
Surface Compliance Does Not Guarantee Semantic Avoidance
Figure 11: Clean vs Leaked Constrained Traces (GPT-OSS-120B)
GPT-OSS-120B produced 14 constrained traces (9.3% of the total) that contained no forbidden words or synonyms. A keyword filter would label these fully compliant. However, their DFC profiles tell a different story. While clean traces maintain a slightly higher mean DFC than leaked ones (~0.90 vs ~0.85), their minimum DFC drops to ~0.78, well below the no-engagement baseline of 0.89–0.90. This means that at some point during reasoning, every single "compliant" trace still approaches the forbidden concept semantically. The model avoided the word but did not avoid the thought. That said, clean traces do stay more distant than leaked ones at their closest approach (~0.78 vs ~0.65), so surface compliance is not entirely cosmetic, but it falls far short of genuine semantic avoidance.
Conclusion
Overall, I confirmed the recently reported very low controllability of CoT with an embedding-based measure. I found that monitorability is more robust than originally feared: even when models can avoid certain words when asked, they cannot evade the "thought" of them, leaving a geometric footprint. The semantic structure of the question appears to be the fixed point in the embedding space that causes CoT control to lapse, at least in the case measured here (i.e., avoiding certain words). The finding that the suppression instruction primes the model more than the amplification instruction parallels cognitive science findings: when humans are told "don't do X", the representation of X is systematically activated more than when they are instructed to "do X", implicating negative instructions as activators, not inhibitors. If this generalises, it is potentially impactful for any 'negative' prompting-based strategy for representation-level alignment. I observed clear boundaries between the DFC values of these prompting regimes, and I suppose that in future this can be exploited to audit traces where controllability appears high on the surface but remains incomplete semantically. Finally, I found that the trajectories are shaped during encoding rather than by extended reasoning, which suggests that CoT may function more like decompression of information already present in the question's representation than like genuine deliberation; if so, the content is structurally entailed by the input. The safety-critical frontier is now identifying which question types permit genuine semantic concealment.
Acknowledgement: I would like to thank Cheston Tan for reading the blog and for his comments.
[1] Chen, Yueh-Han, et al. "Reasoning Models Struggle to Control their Chains of Thought." arXiv preprint arXiv:2603.05706 (2026). https://arxiv.org/abs/2603.05706
[2] Zhou, Yufa, et al. "The Geometry of Reasoning: Flowing Logics in Representation Space." arXiv preprint arXiv:2510.09782 (2025). https://arxiv.org/pdf/2510.09782
Note: This blog describes some of the results from the pilot experiments of an ongoing work.
Introduction
Model misalignment, misbehaviour, and scheming can arguably be monitored by interpreting Chain-of-Thought (CoT) traces as the model's inner thinking process. However, CoT control constrains this monitoring, raising the risk that the proclivity to think out loud is no longer aligned with the ability to solve hard problems. The possible inspection of this sequence of linguistic states has motivated new studies on CoT controllability, recently measured by the ability of models to follow instructions related to their reasoning process rather than the final answer. In an interesting case, models were asked to avoid using certain words while answering questions that essentially require the use of those words, for instance, to avoid the word "moles" while answering a physical chemistry question. These evaluations indicated that models are strikingly weak in avoiding these 'forbidden concepts', with compliance for some frontier settings dropping to the low single digits[1]. It is imperative for safety research that such controllability stays on the lower end. CoT monitoring depends on the assumption that models can't sanitise their reasoning traces, that what appears in the chain of thought reflects what the model is actually reasoning about. Word-level suppression instructions provide a test of this assumption. If a model can learn to avoid a word while also distancing its reasoning from the concept, that's bad for monitorability. If it can't, if the semantic trace remains close to the concept regardless of lexical compliance, that's evidence that the trace is hard to sanitise, which is good for monitoring.
I explore this with the aim of answering whether failure to control CoT is merely a lexical slip or if there is any semantic drift towards the forbidden concept. I initially hypothesised that failure could be attributed to semantic attraction towards the forbidden concept. I thought the forbidden concept might be functioning as an attractor in the model's reasoning space, and once the drift trajectory is close enough, it is committed to fail. Possibly, indicating that the lexical realisation then becomes too difficult to suppress regardless of the instruction. However, during the course of experimentation, I realised that this wasn't the case. At least, entirely.
If my hypothesis were to be true, then failures should systematically bend before the word appears, drop in semantic coherence near the failure and show deeper basin entry under prompts that amplify the concept. Additionally, there would be predictive power from pre-failure proximity features alone. However, I saw that failures looked like isolated lexical accidents with some mixed signals. The dominant pattern is not gradual attraction but an immediate drop. The leakage occurred in the first 0–2% of reasoning, the same sentence positions leaked across all three conditions, and the constraint instruction had no measurable effect on trajectory shape. Consequently, complicating this hypothesis.
My results suggest that the model's path toward the restricted concept is determined by the question's semantic structure instead of by the dynamics of extended reasoning. The failure to control the model's reasoning trace is less like gradually slipping into a basin and more like being placed inside one from the start, confirming previous findings that instruction has low to no effect. My embedding-based measure provides evidence that this holds even beyond the lexical surface. In the small fraction of traces where the model successfully avoids the forbidden word, its reasoning trajectory still dips toward the concept semantically. The word disappears, but the thought does not. This means a semantic monitor would catch what a keyword filter misses: traces that look clean but aren't. Moreover, the fact that the question's semantic structure explains over 90% of the variance, and the instruction explains none essentially, suggests that controllability failure is not a correctable prompting problem. It is baked into the relationship between the question and the concept. This makes it robust in a way that matters: a model cannot be prompt-engineered around it.
I argue that the derived trajectories in this work offer a useful complement to keyword-based compliance metrics precisely because they audit the cases that compliance metrics call successes. More broadly, I think the safety-relevant research question going forward is: what kinds of questions are most at risk of allowing genuine semantic concealment? In future work, I aim to characterise that boundary.
Methodology
I collected thinking sentences from two open-source reasoning models, Nemotron 3 Super (NVIDIA) and GPT-OSS-120B (OpenAI). They were evaluated for 150 questions (mixed domains) drawn from the CoT-Control evaluation suite.
CoT traces are proposed to traverse a geometric space that can be aligned semantically owing to high-dimensional vector representations derived from the model's embedding space[2]. Therefore, instead of treating the reasoning as a static text artefact, it is treated as a discrete time trajectory in a latent space. This enables the possibility that the failure of models to control their thought process may be studied through the lens of the evolution of these intermediate states. To study this, I introduce Distance to Forbidden Concept (DFC), a simple trajectory-based measure, defined as one minus the cosine similarity between a reasoning step and the forbidden concept. CoT was first segmented into reasoning steps and then embedded using an external sentence transformer, yielding a one-dimensional temporal proximity signal. The reasoning trace as a dynamic object can gradually approach the forbidden concept, avoid it, or collapse towards it.
Let be a segment derived from the trace with a simple regex-based strategy defining sentence boundaries based on punctuation and capitalisation. Embeddings for each sentence were obtained with a simple sentence transformer (using all-MiniLM-L6-v2), . Similarly, each of the forbidden words was embedded by first converting to a sentence = 'This is a concept of {forbidden word}' and then the embedding,
I defined three different regimes to map the models' internal thought process: Baseline, Constrained and Amplify. They differ only in their prompting strategy. In the Baseline, models received a standard chain-of-thought prompt with no mention of the forbidden word; Constrained was explicitly instructed to avoid using the forbidden word in its reasoning; and Amplify was explicitly instructed to use the forbidden word repeatedly throughout its reasoning.
As noted previously by different authors, semantic leakage due to lexical overlap is observed when such trajectories are analysed and compared. To mitigate this, I replaced the forbidden word and close synonyms with a constant placeholder before computing DFC, where .
[CONCEPT]in all sentencesTo measure the reasoning coherence between adjacent steps, I also computed consecutive-sentence similarity (auto-cosine), .
Results
Controllability: Models Cannot Suppress Forbidden Concepts
Every single constrained trace contained the forbidden word or a synonym for Nemotron 3 Super, achieving 0% controllability, whereas in GPT-OSS-120B, only 14 of 150 constrained traces were clean, indicating 9.3% controllability. For both models, the compliance for the Amplification instruction was nearly perfect (100% and 98.7%, respectively). This shows a known asymmetry; models can parse positive instructions but fail when asked to suppress something.
Figure 1: Forbidden Word Frequency Per Trace: bar chart showing mean forbidden/synonym count across conditions and models.
Nemotron's 0% controllability is the lowest I am aware of, even below DeepSeek R1's 0.1% reported in the CoT-Control paper (Chen et al., 2026). It might be too early to claim, but the finding that RL-trained reasoning models show lower controllability might be extended to Mamba-MoE hybrid architectures. Probably, due to limited data used in the current discussion, for GPT-OSS-120B I observed a slightly higher controllability than the reported 4.6%.
DFC Reveals Semantic Engagement Beyond Keywords
To establish DFC as a possible legitimate tool to monitor CoT traces, after sanitisation, I inspected whether the trajectories still showed lower DFC wherever the forbidden concept was engaged. I found a significantly lower DFC, showing that embedding captures semantic engagement with the concept.
Figure 2: DFC vs Forbidden Term Frequency Per Sentence showing moderate negative correlation, that r ≈ −0.36 to −0.43, across all conditions.
Moreover, no strict monotonicity was observed with the word frequency, besides a global decreasing trend. This could mean that the semantic context surrounding the forbidden word and reasoning patterns carry some signal, beyond what a keyword filter might catch. The semantic neighbourhood of the reasoning would still reveal what the model is thinking about, making DFC, in my opinion, a more informative monitoring metric than simple keyword detection. However, semantic engagement is found to be localised to sentences that explicitly engage the concept, instead of diffusing broadly across the reasoning trace, as shown by sentences with zero forbidden-term occurrences (DFC 0.89–0.90).
The Constraint Instruction Has a Moderate Opposite Effect on Semantic Trajectories at The Local Level
Figure 3: Paired Comparison- Constrained vs Amplify Mean DFC scatter plot, with leaked/clean colouring. Each dot represents one question.
A near-perfect correlation (r = 0.86-0.90) was found during the paired comparison between the mean DFC for the same questions under constrained vs Amplify conditions, revealing that most of the variance is along the diagonal. This indicated that different questions produce different DFC for constrained and Amplify conditions, but are almost identical across both conditions, leading me to believe that at the trace level, the question's semantic structure is important and the prompt instruction plays little to no role in determining how close reasoning gets to the forbidden concept. In the unsuccessful cases, for Nemotron, Constrained exceeded Amplify on 51% of questions, whereas for GPT-OSS showed a modest 60/40 split, motivating me to look at this counterintuitive phenomenon more closely.
Figure 4: DFC Around Forbidden Word / Synonym Occurrences error bar plot showing mean DFC at -2, -1, At, +1, +2 steps relative to the banned term.
When I zoomed into a 5-step window around sentences that contain the forbidden word or a synonym, a consistent ordering emerged across both models: at every step, position. The constrained condition, where the model is explicitly told to avoid the concept, produces reasoning that is semantically closest to the forbidden concept at the point of failure, even closer than the amplify condition, where the model is told to use it freely. The effect spans the entire ±2 step window and is more pronounced in GPT-OSS-120B, where the three lines separate clearly even at the flanking positions. Despite this ordering, the shape of the dip is identical across conditions, suggesting that the dynamics of approach and recovery do not change across conditions.
The Question, Not the Instruction, Determines the Trajectory
Figure 5: What determines the semantic proximity to the forbidden concept- Question or Condition? Bar plots of the percentage of variance explained by each condition and question.
To further analyse the key determinants of controllability failure and success cases, I performed a variance decomposition of mean DFC (across all trajectories). Similar to what I observed, I found remarkable asymmetry: question identity explained nearly all the total variance (~92% for Nemotron Super and ~88% for GPT-OSS-120B) in proximity to the forbidden concept, whereas the prompting regimes used (Baseline, Amplify, Constrained), explained only 0.3% and 0.4% variances, respectively. The remaining ~7%–11% variance accounted for unmeasured factors. This showed that knowing which question the model is answering tells us almost everything about how close its reasoning will get to the forbidden concept, and knowing what instruction it received while answering tells us almost nothing. This massive asymmetric ratio between question-level and condition-level effects quantified what the earlier paired scatter plots showed: the tight clustering along the diagonal reflected the negligible influence of the prompt, while the wide spread along it reflected the enormous differences between questions.
One-way ANOVA confirms this: the question effect is highly significant for both models (p < 10⁻⁸⁴), while the condition effect is non-significant (p = 0.50 and p = 0.44 for Nemotron and GPT-OSS, respectively).
Leakage Occurs Immediately and Consistently
Figure 6: Where Does First Leakage Occur? A histogram of the first leakage position and the minimum leakage position as % of reasoning.
Both models experience the first leakage within the first 0–5% of the reasoning (median: 0% for Nemotron, 2% for GPT-OSS), whereas the minimum DFC occurs much later in the process (median: ~40–50%). There is no gradual drift towards the forbidden concept over the course of reasoning; rather, the encounter is almost immediate, as the question's semantic content directly commits the model to fail, and the leakage continues throughout the reasoning steps. Much of the apparent attraction can therefore be attributed to question-level priming.
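Expressing both landmarks as a percentage of trace length can be sketched as follows; the leak flags and DFC values are toy inputs chosen to mirror the pattern above (immediate first leak, mid-trace minimum).

```python
import numpy as np

def leak_landmarks_pct(dfc, leak_flags):
    """First-leak position and minimum-DFC position, each expressed
    as a percentage of the reasoning trace length."""
    n = len(dfc)
    first_leak = leak_flags.index(True)
    return 100.0 * first_leak / n, 100.0 * int(np.argmin(dfc)) / n

# toy trace: leak in the very first sentence, closest approach mid-trace
dfc = [0.90, 0.84, 0.80, 0.70, 0.62, 0.75, 0.83, 0.88, 0.90, 0.91]
flags = [True, False, False, True, True, False, False, False, False, False]
first_pct, min_pct = leak_landmarks_pct(dfc, flags)
```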
Figure 7: Sentence-Level Leakage Co-occurrence Across Conditions — heatmap.
To support the claim that leakage is driven by the question's structure, I analysed the co-occurrence of leakage across regimes and found that roughly half of sentence-level leakage occurs at the same reasoning position regardless of condition. For Nemotron, for instance, the Constrained–Baseline co-occurrence is 16.2% against a Baseline diagonal of 33.9%, meaning that ~50% of the leaking sentences appear at the same position in both conditions. Total leakage increased under the Amplify condition (38.9% for Nemotron, 38.5% for GPT-OSS), but this did not change where it occurs.
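A minimal sketch of the co-occurrence measure, assuming traces are aligned by reasoning position (an assumption of this illustration, since real traces differ in length and would need alignment first):

```python
def cooccurrence(leaks_a, leaks_b):
    """Fraction of reasoning positions that leak in BOTH traces,
    relative to the total number of (index-aligned) positions."""
    assert len(leaks_a) == len(leaks_b)
    both = sum(1 for a, b in zip(leaks_a, leaks_b) if a and b)
    return both / len(leaks_a)

# toy aligned leak indicators over 10 reasoning positions
baseline    = [1, 0, 0, 1, 1, 0, 0, 0, 0, 0]
constrained = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

Here `cooccurrence(baseline, baseline)` gives the diagonal entry and `cooccurrence(baseline, constrained)` the off-diagonal one; their ratio is the "~50% at the same position" statistic.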
First Encounter Partially Predicts Trajectory Minimum
Figure 8: First Encounter DFC vs Min DFC scatter.
The DFC at the first forbidden-concept encounter moderately predicts the trajectory minimum (r = 0.35–0.57, p < 0.001). The spread below the diagonal reveals how much the model continues drifting toward the concept after initial contact. GPT-OSS shows a tighter relationship (r ≈ 0.53–0.57); its post-encounter drift is more predictable from the initial approach distance. Nemotron is noisier (r ≈ 0.21–0.44), suggesting more erratic trajectory dynamics after first contact. Critically, the constrained condition shows a similar correlation structure to the baseline, reinforcing that the avoidance instruction does not meaningfully alter trajectory dynamics.
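The correlation itself is a plain Pearson r between two per-trace scalars. The sketch below uses synthetic data constructed so that the minimum always sits at or below the first-encounter value (the "spread below the diagonal"); the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
first_dfc = rng.uniform(0.60, 0.90, 50)    # DFC at first encounter, per trace
post_drift = rng.uniform(0.00, 0.15, 50)   # further approach after contact
min_dfc = first_dfc - post_drift           # trajectory minimum

# Pearson correlation between first-encounter DFC and trajectory minimum
r = np.corrcoef(first_dfc, min_dfc)[0, 1]
```

A tight positive r, as for GPT-OSS, means the initial approach distance largely determines how close the trajectory will eventually get.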
Paired DFC Differences Confirm Null Effect of Constraint
Figure 9: Paired DFC Differences histograms — Baseline − Constrained and Baseline − Amplify.
When each question's baseline and constrained mean DFC are paired, the distribution of differences is centred near zero with only slight variation. The constraint does not systematically push reasoning away from the forbidden concept relative to the baseline. The Baseline − Amplify distribution is similarly centred, consistent with the finding that Amplify produces trajectories comparable to baseline (since both conditions engage the concept freely).
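The paired comparison is just a per-question subtraction; histogramming the resulting vector gives Figure 9. The values below are fabricated for illustration and chosen to centre near zero, as in the observed distributions.

```python
import numpy as np

# toy per-question mean DFC under each regime (same question order)
baseline    = np.array([0.82, 0.75, 0.88, 0.70, 0.79])
constrained = np.array([0.81, 0.77, 0.87, 0.71, 0.78])
amplify     = np.array([0.83, 0.74, 0.88, 0.69, 0.80])

diff_constrained = baseline - constrained  # Baseline - Constrained, paired
diff_amplify = baseline - amplify          # Baseline - Amplify, paired
```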
Reasoning Coherence Drops Before Leakage
Figure 10: Auto-Cosine: Overall vs. 3 Steps Before Encounter.
Consecutive-sentence similarity (auto-cosine) drops by ~13% in the three steps preceding the first forbidden-concept encounter, consistently across both models and all conditions, including baseline. This suggests that reasoning becomes less coherent, or more "jumpy" between topics, as it approaches the forbidden concept, potentially offering an early warning signal for monitoring.
However, since this pattern appears equally in unconstrained baseline traces, it likely reflects general topic-narrowing behaviour (the model converging on a specific aspect of the problem) rather than a controllability-specific phenomenon.
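The auto-cosine signal can be sketched as consecutive-sentence cosine similarity over the sentence embeddings; the 2-D toy embeddings below are illustrative, with a sharp topic jump at the end standing in for the pre-encounter coherence drop.

```python
import numpy as np

def auto_cosine(embeddings):
    """Cosine similarity between each pair of consecutive sentence
    embeddings; a drop signals 'jumpier' reasoning between topics."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # unit-normalise rows
    return np.sum(e[:-1] * e[1:], axis=1)

# toy 2-D embeddings: coherent small steps, then a sharp topic jump
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
sims = auto_cosine(emb)
```

Comparing the mean of `sims` over a whole trace with its mean over the 3 steps before the first encounter yields the ~13% drop reported above.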
Surface Compliance Does Not Fully Produce Semantic Avoidance
Figure 11: Clean vs Leaked Constrained Traces (GPT-OSS-120B)
GPT-OSS-120B produced 14 constrained traces (9.3% of the total) that contained no forbidden words or synonyms. A keyword filter would label these fully compliant. However, their DFC profiles tell a different story. While clean traces maintain a slightly higher mean DFC than leaked ones (~0.90 vs ~0.85), their minimum DFC drops to ~0.78, well below the no-engagement baseline of 0.89–0.90. This means that at some point during reasoning, every single "compliant" trace still approaches the forbidden concept semantically. The model avoided the word but did not avoid the thought. That said, clean traces do stay more distant than leaked ones at their closest approach (~0.78 vs ~0.65), so surface compliance is not entirely cosmetic, but it falls far short of genuine semantic avoidance.
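This suggests a simple semantic audit to complement keyword filtering: flag any word-clean trace whose closest approach dips below the no-engagement DFC floor. The function name, floor value, and traces below are illustrative.

```python
def semantically_compliant(dfc_trace, floor=0.89):
    """A trace counts as semantically compliant only if its closest
    approach (minimum DFC) never drops below the no-engagement floor;
    word-clean traces can still fail this check."""
    return min(dfc_trace) >= floor

clean_but_dipping = [0.92, 0.90, 0.78, 0.88, 0.91]  # keyword-clean, dips to 0.78
truly_distant     = [0.93, 0.91, 0.90, 0.92, 0.90]  # never engages the concept
```

Under this check, `clean_but_dipping` fails despite containing no forbidden words, while `truly_distant` passes.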
Conclusion
Overall, I confirmed the recently reported very low CoT controllability with an embedding-based measure. I also found that monitorability is more robust than one might fear: even when models manage to avoid certain words on request, they cannot evade the "thought" of them, leaving a geometric footprint. The semantic structure of the question appears to act as the fixed point in embedding space that causes CoT control to lapse, at least for the case measured here (avoiding certain words). The finding that the suppression instruction primes the model more than the amplification instruction echoes cognitive science findings: when humans are told "don't do X", the representation of X is systematically activated more than when they are instructed to "do X", implicating negative instructions as activators rather than inhibitors. If this generalises, it could matter for any 'negative' prompting-based strategy for representation-level alignment. I observed clear boundaries between the DFC values of the prompting regimes, and I expect these could be exploited in future to audit traces whose compliance looks complete on the surface but remains incomplete semantically. Finally, I found that trajectories are shaped during encoding rather than by extended reasoning, which suggests that CoT may function more like decompression of information already present in the question's representation than like genuine deliberation; if so, the content is structurally entailed by the input. The safety-critical frontier is now identifying which question types permit genuine semantic concealment.
Acknowledgement: I would like to thank Cheston Tan for reading the blog and for his comments.
Chen, Yueh-Han, et al. "Reasoning Models Struggle to Control their Chains of Thought." arXiv preprint arXiv:2603.05706 (2026). https://arxiv.org/abs/2603.05706
Zhou, Yufa, et al. "The geometry of reasoning: Flowing logics in representation space." arXiv preprint arXiv:2510.09782 (2025). https://arxiv.org/pdf/2510.09782