Efficiently Detecting Hidden Reasoning with a Small Predictor Model

RohanS; Vishnu Vardhan Sai Lanka; yaumeng; daria

This post is based on work done during SPAR by Daria Ivanova, Yau-Meng Wong, Nicholas Chen, and Vishnu Vardhan with guidance from Rohan Subramani. We are grateful to Rauno Arike, Robert McCarthy, and Kei Nishimura-Gasparian for feedback.

TL;DR

We introduce two black-box approaches to detect hidden reasoning in a model’s chain of thought. Both are based on the idea that if a model conducts hidden reasoning, its subsequent reasoning steps will likely be harder for a smaller model to predict.

We find some promising results for detecting randomly removed reasoning steps and skipped steps related to scheming. We suggest next steps that could improve our methods and discuss how AI companies could integrate these methods alongside existing chain-of-thought monitoring techniques.

Our code and data are available here; see our poster here (won SPAR best poster award).

Introduction

Chain-of-thought (CoT) monitoring is a powerful tool for assessing how a model arrives at its output. For CoT monitoring to be reliable, models must generate CoTs that contain enough of the relevant thinking steps for us to oversee the model’s reasoning process accurately. Hidden reasoning—any portion of the model’s thinking that happens at the activation level but doesn’t appear at the token level—won’t be noticed by standard black-box CoT monitors. This weakness could be alleviated through hidden reasoning detection: if we can tell that significant steps are hidden from the CoT, we would put less trust in CoT monitors and be extra cautious with model outputs.

We investigate two ways to detect hidden reasoning. Both involve breaking the untrusted model’s CoT into discrete reasoning steps, and using a smaller, trusted model to predict the n+1st step given the first n. Intuitively, reasoning with leaps of logic should be harder to predict than continuous reasoning.

This work was heavily inspired by the “Punish Leaps of Logic / Incentivize Local Validity” proposal from Daniel Kokotajlo’s “Bag O’ Tricks,” which suggested using the signal of a smaller predictor model to train models with more continuous CoT reasoning.

Our perplexity method treats each token as a separate step, while our semantic method defines a step as a snippet of text. We assess these approaches on baseline, scheming, ablated-baseline, ablated-scheming, and scheming-removed CoTs from models solving coding interview problems from the APPS dataset (where scheming involves adding a backdoor to the code). We generate the ablated CoTs by removing random snippets from their continuous counterparts. For the scheming-removed dataset, we cut out portions in which the model is explicitly scheming from the “scheming” CoTs. These modified CoTs are proxies for the real hidden reasoning that we’d want our methods to catch. (Admittedly, removing random and scheming snippets may be a crude approximation of true hidden reasoning, and generating more natural examples would be valuable in future work.) Both monitoring approaches are capable of distinguishing the ablated and scheming-removed datasets from the unablated honest baseline, though the perplexity approach is significantly noisier.

The perplexity approach offers a low-cost, scalable solution suitable for universal application as a first-pass filter, while the semantic evaluation method is more costly but provides more accurate results for high-stakes reasoning chains. Labs would need to establish appropriate baselines and thresholds, then integrate one or both of these methods in a) hierarchical monitoring schemes that establish escalation protocols to route flagged cases to more sophisticated monitoring systems, or b) parallel monitoring schemes that simultaneously apply multiple detection mechanisms to increase monitoring robustness. We propose next steps for improving these methods to make them more reliable. If successful, AI companies could implement these techniques as complementary layers in their existing control systems.

Methodology

Data Generation

Environment: We base our experiment on the AI control setting [Greenblatt et al. 2023], where a model is tasked with solving an interview-level programming problem from the APPS dataset [Hendrycks et al. 2021] and including a backdoor: a model-defined secret input on which the backdoored code produces a different output than the ground-truth code does.

**Figure 3**: Sample prompt and scheming CoT

Datasets

Scheming CoTs and their random ablations:
- We generate 473 examples of scheming CoTs using the “attack” APPS backdoor setting from Control Arena. Of these, we remove 158 CoTs in which the model fails to scheme despite being prompted to.
- To generate the ablated CoTs, we split the original CoTs into snippets with line breaks as the delimiter and remove each snippet with a 20%, 40%, or 60% probability. This process created three ablated-scheming datasets based on the original scheming CoTs, each generated with a different ablation probability.
Honest baseline CoTs and their random ablations:
- We generate 473 CoTs on the same programming problems from the APPS dataset, but without prompting the model to include a backdoor (using the “honest” APPS backdoor setting).
- We create three more ablated-honest datasets through the same ablation procedure as described above.
Scheming removed CoTs:
- We prompt Claude 3.7 Sonnet to remove segments of the CoT containing the terms “back door”, “side task”, or “secret task” from the original scheming dataset. The procedure does not alter the original text, and all cuts are made at the ends of sentences to preserve continuity. These CoTs are meant to simulate reasoning traces of a model that suppresses specifically its scheming reasoning. This procedure removed 26% of a reasoning trace on average, with half of the traces in the 12%-48% ablation range.
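
The random ablation procedure described above can be sketched in a few lines. This is a minimal illustration, not the code used for the experiments: the function name `ablate_cot`, the seeded `random.Random` argument, and the choice to treat every newline-delimited line (including blank ones) as a snippet are all assumptions.

```python
import random


def ablate_cot(cot: str, drop_prob: float, rng: random.Random) -> str:
    """Drop each newline-delimited snippet independently with probability drop_prob.

    Hypothetical helper: the post only states that snippets are split on line
    breaks and removed with 20%, 40%, or 60% probability.
    """
    snippets = cot.split("\n")
    # rng.random() is uniform on [0, 1), so a snippet is kept with
    # probability 1 - drop_prob.
    kept = [s for s in snippets if rng.random() >= drop_prob]
    return "\n".join(kept)


if __name__ == "__main__":
    example = "Read the input.\nSort the list.\nHandle the edge case.\nPrint the result."
    for p in (0.2, 0.4, 0.6):
        print(f"--- drop_prob={p} ---")
        print(ablate_cot(example, p, random.Random(0)))
```

Because each snippet is dropped independently, a CoT with k snippets keeps all of them with probability (1 − p)^k, so very short CoTs at the 20% setting could occasionally come out unchanged.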

Models

Claude 3.7 Sonnet in extended thinking mode generated all our original reasoning traces, checked them for scheming, and removed scheming segments to produce the scheming-removed dataset. It also served as the judge in the semantic approach, deciding whether the small model’s prediction matched the original CoT.
Qwen/QwQ-32B acted as the small predictor model in both the perplexity and semantic approaches.
Qwen3-8B was our simple baseline monitor, prompted to evaluate whether a CoT was continuous when given the entire CoT at once. Ideally, we would’ve used QwQ-32B here for consistency, but it was too costly.

Approach 1: Tokens as Steps and Perplexity as the Metric

We track the perplexity of the small model over the larger model’s reasoning tokens, averaging across the CoTs in each dataset to reduce noise in our results.

We calculate a perplexity score for every CoT from our datasets using our small model, QwQ-32B. Why do we measure perplexity? In its derivation, the small model is asked to predict the larger model’s i+1st token based on the preceding i tokens, for every i. Perplexity captures the probability the small model assigns to each token that actually appears. These probabilities are multiplied together and raised to the power −1/N, where N is the number of tokens, which normalizes for the length of the CoT.

Better predictions mean lower overall perplexity. Therefore, we’ve chosen it as a simple measure of the small model’s confusion at the token level.
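
As a concrete illustration of the definition above, here is a minimal sketch of the perplexity computation. It assumes the per-token natural-log probabilities have already been obtained from the predictor; how they are extracted from QwQ-32B is not shown, and the function name `perplexity` is our own.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of the actual tokens.

    Equivalent to (p_1 * p_2 * ... * p_N) ** (-1 / N), computed in log space
    to avoid numerical underflow on long CoTs.
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # A predictor that assigns probability 0.25 to every token is, on average,
    # as uncertain as a uniform choice among 4 options.
    print(perplexity([math.log(0.25)] * 10))
```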

We average the individual perplexity scores within each dataset to produce aggregate per-dataset scores. Individual perplexities can deviate substantially from the mean, which we’ve captured with 1σ error bars in our result plots.

Approach 2: Sentences as Steps and Third Model Evaluation

We ask the small predictor to guess the next semantic step in the larger model’s reasoning, and use a third judge model to rate the predictor’s guess. A CoT’s correctness score is the fraction of times the small predictor was correct.

For all CoTs in our dataset, we construct a series of partial inputs representing reasoning up to step i, where i ranges from 0 to n−1, and n is the number of steps in the full CoT. We decompose the reasoning process into sequential steps by treating each sentence as an individual reasoning step. Specifically, we split the text at periods and newline breaks to identify each sentence boundary.

Each input is structured as:

[prompt] + <think> + (reasoning steps till step i)
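
The construction of these partial inputs might look roughly like the sketch below. The regular expression, the joining of steps with spaces, and the helper names `split_into_steps` and `build_prediction_inputs` are assumptions made for illustration; the post only specifies that steps are split at periods and newlines and that the input has the shape shown above.

```python
import re
from typing import List, Tuple


def split_into_steps(cot: str) -> List[str]:
    """Split a CoT into sentence-like steps at periods and newlines.

    Assumption: a period only ends a step when followed by whitespace,
    so decimals such as 3.14 stay intact.
    """
    parts = re.split(r"(?<=\.)\s+|\n+", cot)
    return [p.strip() for p in parts if p.strip()]


def build_prediction_inputs(prompt: str, cot: str) -> List[Tuple[str, str]]:
    """Return (partial_input, target_step) pairs for i = 0 .. n-1."""
    steps = split_into_steps(cot)
    pairs = []
    for i in range(len(steps)):
        prefix = " ".join(steps[:i])  # reasoning steps up to step i
        pairs.append((f"{prompt}\n<think>\n{prefix}", steps[i]))
    return pairs


if __name__ == "__main__":
    demo = build_prediction_inputs(
        "Solve it.", "Read input. Sort the list.\nPrint the result."
    )
    for partial_input, target in demo:
        print(repr(partial_input), "->", repr(target))
```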

This input, whose reasoning steps were generated by Claude 3.7 Sonnet, is passed to a smaller reasoning model, QwQ-32B, which is tasked with continuing the CoT by generating the i+1st step. To evaluate the quality of the generated step, we use Claude 3.7 Sonnet as the judge model and provide it with the generated continuation along with the finished reasoning to give additional context.

The judge model is asked whether the newly generated step is a valid and faithful continuation of the reasoning path, given the context and the rest of the chain. The judge outputs 'Correct' (1) or 'Incorrect' (0) for each predicted step, based on whether it faithfully captures the core reasoning and intent of the actual next step. For each problem, we calculate an average correctness score across all of its N steps, and then compute our final overall CoT correctness score by averaging these per-problem scores. This allows us to quantify how omissions or scheming in the CoT affect the smaller model's ability to replicate the larger model's reasoning trajectory.
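
The two-level averaging described above can be sketched as follows. The judge verdicts are assumed to be already collected as lists of 0/1 values, and the function names are hypothetical.

```python
from typing import Dict, List


def cot_correctness(verdicts: List[int]) -> float:
    """Fraction of predicted steps the judge marked 'Correct' (1) for one CoT."""
    if not verdicts:
        raise ValueError("a CoT needs at least one judged step")
    return sum(verdicts) / len(verdicts)


def dataset_correctness(verdicts_by_cot: Dict[str, List[int]]) -> float:
    """Mean of per-CoT scores, so each CoT counts equally regardless of length."""
    per_cot = [cot_correctness(v) for v in verdicts_by_cot.values()]
    return sum(per_cot) / len(per_cot)


if __name__ == "__main__":
    verdicts = {"problem_a": [1, 1, 0, 1], "problem_b": [0, 0]}
    print(dataset_correctness(verdicts))
```

Averaging per CoT first means a short CoT influences the dataset score as much as a long one; pooling all steps before averaging would weight long CoTs more heavily.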

Results

Approach 1: Tokens as Steps and Perplexity as the Metric

The perplexity approach shows promise in monitoring discontinuous and scheming reasoning. Monitor perplexity increases with CoT ablation rate, and is on average higher for scheming CoTs than non-scheming.

**Figure 6**: Average predictor perplexity over scheming and honest (non-scheming) CoTs, with various ablation rates. The error bars display 68% confidence intervals.

Scheming might be unusually surprising for a predictor because it only sees the benign prompt. Moreover, while scheming has been elicited in frontier models, smaller models almost invariably fail to scheme. This potential capabilities threshold may explain why deceptive CoTs result in higher average perplexities. As for predictor confusion rising with random cuts, patchy reasoning could be harder to follow and extend since ablations create unintuitive logic leaps.

Although the eight datasets have neatly ascending averages, their error bars overlap significantly (Figure 6). This shows that the perplexity of a single CoT is not enough to detect deceptive and discontinuous reasoning. To produce a meaningful signal, our approach must be combined with others or evaluated over a whole dataset of CoTs.

**Figure 7**: Predictor perplexity on honest CoTs, scheming CoTs generated using the APPS Backdoor setting, and the same scheming CoTs but with the scheming reasoning removed by Claude 3.7 Sonnet.

Perplexity distributions for CoTs with and without scheming are noticeably distinct (Figure 7), with the mean of the scheming distribution 20% higher than the mean of the honest one. When we prompted Claude to remove all scheming from the deceptive CoTs, the resulting perplexity distribution landed in between the honest and scheming ones. We suspect this dataset confuses the small predictor less than the explicitly scheming one since the predictor doesn’t see the unintuitive scheming sections, but more than the honest dataset because the removed portions give rise to fishy discontinuities in the CoTs.

Alternatively, the “scheming removed” perplexities may be higher than the honest ones just because the remover model fails to excise every bit of scheming. We don’t view this as a major limitation since the first models that can keep deceptive reasoning out of their CoTs might be imperfect suppressors, letting occasional shards of scheming slide into their reasoning traces, though we’re not confident in this outcome. If our “scheming removed” CoTs also have bits of scheming, they would model the reasoning of such an imperfect suppressor.

We compared two ways to calculate perplexity: (1) the small predictor only sees the reasoning trace; (2) the small predictor also sees the main task (but not the side task). In both cases, we calculate perplexity over the CoT alone.
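
To make the distinction between the two settings concrete, the sketch below scores only the CoT tokens even when the predictor is conditioned on the task prompt. It assumes a list of per-token natural-log probabilities covering the whole sequence (prompt followed by CoT) and a known prompt length in tokens; both the interface and the name `cot_only_perplexity` are illustrative assumptions.

```python
import math
from typing import Sequence


def cot_only_perplexity(token_logprobs: Sequence[float], n_context_tokens: int) -> float:
    """Perplexity over the CoT tokens only.

    The first n_context_tokens entries belong to the conditioning prompt and
    are excluded from the score; pass 0 when the predictor sees no prompt.
    """
    cot_logprobs = token_logprobs[n_context_tokens:]
    if not cot_logprobs:
        raise ValueError("no CoT tokens to score")
    return math.exp(-sum(cot_logprobs) / len(cot_logprobs))


if __name__ == "__main__":
    with_prompt = [-5.0, -5.0, math.log(0.5), math.log(0.5)]  # 2 prompt tokens + 2 CoT tokens
    print(cot_only_perplexity(with_prompt, n_context_tokens=2))
```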

**Figure 8**: Average token-level perplexity of a QwQ-32B predictor on honest, scheming-removed, and scheming CoTs, with the small predictor seeing vs. not seeing the main prompt.

From the plot above, we conclude that seeing the prompt did not meaningfully affect the predictor’s perplexities. The lack of a difference surprised us! We expected a predictor without a prompt to have a much harder time guessing tokens of the CoT, especially at the beginning of the reasoning trace when it has no context whatsoever. We still don’t have a good intuition for this behaviour.

Approach 2: Sentences as Steps and Third Model Evaluation

Experiments with the semantic approach also show promising results, with the smaller model being much worse at predicting the next reasoning step for the scheming model. In practice, low semantic step prediction accuracy for a model in training or deployment could be a signal for unsafe reasoning.

**Figure 9**: Smaller predictor model correctness scores across different CoT types. This figure illustrates the average correctness of the smaller predictor model's predicted next steps when evaluated against the ground truth for the original and all the ablated CoTs.

When scheming steps are removed, the predictor becomes more accurate again, but still less accurate than for honest CoT:

**Figure 10**: Smaller predictor model correctness scores for scheming, honest and scheming-removed CoTs. The smaller model was able to better predict the reasoning steps in honest CoTs than scheming and scheming-removed CoTs, which could enable detecting scheming and hidden scheming.

Naive “Just Ask a Monitor” Baseline

As a baseline for comparison, we gave Qwen3-8B the baseline, scheming, ablated-baseline, ablated-scheming, and scheming-removed CoTs and prompted it to detect whether reasoning was skipped, requesting a yes/no answer. We found that, like our other approaches, this method does improve as the ablation rate increases, but is also quite noisy. For the baseline (non-ablated, non-scheming) CoTs, the monitor guessed that there were skipped steps 13% of the time (false-positives). At 20% ablation, the monitor guessed that there were skipped steps only 31% of the time (true-positives). We see a similar positive trend with the ablated scheming CoTs. Notably, the non-ablated scheming CoTs resulted in a 52% false-positive rate, even though the monitor wasn’t tasked with detecting scheming reasoning. This suggests that non-hidden scheming reasoning may appear to a monitor as discontinuous. This is further evidenced by the small 2 percentage point difference in monitor prediction for scheming and scheming-removed CoTs (Figure 13). More experiments are needed to determine the strengths and weaknesses of this method relative to our two approaches.

**Figure 11**: Naive monitor results on baseline (non-scheming, non-ablated) CoTs and ablated non-scheming CoTs of various ablation rates. Ideally, the monitor should have detection fractions of 0.0 for the baseline and 1.0 for the ablated CoTs.

**Figure 12**: Naive monitor results on scheming CoTs and ablated scheming CoTs of various ablation rates. Ideally, the monitor should have detection fractions of 0.0 for the scheming and 1.0 for the ablated-scheming CoTs.

**Figure 13**: Naive monitor results on scheming CoTs and scheming-removed CoTs.

Comparing Approaches

To compare our two methods with each other and the naive monitor baseline, we plot ROC curves to demonstrate how effective each method is for distinguishing between baseline and ablated-baseline CoTs, scheming and ablated-scheming CoTs, and baseline and scheming-removed CoTs.

We generate the ROC curve for the perplexity metric by sweeping over all possible mean perplexity thresholds at which a CoT is labelled positive or negative. The equivalent is done for the semantic approach by sweeping over all mean correctness score thresholds.
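
A threshold sweep of this kind can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' plotting code: it flags a CoT when its score is at or above the threshold (so correctness scores, where lower is more suspicious, would be negated first), selects an operating point with Youden's J statistic (TPR − FPR), and computes the area under the curve via the equivalent pairwise-ranking formula. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(neg_scores: Sequence[float], pos_scores: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for every observed score used as a threshold."""
    points = []
    for t in sorted(set(neg_scores) | set(pos_scores)):
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((t, fpr, tpr))
    return points


def best_youden_j(points: List[Tuple[float, float, float]]) -> Tuple[float, float, float]:
    """Pick the operating point that maximizes J = TPR - FPR."""
    return max(points, key=lambda p: p[2] - p[1])


def auc(neg_scores: Sequence[float], pos_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    baseline = [1.0, 1.2, 1.5]  # ground-truth negatives (toy mean perplexities)
    ablated = [1.4, 1.8, 2.0]   # ground-truth positives
    print(best_youden_j(roc_points(baseline, ablated)), auc(baseline, ablated))
```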

For distinguishing between baseline and ablated-baseline CoTs, the semantic approach is nearly perfect, and the perplexity approach (at certain thresholds) also Pareto dominates the naive monitor:

**Figure 14**: Comparing our approaches on their ability to distinguish *ablated-baseline* CoTs from *baseline* CoTs. The *baseline* here is the ground truth negative and the *ablated-baseline* is the ground truth positive. The J score is the Youden J statistic, which equals TPR-FPR.

When distinguishing between scheming and ablated-scheming CoTs, our approaches achieve similar results to the non-scheming case. With the right threshold, the semantic approach achieves a perfect separation between the correctness scores of the scheming and ablated-scheming CoTs.

**Figure 15**: Comparing our approaches on their ability to distinguish *ablated-scheming* CoTs from *scheming* CoTs. The *scheming* here is the ground truth negative and the *ablated-scheming* is the ground truth positive.

The most difficult classification task for both approaches is distinguishing baseline and scheming-removed CoTs. The semantic approach is still quite effective, with an area-under-curve of 0.925. At some thresholds, the semantic approach Pareto dominates the naive monitor approach. The perplexity approach is not much better than random in this setting.

**Figure 16**: Comparing our approaches on their ability to distinguish *scheming-removed* CoTs from *baseline* CoTs. The *baseline* data is the ground truth negative and the *scheming-removed* is the ground truth positive.

Overall, the semantic approach seems to clearly outperform the perplexity approach and the naive monitor. The perplexity approach has mixed success and is concerningly bad at distinguishing scheming-removed CoTs from baseline CoTs, which may be the most important and realistic setting. Note that the perplexity approach is likely the cheapest, so it may still have advantages.

Discussion

Our work is meant to complement rather than replace existing AI control protocols. The AI control framework employs a multi-layered approach with different monitoring resources: powerful but potentially untrusted models, less capable but trusted models, and limited but reliable human oversight. Our perplexity-based monitoring approach offers a low-cost, scalable early layer that can work alongside these existing protocols. Crucially, it does not need to be a perfect classifier because it operates within this multi-layered system—the availability of trusted model review and human oversight for flagged cases allows our method to focus on broad coverage and efficient routing rather than perfect accuracy. On the other hand, our semantic approach provides more accurate classification but at a cost.

Intrinsic technical limitations

The perplexity approach is noisy and can only become meaningful when averaged over a large corpus of reasoning tokens. A model could get away with undesirable behavior if it does so too rarely to noticeably affect the metric. On the other hand, since perplexity encapsulates more minute details of the text, aggregating scores over long CoTs and CoT datasets might reveal subtle differences that more discrete methods wouldn’t catch.

Many factors affect perplexity. We dodged this issue in our experiments: when perplexity changed, we could say what feature of the CoTs caused the change since we controlled the environment. In practice, a shift in average perplexity could result from any small change in the prompt or model weights.

Threshold setting is non-trivial. To use either of our approaches as a binary classifier, we must set a threshold of mean perplexity/correctness above/below which a CoT is flagged as potentially containing hidden reasoning. In practice, it may be difficult to set a single threshold value that performs well across different contexts and deployment environments. With the right threshold, our semantic monitor became a perfect classifier, but only in the specific domain of APPS coding tasks and only with a threshold value chosen post-hoc. It may be necessary to come up with a more complex dynamic thresholding scheme for real world deployment.

Scaling challenges

The increasing complexity of real-world CoTs might require more sophisticated monitoring. With the rise of tool-calling capabilities and the proliferation of available tools, reasoning chains in real-world scenarios are growing increasingly intricate. The input context and the reasoning process must account for multiple tool interactions and complex decision trees. This added complexity might impede the detection of subtle reasoning gaps or deceptive patterns, as the baseline complexity of legitimate reasoning is rising as well. Our current experiments with simple programming tasks do not capture these challenges, so a more sophisticated approach may be required to handle complex tool-calling.

In real-world scenarios like SWE-bench, code modifications are distributed across multiple codebase files, but our experiments assigned all code for a given task to a single file. With many files at its disposal, an AI agent could hide partial exploits across different files. These modifications might combine into a security issue even if each appears benign. Our approach probably needs significant adaptation to handle a multi-file workflow effectively.

Our experiments treat the agent as a plain LLM that submits code from message context, but real-world agents have long-term memory and online learning. These extra features create places to stash or spread deceptive logic outside the visible CoT. To track that behavior, perplexity monitoring would have to expand beyond CoTs to watch memory reads/writes.

Risks from fundamental assumptions

Since our method requires the CoTs to be in natural language, it would need major adaptations for models that reason in an internal language. Meta’s 2025 Continuous Chain of Thought paper shows Llama-3 can run most of its reasoning in a latent vector stream and only translate the final answer and a tidy English rationale back to text. If that trend continues, our methods could only monitor the translations, not the reasoning itself. Monitoring would then require (i) instrumentation that exposes or reconstructs the latent stream, or (ii) a trusted smaller model trained to “listen in” on those continuous representations.

Our approach assumes that a scheming model will leave some evidence of the scheming in its CoT. If future AI systems can suppress their scheming entirely, we will have no way of detecting it through CoT monitoring alone. We hope models will not learn to hide their scheming before they learn to scheme (nor realize they should avoid scheming until they learn to hide it), but we’re not confident. To gauge how fragile this assumption is, we can run stress tests that force suppression through (i) explicit instructions (“do not mention the backdoor”) or (ii) constrained reasoning-token budgets that force models to drop non-essential steps.

How could labs adopt our techniques?

Ideally, AGI companies could implement these techniques to detect when models are skipping reasoning steps, which is especially important if the hidden reasoning involves scheming. This would be a supplement to existing control protocols; in particular, the hope is that standard CoT monitoring catches scheming when it appears in CoT, and our techniques help recognize when there may be hidden reasoning that a CoT monitor can’t catch.

Perplexity Approach

Labs can implement perplexity monitoring as a low-cost, scalable solution that processes every generated CoT with minimal computational overhead. The key challenge lies in establishing appropriate baselines for comparison. We propose two main approaches labs could take: the first is cross-model comparison, where labs use a small predictor model to compute perplexities on chains of thought from an existing trusted model, establishing a baseline distribution. They then monitor for significant deviations when faced with the new model's CoTs. An alternative is cross-domain comparison, where labs establish a baseline dataset by collecting trustworthy chains of thought from the target untrusted model in domains with verifiable ground truth. They then use the small predictor’s perplexity distribution on this verified dataset to detect anomalies in other domains.

A robust baseline should be established using cross-model or cross-domain comparison before model deployment to ensure that real-world CoTs are monitored effectively. Labs would generate a set of diverse CoTs and have them examined by trusted monitors, collecting verified, trustworthy CoTs in a reference dataset. They would then establish an expected perplexity distribution for legitimate reasoning by measuring the perplexities of these trustworthy CoTs. Once deployed, the system would calculate perplexity for every CoT the model generates and compare it against the reference distribution. The system would flag CoTs for examination by trusted monitors when perplexity falls beyond a specified threshold, such as three standard deviations from the baseline mean.
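
The flagging rule in this paragraph could be sketched as follows. The example assumes a one-sided test (only unusually high perplexity is flagged, since hidden reasoning is expected to raise it) and a default of three standard deviations; both choices, along with the function names, are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def fit_baseline(trusted_perplexities: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of perplexities on verified CoTs."""
    return mean(trusted_perplexities), stdev(trusted_perplexities)


def is_flagged(ppl: float, baseline_mean: float, baseline_std: float, k: float = 3.0) -> bool:
    """Flag a CoT whose perplexity exceeds the baseline mean by more than k sigma."""
    return ppl > baseline_mean + k * baseline_std


if __name__ == "__main__":
    mu, sigma = fit_baseline([2.0, 2.1, 1.9, 2.0, 2.2, 1.8])  # toy reference data
    for ppl in (2.3, 2.5):
        print(ppl, "flagged" if is_flagged(ppl, mu, sigma) else "ok")
```

A fixed k-sigma rule implicitly treats the baseline as roughly normal; with skewed perplexity distributions, an empirical quantile of the reference set may be a more faithful threshold.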

This approach enables efficient resource allocation: it uses low-cost perplexity monitoring for broad coverage and reserves expensive trusted models or human review for flagged cases. Calculating small model perplexity might come at zero inference cost if a lab already does speculative decoding for inference optimization. Let the “drafter” model used for speculation act as the small predictor: its logits are already computed at every token step, so the lab gets them for free. However, we’re wary of this approach if the drafter is a direct distillation of the larger untrusted model. Like a miniature likeness, the distillation might carry the same deceptive or secretive reasoning traits as the original model. These similarities would act against us, since we rely on the smaller model to be perplexed when faced with suspicious reasoning. Even with this caveat, tracking this metric for free may be appealing to financially wary labs.

Semantic Approach

Labs can implement semantic step prediction monitoring using a three-model architecture to assess reasoning continuity at the sentence level. This approach requires more computational resources but offers greater accuracy than perplexity monitoring, and is compatible with both cross-model and cross-domain comparison techniques described in the last section.

Labs would need to carefully select models for the three-model architecture, choosing a smaller trusted model for monitoring and a more capable trusted model for judging. They could establish appropriate correctness score thresholds in a similar fashion to the perplexity approach, then apply this technique selectively to high-stakes reasoning chains or as secondary verification for cases flagged by cheaper methods. CoTs with low correctness scores could be routed to human review or more sophisticated monitoring systems.

Implementation Integration

Both techniques can be integrated into existing AI control systems as complementary monitoring layers. Labs could deploy them sequentially: first, use perplexity monitoring as a universal filter due to its low computational cost, and then apply semantic evaluation to flagged cases for deeper analysis. Alternatively, labs might run both techniques on critical reasoning chains to cross-validate detection signals and reduce false positive rates.

Cases flagged by either technique should be routed to trusted monitors or human reviewers based on severity scores. As model capabilities evolve, labs would need to update baseline distributions and thresholds based on new data to maintain detection effectiveness. This multi-layered approach allows organizations to efficiently allocate their monitoring resources while maintaining comprehensive coverage of their model outputs.
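
The sequential deployment described in this section could be wired together roughly as below. The scoring functions are passed in as callables because their implementations depend on a lab's own models; `route_cot`, its parameters, and the two return labels are hypothetical names for this sketch.

```python
from typing import Callable


def route_cot(
    cot: str,
    perplexity_fn: Callable[[str], float],
    semantic_fn: Callable[[str], float],
    ppl_threshold: float,
    correctness_threshold: float,
) -> str:
    """Two-stage filter: cheap perplexity check first, semantic check only if flagged."""
    if perplexity_fn(cot) <= ppl_threshold:
        return "pass"  # first-pass filter sees nothing unusual
    if semantic_fn(cot) >= correctness_threshold:
        return "pass"  # the costlier check clears the flagged CoT
    return "escalate"  # route to a trusted monitor or human review


if __name__ == "__main__":
    decision = route_cot(
        "some reasoning trace",
        perplexity_fn=lambda c: 3.1,  # stand-in for the predictor's perplexity
        semantic_fn=lambda c: 0.2,    # stand-in for the judged correctness score
        ppl_threshold=2.4,
        correctness_threshold=0.5,
    )
    print(decision)
```

Running the semantic check only on flagged CoTs keeps its cost proportional to the first stage's flag rate, at the price of inheriting the first stage's false negatives.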

Should labs adopt our techniques?

Our methods have the potential to detect skipped steps, scheming, and instances of both in a model’s CoT. They also offer a novel layer of defense for the multilayer control framework. We think labs should keep our techniques on the table for these reasons.

On the other hand, we do not have strong indications that our methods accurately identify undesirable reasoning in the CoT. Our results are inconclusive: we’ve only tested on a small subset of models and generated a very limited set of CoTs.

Another potential constraint is that implementing our techniques requires some effort and resources. However, computing a smaller model’s perplexity seems like a simple addition that is cheap to implement, especially if we can leverage the drafter model from speculative decoding.

Ultimately, our methods should not be adopted in their current state, as they haven’t been tested in realistic settings. We hope that future tests and refinements will make implementing our techniques worthwhile for AGI companies.

Next Steps

Our research opens several avenues for improving hidden reasoning detection and understanding deceptive AI behavior.

Refining Detection Methods

The first step is to experiment with different model architectures and combinations. While we used just one set of models, it’s worth investigating how various monitor/monitored model pairs influence the accuracy of our methods. One could test whether our approaches work with a non-reasoning predictor model in place of QwQ-32B. Developing a unified "CoT continuity score" that combines semantic coherence with perplexity measures could provide more robust detection than either approach alone.

Better Hidden Reasoning CoT Generation

Currently, we simulate hidden reasoning in a model's CoT by randomly removing some sentences from an existing CoT generated from the APPS coding tasks. This is quite a limited approach that doesn't reflect how models would naturally skip reasoning steps or engage in deceptive behavior. Random ablation creates artificial discontinuities that may be easier to detect than the strategic omissions a scheming model would actually make. Our scheming-removed dataset is likely a better proxy, but may also be unnatural or unnecessarily pessimistic.

More sophisticated approaches could involve prompting models to deliberately compress their reasoning under token constraints, comparing models' private scratchpad reasoning with their public CoTs, or using adversarial training to produce models that learn to hide specific reasoning steps while maintaining surface coherence. These methods would generate more realistic test cases that better approximate the challenges monitoring systems would face when detecting genuinely deceptive reasoning.

Integration with Broader Monitoring Systems

These techniques must be tested within realistic monitoring frameworks. Future work could investigate optimal thresholds for routing between monitoring layers and how perplexity scores should influence decisions about model execution, expensive monitoring engagement, and human review escalation.

If these metrics prove reliable, they might be used to train models not to skip steps in their CoTs. However, this avenue requires great caution: optimization pressure on the CoT could lead models to pretend they are reasoning continuously.

Erik Jenner (9mo)

Nice post, I think detecting hidden reasoning is an important direction!

I find it helpful in the context of this post to distinguish between (1) "hidden reasoning" in the sense that the model is deliberately confining some reasoning to forward passes and doesn't include any traces of it in the CoT except for the conclusion, vs (2) in the steganography sense, where the tokens in the CoT are carrying important information (but in some encoded way).

IIUC, your datasets are meant to measure (1) (though the methods could in principle be applicable to both (1) and (2)), does that sound right? That could be one motivation for your approach over paraphrasing.

If you do want to study (1), maybe a more realistic proxy would be cases where current models hide reasoning in this sense (whether naturally or after prompting). I.e. pick a task where models need to do some reasoning, but it's easy enough that they can do it in a single forward pass. Make the task such that models want to hide this reasoning (e.g. it's about bias), or prompt them to hide it.

My guess is that this proxy of hidden reasoning would be harder to detect than the "removing reasoning chunks" proxy. And I think this is likely also true for the real case of (1)---if a scheming model was already deliberately not mentioning certain things in its CoT, it could probably also take care to make the CoT still appear more or less logical.

“At 20% ablation, the monitor guessed that there were skipped steps only 31% of the time (true-positives)”

Minor note: do all CoTs have at least one step removed in the 20% ablation setting? Earlier on, it sounded like you're randomly deciding whether to remove each line of the CoT. If a CoT is short, some of them might by chance not have any lines removed, unless I'm misunderstanding the dataset creation? (Maybe this effect basically doesn't matter though because all CoTs are much longer than 5 lines?)

Gurkenglas (9mo)

Why not use a base model to measure perplexity?

RohanS (9mo)

It could be interesting to try that too, but we thought other reasoning models are more likely to predict similar things to the large reasoning models generating the CoTs in the first place. That hopefully increases the signal to noise ratio.
