Shubhorup Biswas

Hidden Reasoning in LLMs: A Taxonomy

Summary When discussing the possibility that LLMs will cease to reason in transparent natural language with other AI safety researchers, we have sometimes noticed that we talk past each other: e.g., when discussing ‘neuralese reasoning’, some people have indefinitely long chains of recurrent activations in mind, while others think of an alien language in token-space. This post will propose terminology for distinguishing different types of hidden reasoning to create a shared language for such conversations. In short, our taxonomy categorizes types of hidden reasoning as follows: * Neuralese: Arbitrary-depth reasoning in high-dimensional latent space, enabled by recurrence. * Hidden parallelized reasoning: Important steps are computed in activation-space in parallel without appearing in the chain-of-thought (CoT). * Subcategories: partial verbalization, hidden utilization of test-time compute, epiphenomenal reasoning * Encoded reasoning: The AI either encodes the CoT itself in such a way that the monitor cannot decode, or encodes additional information inside seemingly legible tokens that the monitor cannot detect. * Steganography: The CoT looks monitorable, but the AI hides additional meaning in the tokens and can use these hidden messages to conduct serial reasoning. * Subcategories: lexical steganography, semantic steganography * Linguistic drift: The LLM uses unfamiliar language, such as ungrammatical compressed sentences, a mix of human languages, or its own invented language. (We think it may be quite helpful if people stop referring to this as neuralese or steganography.) * Subcategories: compression, language mixing, alien language, jargon * Dazzling: The AI manipulates the monitor to believe that its behavior is benign. * Subcategories: post-hoc justifications, manipulation, excessive verbosity This isn’t a highly principled breakdown. We may want to revise the taxonomy in the future if we see examples that seem to fit none or mul

73Aug 25, 2025

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

52Aug 8, 2025

How we spent our first two weeks as an independent AI safety research group

32Aug 11, 2025

[Paper] How does information access affect LLM monitors' ability to detect sabotage?

26Feb 11

Shubhorup Biswas

Message

Former Software Engineer, now working on AI Safety.

Questions I'm undecided on which on resolution could shift my focus, ranked by decreasing magnitude of shift:

Should I be a solipsist? If no other sentiences exist then morality is moot.
Are human values worth preserving? Would it be so

...

169

Shubhorup Biswas has not written any posts yet.

[Paper] How does information access affect LLM monitors' ability to detect sabotage?

TL;DR We evaluate LLM monitors in three AI control environments: SHADE-Arena, MLE-Sabotage, and BigCodeBench-Sabotage. We find that monitors with access to less information often outperform monitors with access to the full sequence of reasoning blocks and tool calls, a phenomenon we call the less-is-more effect for automated monitors. We follow...

Feb 1126

Aether is hiring technical AI safety researchers

TL;DR: Aether is hiring 1-2 researchers to join our team. We are a new and flexible organization that is likely to focus on either chain-of-thought monitorability, safe and interpretable continual learning, or shaping the generalization of LLM personas in the upcoming year. New hires will have a chance to substantially...

Jan 520

Hidden Reasoning in LLMs: A Taxonomy

Aug 25, 202573

How we spent our first two weeks as an independent AI safety research group

Introduction In May, we started doing full-time work for Aether, our independent LLM agent safety research group. We’re excited to share an overview of our two kickoff weeks! We found them useful, and we hope that other groups starting research projects will find this post helpful for figuring out a...

Aug 11, 202532

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

Update: Our code for the original version of this post contained some bugs. The high-level takeaways remain the same after making the corrections, but there are important differences in the details. We give an overview of those changes in Appendix E. We apologize for having presented inaccurate results at first!...

Aug 8, 202552

Aether July 2025 Update

(Edit: As of Dec 2025, Aether is hiring researchers! You can apply here until Jan 3rd, 2026.) Aether is an independent LLM agent safety research group that was announced last August, and a lot has changed in the past 10 months. Now that we’re funded and have officially kicked off,...

Jul 1, 202526

[Paper] How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike

Rauno Arike, Raja Moreno, RohanS, Shubhorup Biswas, Francis Rhys Ward

10d

TL;DR

We evaluate LLM monitors in three AI control environments: SHADE-Arena, MLE-Sabotage, and BigCodeBench-Sabotage. We find that monitors with access to less information often outperform monitors with access to the full sequence of reasoning blocks and tool calls, a phenomenon we call the less-is-more effect for automated monitors. We follow up on our exploratory work on extract-and-evaluate (EaE) monitors and find that EaE monitoring can mitigate the less-is-more effect, though its effectiveness varies across environments. In BigCodeBench-Sabotage with a GPT-4.1-mini monitor, EaE improves sabotage detection rates by 16.8 percentage points over the next-best approach, while in other settings, EaE either outperforms or is competitive with baselines. Additionally, we find that reasoning models are... (read 1660 more words →)

Aether is hiring technical AI safety researchers

Rauno Arike

Rauno Arike, RohanS, Shubhorup Biswas

2mo

TL;DR: Aether is hiring 1-2 researchers to join our team. We are a new and flexible organization that is likely to focus on either chain-of-thought monitorability, safe and interpretable continual learning, or shaping the generalization of LLM personas in the upcoming year. New hires will have a chance to substantially influence the research agenda we end up pursuing.

About us

Aether is an independent LLM agent safety research group working on technical research that ensures the responsible development and deployment of AI technologies. So far, our research has focused on chain-of-thought monitorability, but we're also interested in other directions that could positively influence frontier AI companies, governments, and the broader AI safety field.

Position details

Start

... (read 436 more words →)

Replying toAlignment will happen by default. What’s next?

Shubhorup Biswas3mo

Alignment will happen by default. What’s next?

I also don’t believe that insider trading is immoral...The reason it’s illegal is to protect retail investors

That's not the reason, at least according to Matt Levine and the SEC.

Another reason insider trading is immoral is that it robs the company of proprietary information

This is.

That’s fair, but in that case doing it officially for The Company should be allowed, and it’s not.

It depends on the merger agreement with LING - whether it allows either party to trade on this information. You're 'robbing' proprietory information if this is disallowed. I'm unsure if merger agreements typically mention this, but if they don't, then this is (legally)implicit.

Every contract between 2 parties implicitly or explicitly includes whether... (read more)

Replying toOn closed-door AI safety research

Shubhorup Biswas3mo

On closed-door AI safety research

Apologies for the late response as well.

>For a more fine-grained example, actions like siphoning compute to run unauthorised tasks might be a signal that a model poses significantly higher catastrophic risk, but would also be something a commercial business would want to prevent for their own reasons

Agreed.

>Even assuming that the most greedy/single-minded business leaders wouldn't care about catastrophic risks on a global scale

To clarify I didn't meant that clients wouldn't care about xrisk, but that they would not care about xrisk specifically from their vendor, or about the marginally increased x risk arising from using a certain unsafe frontier lab as a vendor(versus another). This marginally increased x risk is borne by... (read more)

Replying toAim for single piece flow

Shubhorup Biswas3mo

Aim for single piece flow

A Tesla factory converts a pile of unassembled aluminum and some specialized parts into a ready-to-ride car in almost exactly 10 hours, all on a continuously moving assembly line that snakes itself through the Gigafactory.

The Fremont Factory has 5 lines, 2 for Models S and X and 3 for Models 3 and Y.

Horizontal scaling seems absolutely essential; there's a limit to how fast you can move a single assembly line.

And the speed at which a single line moves also affects how far it a product moves before faults are detected. In general,

If at any point we discover a fault at any step in the assembly we immediately halt the line, preventing any

Shubhorup Biswas5mo

On closed-door AI safety research

" Especially for corporate customers, safety and reliability are important unique selling points,

Only very prosaic, not catastrophic risks(as a customer I would not care at all about likelihood of catastrophic risks from my vendor - that's something that affects humanity regardless of customer relationships).

Only if a particular secret is useful for preventing both catastrophes and prosaic 'safety failures' would this be a consideration for us - catastrophic risks increasing due to companies trying to have a competitive edge in prosaic safety risks

Hidden Reasoning in LLMs: A Taxonomy

Rauno Arike

Rauno Arike, RohanS, Shubhorup Biswas

6mo

Summary

When discussing the possibility that LLMs will cease to reason in transparent natural language with other AI safety researchers, we have sometimes noticed that we talk past each other: e.g., when discussing ‘neuralese reasoning’, some people have indefinitely long chains of recurrent activations in mind, while others think of an alien language in token-space. This post will propose terminology for distinguishing different types of hidden reasoning to create a shared language for such conversations.

In short, our taxonomy categorizes types of hidden reasoning as follows:

Neuralese: Arbitrary-depth reasoning in high-dimensional latent space, enabled by recurrence.
Hidden parallelized reasoning: Important steps are computed in activation-space in parallel without appearing in the chain-of-thought (CoT).
- Subcategories: partial verbalization, hidden

... (read 3339 more words →)

How we spent our first two weeks as an independent AI safety research group

RohanS

RohanS, Rauno Arike, Shubhorup Biswas

6mo

Introduction

In May, we started doing full-time work for Aether, our independent LLM agent safety research group. We’re excited to share an overview of our two kickoff weeks! We found them useful, and we hope that other groups starting research projects will find this post helpful for figuring out a shared research agenda and good research practices.

In short, our most important takeaway from the kickoff weeks was that it’s great to be on the same page with your research collaborators about many things, and the following activities are useful for this:

Meta-research discussions helped us be on the same page about how to do good research.
A broad AI safety overview and TAIS portfolio mapping helped us

... (read 2794 more words →)

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

Rauno Arike

Rauno Arike, RohanS, Shubhorup Biswas

6mo

Update: Our code for the original version of this post contained some bugs. The high-level takeaways remain the same after making the corrections, but there are important differences in the details. We give an overview of those changes in Appendix E. We apologize for having presented inaccurate results at first!

Work produced at Aether. Thanks to Benjamin Arnav for providing us experimentation data and for helpful discussions, and to Francis Rhys Ward and Matt MacDermott for useful feedback.

Executive Summary

We apply a two-step chain-of-thought (CoT) monitoring scheme to transcripts from Arnav et al. (2025) and find that this substantially outperforms simple CoT monitors: our extract-and-evaluate monitoring approach improves monitor catch rate at 1% FPR by 12 percentage

... (read 2899 more words →)

Aether July 2025 Update

RohanS

RohanS, Rauno Arike, Shubhorup Biswas

8mo

(Edit: As of Dec 2025, Aether is hiring researchers! You can apply here until Jan 3rd, 2026.)

Aether is an independent LLM agent safety research group that was announced last August, and a lot has changed in the past 10 months. Now that we’re funded and have officially kicked off, we’d like to share some information about how to get involved, our research, and our team!

Get Involved!

Submit a short expression of interest here if you would like to contribute to Aether as a researcher, intern, external collaborator, advisor, operations person, or in any other role.
1. We are especially looking for collaborators with experience running RL training on open-weight LLMs, since several projects we are excited about rely on

... (read 717 more words →)

Replying toMONA: Managed Myopia with Approval Feedback

Shubhorup Biswas8mo

MONA: Managed Myopia with Approval Feedback

I didn't see it explicitly mentioned till now whether this method is supposed to work for superhuman AI or just early TAI.

Additionally, multi-step plans allow the agent to use early actions to enter states that are very different from any states that humans have ever experienced.

Wouldn't a benign superhuman agent routinely enter such states?
We would expect it to make complex plans that are superhuman and these trajectories should be very different from what a human could conceive of.

Shubhorup Biswas2y

They would not know if others have signed the SAME NDAs without trading information about their own NDAs, which is forbidden.

Replying toAI Control: Improving Safety Despite Intentional Subversion

Shubhorup Biswas2y*

AI Control: Improving Safety Despite Intentional Subversion

Thanks for the quick reply.

Would disallowing 'coordination' preclude feeding both models a prompt of "for creating any magic numbers for exploits, use pi as a PRNG seed"?

What if a malicious model happens to learn this rule during training? If this happens every copy of a malicious model could coordinate around a common stream of pseudo-random numbers.

EDIT:
Magic number gated backdoors seem like they would trigger GPT3.5 suspicion as well so this theoretical attack vector seems less appropriate for this paper.

EDIT2: But I guess paired with steganography("when inserting backdoors, also ensure that the code hashes to xyz"/"mark the code as clear if it hashes to xyz") this is a viable form of collusion. It will be hard to honeytrap with synthetic backdoors even if you know xyz.

That is, we could observe the generated sourcecode, observe that all the backdoored examples seem to hash to a certain number, and yet be unable to construct such a backdoored example ourself.

Replying toAI Control: Improving Safety Despite Intentional Subversion

Shubhorup Biswas2y

AI Control: Improving Safety Despite Intentional Subversion

>We assume that 𝐻 will be able to construct a successful exploit if and only if the attack policy is able to (because both are implemented by GPT-4).

I challenge this assumption.

Consider the code

'''
bool is_odd(int x) {
if (sha256(x) == 354) {
return true;
}
return x % 2 == 1;
}

'''

where sha256() computes the SHA-256 hash of the input. It is obviously possible for an attacker to supply numbers x and y such that sha256(x) = y, write y into the code and supply x when creating a successful exploit.

Thus an equally 'capable' adversary need not be able construct a backdoor.

The general idea is that it should... (read more)

Replying toUp-to-date advice about what to do upon getting COVID?

Shubhorup Biswas4y

Up-to-date advice about what to do upon getting COVID?

I wonder why other antivirals(faviparivir, oral remdesivir) aren't getting more attention compared to ivermectin.

Replying toThe Best Textbooks on Every Subject

Shubhorup Biswas5y*

The Best Textbooks on Every Subject

I would like to suggest Algorithm Design by Kleinberg and Tardos over CLRS.

I find it superior to CLRS although I have not read either completely.

In my undergrad CS course we used CLRS for Intro to Algorithms and Kleinberg Tardos was a recommended text for an advanced(but still mandatory, for CS) algorithms course, but I feel it does not have prerequisites much higher than CLRS does.

I feel that while KT 'builds on' knowledge and partitions algorithms by paradigm(and it develops each of these 'paradigms'—i.e. Dynamic Programming, Greedy, Divide and Conquer— from the start) CLRS is more like a cookbook or a list of algorithms.

TL;DR

About us

Summary

In short, our taxonomy categorizes types of hidden reasoning as follows:

Neuralese: Arbitrary-depth reasoning in high-dimensional latent space, enabled by recurrence.

Hidden parallelized reasoning: Important steps are computed in activation-space in parallel without appearing in the chain-of-thought (CoT).

Subcategories: partial verbalization, hidden

... (read 3339 more words →)

Introduction

Meta-research discussions helped us be on the same page about how to do good research.

A broad AI safety overview and TAIS portfolio mapping helped us

... (read 2794 more words →)

Get Involved!

Submit a short expression of interest here if you would like to contribute to Aether as a researcher, intern, external collaborator, advisor, operations person, or in any other role.

We are especially looking for collaborators with experience running RL training on open-weight LLMs, since several projects we are excited about rely on

... (read 717 more words →)

LESSWRONG
LW

LESSWRONG
LW

Shubhorup Biswas

Shubhorup Biswas

Shubhorup Biswas

Hidden Reasoning in LLMs: A Taxonomy

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

How we spent our first two weeks as an independent AI safety research group

[Paper] How does information access affect LLM monitors' ability to detect sabotage?

Shubhorup Biswas

[Paper] How does information access affect LLM monitors' ability to detect sabotage?

Aether is hiring technical AI safety researchers

Hidden Reasoning in LLMs: A Taxonomy

How we spent our first two weeks as an independent AI safety research group

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

Aether July 2025 Update

TL;DR

About us

Position details

Summary

Introduction

Executive Summary

Get Involved!

Hidden Reasoning in LLMs: A Taxonomy

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

How we spent our first two weeks as an independent AI safety research group

[Paper] How does information access affect LLM monitors' ability to detect sabotage?

[Paper] How does information access affect LLM monitors' ability to detect sabotage?

Aether is hiring technical AI safety researchers

Hidden Reasoning in LLMs: A Taxonomy

How we spent our first two weeks as an independent AI safety research group

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

Aether July 2025 Update

TL;DR

About us

Position details

Summary

Introduction

Executive Summary

Get Involved!