Aether July 2025 Update

Rauno Arike; Shubhorup Biswas

[-]Bryce Robertson5mo80

Nice, added to the AI Safety Map and AISafety.com Communities page

[-]jefftk5mo71

Thanks for sharing this! I'd love to see more orgs posting these kinds of updates.

[-]J Bostock5mo41

We are open to feedback that might convince us to focus on these directions instead of monitorability.

What is the theory of impact for monitorability? It seems to be an even weaker technique than mechanistic interpretability, which has at best a dubious ToI.

Since monitoring is pretty superficial, it doesn't give you a signal which you can use to optimize the model directly (that would be The Most Forbidden Technique) in the limit of superintelligence. My take on monitoring is that at best it allows you to sound an alarm if you catch your AI doing something bad.

So then what's the ToI for sounding an alarm? Maybe some people are convinced, but unless your alarm is so convincing (and bare in mind that lots of people will have large financial incentives to remain unconvinced) that you either convince every frontier lab employee, or some very influential politicians, it's useless. Either the lab applies patches until it looks good (which is a variant of TMFT) or another lab which doesn't care as much comes along and builds the AI that kills us anyway.

I think most ideas surrounding "control" will cash out to optimizing the AI against the monitor in one way or another (ex: suppose we stop suspicious command line calls from being executed. If you're using an AI's successful outputs for RL, you're then implicitly optimizing for calls which bypass the monitor, even if the stopped calls aren't used as negative examples) even if they don't seem that way.

[-]RohanS5mo20

What is the theory of impact for monitorability?

Our ToI includes a) increasing the likelihood that companies and external parties notice when monitorability is degrading and even attempt interventions, b) finding interventions that genuinely enhance monitorability, as opposed to just making CoTs look more legible, and c) lowering the monitorability tax associated with interventions in b. Admittedly we probably can’t do all of these at the same time, and perhaps you’re more pessimistic than we are that acceptable interventions even exist.

It seems to be an even weaker technique than mechanistic interpretability, which has at best a dubious ToI.

There are certainly some senses in which CoT monitoring is weaker than mech interp: all of a model’s cognition must happen somewhere in its internals, and there’s no guarantee that all the relevant cognition appears in CoT. On the other hand, there are also important senses in which CoT monitoring is a stronger technique. When a model does a lot of CoT to solve a math problem, reading the CoT provides useful insight into how it solved the problem, and trying to figure that out from model internals alone seems much more complicated. We think it’s quite likely this transfers to safety-relevant settings like a model figuring out how to exfiltrate its own weights or subvert security measures.

We agree that directly training against monitors is generally a bad idea. We’re uncertain about whether it’s ever fine to optimize reasoning chains to be more readable (which need not require training against a monitor), though it seems plausible that there are techniques that can enhance monitorability without promoting obfuscation (see Part 2 of our agenda). More confidently, we would like frontier labs to adopt standardized monitorability evals that are used before deploying models internally and externally. The results of these evals should go in model cards and can help track whether models are monitorable.

in the limit of superintelligence

Our primary aim is to make ~human-level AIs more monitorable and trustworthy, as we believe that more trustworthy early TAIs make it more realistic that alignment research can be safely accelerated/automated. Ideally, even superintelligent AIs would have legible CoTs, but that’s not what we’re betting on with the agenda.

Either the lab applies patches until it looks good (which is a variant of TMFT) or another lab which doesn't care as much comes along and builds the AI that kills us anyway.

It’s fair to be concerned that knowing models are becoming unmonitorable may not buy us much, but we think this is a bit too pessimistic. It seems like frontier labs are serious about preserving monitorability (e.g. OpenAI), and maybe there are some careful interventions that actually work to improve monitorability! Parts of our agenda are aimed at studying what kinds of interventions degrade vs enhance monitorability.

[-]J Bostock5mo30

Our funder wishes to remain anonymous for now.

This is suspicious. There might be good reasons, but given the historical pattern of:

Funder with poor track record on existential safety funds new "safety" lab
"Safety" lab attracts well-intentioned talent
"Safety" lab makes everything worse

I'm worried that your funder is one of the many, many people with financial stakes in capabilities and reputational stakes in pretending to look good. The specific research direction does not fill me with hope on this front, as it kinda seems like something a frontier lab might want to fund.

[-]RohanS5mo62

Our funder is not affiliated with a frontier lab and has provided us support with no expectation of financial returns. We have also had full freedom to shape our research goals (within the broad agreed-upon scope of “LLM agent safety”).

[-]J Bostock5mo30

Thanks for the clarification, that's good to hear.

Get Involved!

Submit a short expression of interest here if you would like to contribute to Aether as a researcher, intern, external collaborator, advisor, operations person, or in any other role.

We are especially looking for collaborators with experience running RL training on open-weight LLMs, since several projects we are excited about rely on this. If you have this experience and submit an EoI, we may reach out to you about a paid role.

Join our discord with this invite link! We aim to cultivate valuable LLM agent safety research discussions here, and we’ll likely invite people to join for occasional sprints, hackathons, and project collaborations.

Get in touch with Rohan at rs4126@columbia.edu with any questions.

Research

Aether's goal is to conduct technical research that yields valuable insights into the risks and opportunities that LLM agents present for AI safety. We believe that LLM agents have substantial implications for AI alignment priorities by enabling a natural language alignment paradigm—one where agents can receive goals via natural language instructions, engage in explicit system 2 reasoning about safety specifications, and have their reasoning processes monitored by other LLMs.

Within this paradigm, we believe chain-of-thought (CoT) monitorability is a key problem to focus on. There are three reasons why we think that this is a high priority for enhancing LLM agent safety:

Shifting the field’s focus to monitorability in agentic settings: We believe that the field of AI safety should focus less on strict CoT faithfulness criteria, such as full causal faithfulness, and more on the usefulness of LLMs’ reasoning traces for downstream applications where interpretability matters most, such as monitoring in high-stakes situations.

The inference-time compute paradigm: The recent shift to the inference-time compute paradigm has resulted in models that externalize more of their thinking, meaning that there is more to be gained from monitoring the CoT compared to only monitoring actions than before.

Reducing the safety tax of externalized CoT: In the current paradigm, simply reading a model’s reasoning chain goes a long way in understanding its cognition. It’s highly important to preserve this state of affairs. Recently, Meta published a paper introducing COCONUT, a proposal to train models to perform most of their reasoning in a latent manner. Similarly, DeepSeek has claimed to be looking for alternatives to the transformer architecture, and people outside of labs have published novel recurrent architectures. By finding ways to keep the safety tax of preserving legible reasoning chains low, we hope to convince labs that it’s worth sticking with the current paradigm.

We are not unique in focusing on monitorability: all major AI companies are thinking about this and external groups have also published exciting work. However, there are several important directions for improving monitorability and not all of them will be covered by default. Two relatively neglected directions that we are particularly excited about are improving metrics and benchmarks, and training models to have more faithful and legible reasoning traces.

For more details on our thinking in this direction, see:

Other directions that we’ve been thinking about revolve around getting a better understanding of goals and beliefs in LLMs, e.g., by investigating the limitations and generalization properties of modifying LLM beliefs with synthetic document finetuning and developing model organisms of LLM misalignment that arises through reflective goal-formation. We are open to feedback that might convince us to focus on these directions instead of monitorability.

Team

Our core team is currently working full-time in-person in London.

Rohan Subramani studied CS and Math at Columbia, where he helped run an Effective Altruism group and an AI alignment group. He has done AI safety research in LASR, CHAI, MATS, and independent groups. He is starting a PhD at the University of Toronto this fall with Prof. Zhijing Jin and continuing Aether work within the PhD.

Rauno Arike studied CS and Physics at TU Delft, where he co-founded an AI alignment university group. He has done MATS 6 with Marius Hobbhahn, contracted with UK AISI, and worked as a software engineer.

Shubhorup Biswas is a CS grad and a (former) software engineer, with experience across product and infra in startups and big tech. He did MATS 7.0 with Buck Shlegeris working on AI Control for sandbagging and other low stakes failures.

We are advised by Seth Herd (Astera Institute), Marius Hobbhahn (Apollo Research), Erik Jenner (Google DeepMind), and Francis Rhys Ward (LawZero).

We have about $200,000 in funding to cover salaries and expenses through Dec 1, 2025. Our funder wishes to remain anonymous for now.

LESSWRONG
LW

LESSWRONG
LW

24

Aether July 2025 Update

24

24

Get Involved!

Research

Team