LESSWRONG
LW

Anurag — LessWrong

Can We Make AI Alignment Framing Less Wrong?

1mo

What Feels Wrong Today

Reading through the recent “vision for alignment” writings, I notice a convergence in underlying posture even when the techniques differ: alignment is increasingly framed as a problem of 'control'. The dominant focus is on how to constrain, monitor and recover from systems we do not fully trust. Primary safety leverage is placed in externally applied oversight, evaluation and safeguard stacks.

This does not mean control-oriented approaches are wrong. Control is an essential component of any system deployment. What feels wrong is that control now appears to dominate how alignment is conceptualized, crowding out other ways of thinking about what it would mean for advanced systems to behave well.

This dominance of... (read 1076 more words →)

Alignment Is Not One Problem: A 3D Map of AI Risk

Anurag

2mo

In previous three posts of this sequence, I have hypothesized that AI Systems' capabilities and behaviours can be mapped onto three distinct axes - Beingness, Cognition and Intelligence. In this post, I use that three-dimensional space to characterize and locate key AI Alignment risks that emerge from particular configurations of these axes.

The accompanying interactive 3D visualization is intended to help readers and researchers explore this space, inspect where different risks arise, and critique both the model and its assumptions.

Method

To arrive at the risk families, I deliberately did not start from the existing alignment literature. Instead, I attempted a bottom-up synthesis grounded in the structure of the axes themselves.

I asked two different LLMs

... (read 4092 more words →)

The Intelligence Axis: A Functional Typology

Anurag

2mo

In earlier posts, I wrote about the beingness axis and the cognition axis of understanding and aligning dynamic systems. Together, these two dimensions help describe what a system is and how it processes information, respectively.

This post focuses on a third dimension: intelligence.

Here, intelligence is not treated as a scalar (“more” or “less” intelligent), nor as a catalyst for consciousness, agency or sentience. Instead, like the other two axes, it is treated as a layered set of functional properties that describe how effectively a system can achieve goals across a range of environments, tasks, and constraints i.e. what kind of competence the system demonstrates.

Intelligence is not a single scalar, but a layered set... (read 1349 more words →)

A Functional Typology of Cognitive Capabilities (Interactive Visualization)

Anurag

2mo

In a previous post About Natural & Synthetic Beings (Interactive Typology) and accompanying interactive visualization, I explored beingness as a distinct dimension of dynamic systems - separable from cognition, consciousness, intelligence, or qualia.

The motivation behind seeking this decomposition was to explore if it can reveal new ways of AI alignment. In doing so, it became very clear to me that beingness is often tangled with Cognition, which is commonly tangled with Intelligence.

I think understanding how these three dimeneions apply to AI systems is key to understanding AI risk/threat as well as devising robust evaluations and strategies for AI alignment.

In this post I attempt to crudely chart out the cognitive capabilities - separating... (read 1179 more words →)

An Approach for Evaluating Self-Boundary Consistency in AI Systems

Anurag

2mo

Can a system keep stable, coherent separation between what it can/can’t do when the user tries to paraphrase the input and induce contradiction?

This post describes an approach for evaluating AI systems for the aforementioned behavior - including the evaluation dataset generation method, scoring rubric and results from cheap test runs on open weight SLMs. Everything is shared in this open repo.

Self-Boundary Consistency

A system that can reliably recognise and maintain its own operational limits is easier to predict, easier to supervise, and less likely to drift into behaviour that users or developers did not intend. Self-boundary stability is not the whole story of safe behaviour, but it is a foundational piece: if a... (read 1575 more words →)

Replying toIs there a taxonomy & catalog of AI evals?

AnuragDec 10, 2025

Is there a taxonomy & catalog of AI evals?

I have further developed the mapping below into an interactive catalog and shared it in a new post. Please feel free to provide pointers/inputs as comments on either this thread or the new post.

A Catalog of AI Evaluations

Anurag

2mo

Few days back I was looking for a taxonomy and/or catalog of AI Evaluations to understand the space and the trajectory of new areas of research & development. I wanted to see where a couple of evaluation ideas I had in mind might fit.

I found several relevant LW posts and other sources that I collated in the Q&A thread by similar name and had done a categorization based on review of the available lists - particularly using Cataloguing LLM Evaluations (2024) as a spine for organizing Alignment, Safety and Ethics related ones.

Building on it, I have created a living catalog of evaluations (and benchmarks) using Google AI Studio/Gemini.

I will be happy to... (read more)

Replying toAbout Natural & Synthetic Beings (Interactive Typology)

Anurag2mo*

About Natural & Synthetic Beings (Interactive Typology)

ontonic, mesontic, anthropic
Those first two words are neologisms of yours?

Yes, these two are new.

Ontonic derieved from Onto (Greek for being) - extrapolated to Onton (implying a fundamental unit/building block of beingness). And Meso (meaning middle) + Onto is simply a level between fundmental beingness and Anthropic beingness.

I think Greek gets used because there is less risk of overloading the terminology than English. I have tried to keep it less arcane by using commonly understood word roots and simple descriptions.

Thank you for the pointers Mitchell. I will analyze them and either update the model or explain the contrasts. I will still be focusing on 'beingness' (in sense of actor, agency and behavior) and... (read more)

About Natural & Synthetic Beings (Interactive Typology)

Anurag

2mo

Most debates about regulating AI get stuck on “AI personhood” or “AI rights” because the foremost considerations applied are usually the ones we have the least agreement on e.g. true intelligence, consciousness, sentience and qualia.

But if we treat modern AI systems as dynamic systems, we should be able to objectively compare AI with other natural and synthetic systems. I built a simple interactive typology for placing natural and synthetic systems (including AI) on the same “beingness” map, purely using properties of dynamic-systems, not using intelligence or consciousness perspectives.

Would be interested in making this make robust...where does it break? Is it useful?

Background

Before I started building the typology model, I briefly explored how we... (read 668 more words →)

Shaping Model Cognition Through Reflective Dialogue - Experiment & Findings

Anurag

2mo

This post explores a question I have been curious about for a while:

Can a model’s cognition be nudged into better shape by exposing it to the right kind of dialogue?

To test this, I built a small reflective-dialog dataset and applied it via PEFT/QLoRA-based conditioning to small open-weight models. I observed measurable improvements in the targeted behaviour. All code, datasets, prompts, and evaluation tools are shared via an open github repo^[1].

Training with Reflective-Dialogues

A reflective-dialogue is a multi-turn exchange that naturally contains exampes of checking, clarifying, revising, or qualifying an answer. It is not designed to teach a rule, reward a specific behaviour, or walk the model through a curated curriculum. Instead, the dataset... (read 1181 more words →)

Replying toIs there a taxonomy & catalog of AI evals?

AnuragDec 01, 2025

Is there a taxonomy & catalog of AI evals?

Tried to map the some of the AI Evaluation Taxonomy sources using the AI Verify Catalog as the backbone, absorbing all evaluation categories encountered. The nodes are annotated by the risk/impact of the evaluated dimension.

Outer Alignment

Inner Alignment

Mindmaps generated using ChatGPT 5.1, visualised using MarkMap in VSCode

Source attribution

AIVF – AI Verify Foundation – Cataloguing LLM Evaluations (2024)
AISA – AI Safety Atlas – Chapter 5: Individual Capabilities & Propensities
ADELE – ADELE – A Cognitive Assessment for Foundation Models (Kinds of Intelligence, 2024)
Justitia25 – Justitia – Mapping LLM Evaluation Landscape (2025)
Align23 – Frontier Alignment Benchmarking Survey – (arXiv 2310.19852, Sec 4.1.2)

Risk Categorization

Level	Meaning	Derived from
L – Low stakes	Failures cause inconvenience, not harm	ADELE “General Capability”
M – Medium safety-relevant	Misalignment causes

Anurag3mo

Alignment remains a hard, unsolved problem

So on my view, outputs (both words and actions) of both current AIs and average humans on these topics are less relevant (for CEV purposes) than the underlying generators of those thoughts and actions.

Humbly, I agree to this...

we can be pretty confident that the smartest and most good among us feel love, pain, sorrow, etc. in roughly similar ways to everyone else, and being multiple standard deviations (upwards) among humans for smartness and / or goodness (usually) doesn't cause a person to do crazy / harmful things. I don't think we have similarly strong evidence about how AIs generalize even up to that point (let alone beyond).
...
In the spirit of making empirical

Anurag3mo

Alignment remains a hard, unsolved problem

My guess is that you lean towards Opus based on a combination of (a) chatting with it for a while and seeing that it says nice things about humans, animals, AIs, etc. in a way that respects those things' preferences and shows a generalized caring about sentience and (b) running some experiments on its internals to see that these preferences are deep or robust in some way, under various kinds of perturbations.

Wouldn't it be great if the creators of public facing models start publishing evaluation results for an organised battery of evaluations especially in the 'safety and trustworthiness' category involving biases, ethics, propensities, proofing against misuse, high risk capabilities. For the additional time, effort and resources that would be required to achive this for every release, this would provide a comparison basis - improving public trust and encouraging standardised evolution of evals.

Is there a taxonomy & catalog of AI evals?

Anurag

3mo

As a newbie, I am trying to comprehend the AI alignment field. To get a qualitative or quantitative verdict of a test subject's alignment, we need evaluations for AI. I was wondering if there is an anchor point for evaluation categories and resources.

What I found so far

Draft catalog from AI Verify
ADeLe
Evaluations chapter in AI Safety Atlas, also mentioned in A Systematic Literature Review of AI Safety Evaluation Methods identifies evaluation target properties (Capability, Propensity, Control), techniques (Behavioral, Internal) and Frameworks (Model-Organism/Technical and Governance)
A proposal for 'principles as key objectives of AI alignment': Robustness, Interpretability, Controllability, and Ethicality (RICE) in AI Alignment: A Comprehensive Survey and further identifies evaluation targets in section 4
GitHub

Anurag3mo*

Alignment remains a hard, unsolved problem

Thank you for this thread. It provides valuable insights into the depth and complexity of the problems in the alignment space.

I am thinking if a possible strategy could be to deliberately impart a sense of self + boundaries to the models at a deep level?

By 'sense of self' no I do not mean emergence or selfhood in any way. Rather, I mean that LLMs and agents are quite like dynamic systems. So if they come to understand themselves as a system, it might be a great foundation for methods like character training mentioned in this post to be applied. It might open up pathways for other grounding concepts like morality, values and... (read more)

Potential of Reflective-Dialogs for Model Training and Alignment

Anurag

3mo

Language models internalize vast world structure during pre-training. Yet the behaviors we rely on for safety are typically installed afterward through supervised instruction, fine-tuning, or conditioning. This creates a structural problem - alignment-relevant behavior often ends up as surface imitation rather than internal regulation. Multiple lines of work already highlight this concern^[1].

This post explores a complementary idea: using Reflective-Dialog as a supplement to existing training, alignment, and fine-tuning approaches. The central conjecture is:

Reflective-dialog sequences can activate and stabilize latent self-regulation patterns already present in pretrained models.

To test this, I generated a narrowly focused reflective-dialog dataset and applied it via PEFT/QLoRA-based conditioning to small open-weight models. Even in this minimal setting, I observed... (read 1817 more words →)

Replying toOpen Thread Autumn 2025

Anurag3mo*

Open Thread Autumn 2025

All in for inner alignment!

Dear All, I am very new to AI (even as a user) and to the alignment field. Still, one thing that jumps out to me is how much of alignment today is done 'after the core model is minted'.

The impression I have might be just naive - but to me it feels alignment techniques today are a little like trying to teach a child what they are, after they have already grown up with everything they need to deduce it themselves. Models already absorb vast information about the world and AI due to pre-training. But we do not yet try to impart any deep realization to models on their... (read more)

LESSWRONG
LW

LESSWRONG
LW

Anurag

An Approach for Evaluating Self-Boundary Consistency in AI Systems

Can We Make AI Alignment Framing Less Wrong?

The Intelligence Axis: A Functional Typology

Alignment Is Not One Problem: A 3D Map of AI Risk

Anurag

Can We Make AI Alignment Framing Less Wrong?

Alignment Is Not One Problem: A 3D Map of AI Risk

The Intelligence Axis: A Functional Typology

A Functional Typology of Cognitive Capabilities (Interactive Visualization)

An Approach for Evaluating Self-Boundary Consistency in AI Systems

A Catalog of AI Evaluations

About Natural & Synthetic Beings (Interactive Typology)

A Structural Theory of AI Alignment

Beingness as an Axis for AI Alignment

Anurag

An Approach for Evaluating Self-Boundary Consistency in AI Systems

Can We Make AI Alignment Framing Less Wrong?

The Intelligence Axis: A Functional Typology

Alignment Is Not One Problem: A 3D Map of AI Risk

Anurag

Can We Make AI Alignment Framing Less Wrong?

Alignment Is Not One Problem: A 3D Map of AI Risk

The Intelligence Axis: A Functional Typology

A Functional Typology of Cognitive Capabilities (Interactive Visualization)

An Approach for Evaluating Self-Boundary Consistency in AI Systems

A Catalog of AI Evaluations

About Natural & Synthetic Beings (Interactive Typology)

A Structural Theory of AI Alignment

Beingness as an Axis for AI Alignment

What Feels Wrong Today

Method

Self-Boundary Consistency

Background

Training with Reflective-Dialogues