Induction vs Deduction

To be filled out with more detail

2mo

The previous two posts emphasized some problematic scenarios for mech-interp. We use mech-interp as our example of a more general problem in AI safety. In this post we zoom out to that more general problem, before proposing our solution.

We can characterize the more general problem, inherent in the causal–mechanistic paradigm, in terms of substrate-dependent and substrate-independent approaches to alignment.

As we describe in our threat model, the results and insights obtained under the causal–mechanistic paradigm are closely tied to a particular substrate. They may therefore fail to generalize to new substrates, and any downstream safety assurances may be weakened.

The problem of generalizing beyond any one particular substrate—be that model, architecture, or paradigm—has already been noted. The... (read 2507 more words →)

Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 2: Conflict

mfatt

2mo

The previous post highlighted some salient problems for the causal–mechanistic paradigm we sketched out. Here, we'll expand on this with some plausible future scenarios that further weaken the paradigm's reliability in safety applications.

We first briefly refine our critique and outline the scenario progression.

Outline

We contend that the causal–mechanistic paradigm in AI safety research makes two implicit assertions:^[1]

Fixity of structure: That the structural properties^[2] of AI systems will, as AI capabilities increase, remain stable enough that the techniques researchers use to identify those structural properties remain relevant.
Reliability of extrapolation: That those structural properties can be reliably used to make safety assertions about AI systems.

If these assertions hold, we will be able to reliably uncover structural... (read 2569 more words →)

Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 1: Exposition

mfatt, Aditya

2mo

Mechanistic Interpretability

Many of you will be familiar with the following section. Please skip through to the next.

The field of mechanistic interpretability (MI) is not a single, monolithic research program but rather a rapidly evolving collection of methods, tools, and research programs. These are united by the shared ambition of reverse-engineering neural network (NN) computations and, though lacking a comprehensive uniform methodology, typically apply tools of causal analysis to understand a model from the bottom up.

MI research centres around a set of postulates. One central postulate is that NN representations can in principle be decomposed into interpretable "features"—fundamental units that "cannot be further disentangled into smaller, distinct factors"—and that these are often encoded... (read 1420 more words →)

Management of Substrate-Sensitive AI Capabilities (MoSSAIC) Part 0: Overture

mfatt

3mo

Tunings/Preamble

Notes on the programme

What follows is an overview of the incoming sequence.

It is an edited version of the research I have been developing over the past year or so. Most recent version here.

The initial position paper (co-authors Chris Pang, Sahil K) was presented at the Tokyo AI Safety Conference 2025. The developed position paper (co-authors Aditya Prasad, Aditya Adiga, Jayson Amati, Sahil K) has just been accepted for the Proceedings of Iliad and is now part of BlueDot's new Technical AI Safety Curriculum (Week 4: Understanding AI).

This research will eventually and fortuitously emphasize the need for Sahil's Live Theory Agenda as a solution. This is no accident. These ideas were initially his... (read 1191 more words →)

Replying to Why I'm Moving from Mechanistic to Prosaic Interpretability

mfatt

1y

Here's a speculative correspondence that I hope might eventually have some applicability somewhere.

I view some of this as reflecting the difference between criminology and crime science. From what I understand, criminology focuses more on the individual, drawing on sociology, psychology, etc.; it seems inherently internal, looking for deeper answers to why people commit crime. Crime science instead takes a more practical, "surface-level" approach, using a combination of empirical techniques to predict and prevent crime.

To my mind mech-interp seems more the former, and prosaic interp more the latter.

Here's a reference in case anyone's interested. Curious to hear if anyone has thought about this before or has any takes.

What's so special about likelihoods?

mfatt

This is a coin.

It might be biased.

This is Bayes’ theorem.

$P (A | B) = \frac{P (B | A) \times P (A)}{P (B)}$

Bayes’ theorem tells us how we ought to update our beliefs given evidence.

It involves the following components:

$P (A | B)$, called the posterior, is the probability of A given B. In the case of the coin this is the probability that the coin is biased given the result of an experiment (i.e., a sequence of flips).
$P (B | A)$, called the likelihood, is the probability of B given A. For our coin example this would be the probability of some particular ratio of heads to tails, given that the coin is biased.
$P (A)$ is called the prior. This is the probability that the coin is biased, before we consider any

... (read 201 more words →)
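As a rough illustration of how these components combine, here is a minimal Python sketch of the coin example. It is not taken from the post. It assumes, purely for illustration, that "biased" means one specific alternative (heads with probability 0.8), that the only other hypothesis is a fair coin, and that the prior on bias is 0.5. The function names and all numeric values are hypothetical. The denominator $P (B)$ is computed by summing likelihood times prior over both hypotheses.

```python
from math import comb


def binomial_likelihood(heads: int, flips: int, p_heads: float) -> float:
    """P(B | hypothesis): probability of `heads` heads in `flips` flips."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads) ** (flips - heads)


def posterior_biased(
    heads: int,
    flips: int,
    prior_biased: float = 0.5,  # P(A): assumed prior, not from the post
    p_biased: float = 0.8,      # assumed heads probability if the coin is biased
    p_fair: float = 0.5,        # heads probability if the coin is fair
) -> float:
    """P(A | B): probability the coin is biased given the observed flips."""
    like_biased = binomial_likelihood(heads, flips, p_biased)  # P(B | A)
    like_fair = binomial_likelihood(heads, flips, p_fair)      # P(B | not A)
    # P(B): total probability of the data under both hypotheses.
    evidence = like_biased * prior_biased + like_fair * (1 - prior_biased)
    return like_biased * prior_biased / evidence


if __name__ == "__main__":
    # 8 heads in 10 flips moves belief towards "biased";
    # 5 heads in 10 flips moves it towards "fair".
    print(posterior_biased(heads=8, flips=10))
    print(posterior_biased(heads=5, flips=10))
```

Under these assumptions the prior matters only through the product with the likelihood: changing `prior_biased` rescales both terms in the evidence, while the data enter solely via the two likelihood values.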

Replying to Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

mfatt

2y

Just highlighting an overlap between the ideas expressed here and a stream that has recently been added to the MATS Summer 2024 Program.

This is not a direct extension of the work but something that shares some of the intuitions and might help to formalise the ideas expressed in the post.

The grant proposal for the work is here. The proposal was submitted to Manifund (see here), where it was noticed by Ryan Kidd and subsequently added to the MATS program instead of receiving direct funding.

Do have a read and/or reach out if you're interested!

LESSWRONG