Barriers to Mechanistic Interpretability for AGI Safety

Connor Leahy


1 min read, 29th Aug 2023, 13 comments


Tags: Conjecture (org), Interpretability (ML & AI), AI


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a linkpost for https://www.youtube.com/watch?v=wKI9hmaIbpg

I gave a talk at MIT in March of this year on barriers to mechanistic interpretability (MI) being helpful for AGI/ASI safety, and on why by default it will likely be net dangerous. Several people seem to be coming to similar conclusions recently (e.g., this recent post).

I discuss two major points (by no means exhaustive), one technical and one political, that present barriers to MI addressing AGI risk:

1. AGI cognition is interactive. AGI systems interact with their environment, learn online, and will externalize massive parts of their cognition into the environment. If you want to reason about such a system, you also need a model of the environment. Worse still, AGI cognition is reflective, and you will also need a model of cognition/learning.
2. (Most) MI will lead to capabilities, not oversight. Institutions are not set up and do not have the incentives to resist using capabilities gains and submit to monitoring and control.

That said, there are more nuances to this opinion, and a lot of it is downstream of the lack of coordination and the downsides of publishing in an adversarial environment like the one we are in right now. I still endorse the work done by e.g. Chris Olah's team as brilliant, but extremely early, scientific work that has a lot of steep epistemological hurdles to overcome. Unfortunately, I also believe that, on net, work such as Olah's is at the moment more useful as a safety-washing tool for AGI labs like Anthropic than for actually making a dent in existential risk concerns.

Here are the slides from my talk, and you can find the video here.


13 comments

Gunnar_Zarncke (8mo)

"AGI systems interact with their environment, learn online and will externalize massive parts of their cognition into the environment."

This comment is not about interpretability but a generalization of the question.

What is the AGI system and what is the environment? Where does the AGI system draw the boundary when reasoning about itself?

For humans, there is a clearer agent-environment distinction because we have bodies with a relatively clear physical boundary (though some people might already see their body as part of the environment and only count their brain or even their mind, however delineated). For AGI systems it is less clear. Is it the running software, the computers, the whole compute center, or even the organization keeping the machines running?

Connor Leahy (8mo)

Yep, you see the problem! It's tempting to think of an AI as "just the model" and study that in isolation, but that just won't be good enough long term.

mesaoptimizer (8mo)

I see -- you are implying that an AI model will leverage external system parts to augment itself. For example, a neural network would use an external scratch-pad as a different form of memory for itself. Or instantiate a clone of itself to do a certain task for it. Or perhaps use some sort of scaffolding.

I think these concerns probably don't matter for an AGI, because I expect that data transfer latency would be a non-trivial blocker for storing data outside the model itself, and it is more efficient to self-modify and improve one's own intelligence than to use some form of 'factored cognition'. Perhaps these things are issues for an ostensibly boxed AGI, and if that is the case, then this makes a lot of sense.

Connor Leahy (8mo)

I strongly disagree and do not think that is how AGI will look; AGI isn't magic. But this is a crux, and I might be wrong of course.

Noosphere89 (8mo)

Yep, latency and performance are real killers for embodied-type cognition. I remember a tweet that suggested the entire Internet was not enough to train the model.

Gunnar_Zarncke (8mo)

It would be nice if the AGI saw the humans running its compute resources as part of its body that it wants to protect. The problem is that we humans also tamper with our bodies... Humans are like hair on the body of the AGI, and maybe it wants to shave and use a wig.

Carl Feynman (8mo)

Even worse: existing AI systems can call systems under the control of other companies, can write their own software and call it, or can be called by systems that are not themselves AI. How do you ensure they are safe under all permutations of such activities?

You could say “Well, don’t do that, then,” but that horse has left the barn.

DusanDNesic (8mo)

Even for humans: are my nails me? Once clipped, are they me? Is my phone me? I feel like my phone is more me than my hair, for example. Is my child me, are my memes me, is my country me, etc.? There are many reasons why agent boundaries are problematic, and that problem continues in AI Safety research.

Logan Riggs (8mo)

Wait, I don't understand this at all. For language models, the environment is the text. For different environments, those training datasets will be the environment.

Gunnar_Zarncke (8mo)

This is not primarily about LLMs, which are Simulators (see also Janus' Simulators), but about more general systems - AGIs.

Logan Riggs (8mo)

I meant to cover this in the “for different environments” part. For example, if we self-play on certain games, we’ll still have access to those games.

scasper (8mo)

"Several people seem to be coming to similar conclusions recently (e.g., this recent post)."

I'll add that I have as well and wrote a sequence about it :)

MiguelDev (8mo)

I agree with Connor and Charbel's post. The next step is to establish a new method for sharing results with safety-focused companies, groups, and independent researchers. This requires:

1. Developing a screening method for inclusion.
2. Tracking people within the network, which becomes challenging, especially if they are recruited by capabilities-focused companies.

Continuing this line of thought, we can't be 100% sure that such a network will consistently serve its intended purpose. So, if anyone has insights that could improve this idea, I'd like to hear them.
