Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(This post is a bit of a thought dump, but I hope it could be an interesting prompt to think about.)

For some types of problems, we can trust a proposed solution without trusting the method that generated it. For example, a mathematical proof can be independently verified, which means that we can trust the proof without having to trust the mathematician who came up with it. Not all problems are like this. For example, in order to trust that a chess move is correct, we must either trust the player who came up with the move (in terms of both their ability to play chess and their motivation to make good suggestions), or we must be good at chess ourselves. This is similar to the distinction between NP (or perhaps more generally IP/PSPACE) and larger complexity classes (EXP, etc.).
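The asymmetry can be made concrete with a toy example (integer factorization; not from the original post): finding the factors of a number may be hard, but checking a claimed factorization is a single multiplication, so we can accept the answer from an entirely untrusted source:

```python
# Toy illustration: verifying a solution can be far cheaper than finding one.
# Factoring n is hard in general, but checking a claimed factorization is trivial.

def verify_factorization(n: int, p: int, q: int) -> bool:
    """Trust the certificate (p, q) without trusting whoever produced it."""
    return p > 1 and q > 1 and p * q == n

# A correct answer from an untrusted solver is accepted:
print(verify_factorization(221, 13, 17))  # True
# An incorrect answer is rejected, no matter who proposed it:
print(verify_factorization(221, 11, 20))  # False
```

Chess, by contrast, has no known short certificate that a move is best, which is what forces us to trust the move's source instead.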

One of the things that makes AI safety hard is that we want to use AI systems to solve problems whose solutions we are unable (or at least unwilling) to verify. For example, automation isn't very useful if all parts of the process must be constantly monitored. More generally, we also want to use AI systems to get superhuman performance in domains where it is difficult to verify the correctness of an output (such as economic activity, engineering, politics, and so on). This means that we need to trust the mechanism which produces the output (i.e., the AI itself), and this is hard.

In order to trust the output of a large neural network, we must either verify its output independently, or we must trust the network itself. In order to trust the network itself, we must either verify the network independently, or we must trust the process that generated the network (i.e., training with SGD). This suggests that there are three ways to ensure that an AI-generated solution is correct: manually verify the solution (and only use the AI for problems where this is possible), find ways to trust the AI model (through interpretability, red teaming, formal verification, etc.), or find ways to trust the training process (through the science of deep learning, reward learning, data augmentation, etc.).

[SGD] -> [neural network] -> [output]

I think there is a fourth way that may work: use an (uninterpretable) AI system to generate an interpretable AI system, and then let *this* system generate the output. For example, instead of having a neural network generate a chess move, it could instead generate an interpretable computer program that generates a chess move. We can then trust the chess move if we trust the program generated by the neural network, even if we don't trust the neural network, and even if we are unable to verify the chess move.

[SGD] -> [neural network] -> [interpretable computer program] -> [output]

To make this more concrete, suppose we want an LLM to give medical advice. In that case, we want its advice to be truthful and unbiased. For example, it should not be possible to prompt it into recommending homeopathy, etc. If we simply fine-tune the LLM with RLHF and red-teaming, then we can be reasonably sure that it probably won't recommend homeopathy. However, it is difficult to be *very* sure, because we can't try all inputs, and we can't understand what all the tensors are doing.

An alternative strategy is to use the LLM to generate an interpretable, symbolic expert system, and then let this expert system provide medical advice. Such a system might be easy to understand, and interpretable by default. For example, we might be able to definitively verify that there is no input on which it would recommend homeopathy. In that case, we could end up with a system whose outputs we trust, even if we don't verify the outputs, and even if we don't necessarily trust the neural network that we used to generate the program.
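As a minimal sketch of what such a system could look like (the rules and names here are hypothetical illustrations, not a real medical system), the expert system could be a transparent list of condition-to-advice rules. Because the rules are data, a property like "never recommends homeopathy" can be verified by enumerating every possible output, rather than by sampling inputs:

```python
# Minimal sketch of a transparent rule-based advisor. The rules are plain data,
# so properties like "never recommends homeopathy" can be checked statically,
# over ALL possible outputs, rather than by testing a sample of inputs.

RULES = [
    ({"fever", "stiff_neck"}, "Seek emergency care immediately."),
    ({"fever"},               "Rest, hydrate, and monitor your temperature."),
    ({"cough"},               "If the cough persists beyond 3 weeks, see a doctor."),
]
DEFAULT = "Please consult a medical professional."

def advise(symptoms: set[str]) -> str:
    # First matching rule wins; rule order encodes priority.
    for required, advice in RULES:
        if required <= symptoms:
            return advice
    return DEFAULT

def never_mentions(banned: str) -> bool:
    """Verify a property of every possible output by enumerating the rules."""
    outputs = [advice for _, advice in RULES] + [DEFAULT]
    return all(banned not in advice.lower() for advice in outputs)

print(advise({"fever", "stiff_neck"}))  # Seek emergency care immediately.
print(never_mentions("homeopathy"))     # True
```

The point of the sketch is the verification step: `never_mentions` quantifies over the program's entire output space, something we cannot do for a neural network's weights.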

(Note that we are pretty close to being able to do things like this in practice. In fact, I am pretty sure that GPT-4 already would be able to generate a decent medical expert system, with a little bit of direction.)

Can this strategy always be used? Is it even possible to generate an interpretable, verifiable AI program that could do the job of a CEO, or would any such program necessarily have to be uninterpretable? I don't know the answer to that question. However, if the answer is "no", then mechanistic interpretability will also necessarily not scale to a neural network that can do the job of a CEO. Stated differently, if (strong) interpretability is possible, then there exist interpretable computer programs for all important tasks that we might want to use AI for. If this is the case, then we could (at least in principle) get a neural network to generate such an AI system for us, even if the neural network isn't interpretable by itself.

Another issue is, of course, that our LLM might be unable to write a program for all tasks that it could otherwise have performed itself (similar to how we, as humans, cannot create computer programs which do all tasks that we can do ourselves). Whether or not that is true, and to what extent it will continue to be true as LLMs (and similar systems) are scaled up, is an empirical question.

9 comments

As I see it, there are two fundamental problems here:

1.) Generating interpretable expert-system code for an AGI is probably already AGI-complete. It seems unlikely that a non-AGI DL model can output code for an AGI -- especially given that it is highly unlikely that there would be expert-system AGIs in its training set, or even things close to expert-system AGIs, if deep learning keeps far outpacing GOFAI techniques.

2.) Building an interpretable expert-system AGI is likely not just AGI-complete but a fundamentally much harder problem than building a DL AGI system. Intelligence is extremely detailed, messy, and highly heuristic. All our examples of intelligent behaviour come from large blobs of optimized compute -- both brains and DL systems -- and none from expert systems. The actual inner workings of intelligence might just be fundamentally uninterpretable in their complexity except at a high level -- i.e. 'this optimized blob is the output of approximate bayesian inference over this extremely massive dataset'.

  1. This is obviously true; any AI-complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn't really a problem for the proposal here. The point isn't that we could avoid making AGI by doing this; the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.
  2. This is probably true, but the extent to which it is true is unclear. Moreover, if the inner workings of intelligence are fundamentally uninterpretable, then strong interpretability must also fail. I already commented on this in the last two paragraphs of the top-level post.

> This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn't really a problem for the proposal here. The point isn't that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.

Maybe I'm being silly, but then I don't understand the safety properties of this approach. If we need an AGI based on uninterpretable DL to build this, then how do we first check whether this AGI is safe?

The point is that you (in theory) don't need to know whether or not the uninterpretable AGI is safe, if you are able to independently verify its output (similarly to how I can trust a mathematical proof, without trusting the mathematician).

Of course, in practice, the uninterpretable AGI presumably needs to be reasonably aligned for this to work. You must at the very least be able to motivate it to write code for you without it hiding any trojans or backdoors that you are unable to detect.

However, I think that this is likely to be much easier than solving the full alignment problem for sovereign agents. Writing software is a myopic task that can be accomplished without persistent, agentic preferences, which means that the base system could be much more tool-like than the system which it produces.

But regardless of that point, many arguments for why interpretability research will be helpful also apply to the strategy I outline above.  

Do you have interesting tasks in mind where expert systems are stronger and more robust than a 1B model trained from scratch with GPT-4 demos and where it's actually hard (>1 day of human work) to build an expert system?

I would guess that it isn't the case: interesting hard tasks have many edge cases which would make expert systems break. Transparency would enable you to understand the failures when they happen, but I don't think that a stack of ad-hoc rules layered on top of each other would be more robust than a model trained from scratch to solve the task. (The tasks I have in mind are sentiment classification and paraphrasing. I don't have enough medical knowledge to imagine what the expert system would look like for medical diagnosis.) Or maybe you have in mind a particular way of writing expert systems which ensures that the stack of ad-hoc rules doesn't interact in weird ways that produce unexpected results?

No, I don't have any explicit examples of that. However, I don't think that the main issue with GOFAI systems necessarily is that they have bad performance. Rather, I think the main problem is that they are very difficult and laborious to create. Consider, for example, IBM Watson. I consider this system to be very impressive. However, it took a large team of experts four years of intense engineering to create Watson, whereas you probably could get similar performance in an afternoon by simply fine-tuning GPT-2. However, this is less of a problem if you can use a fleet of LLM software engineers and have them spend 1,000 subjective years on the problem over the course of a weekend.

I also want to note that:
1. Some trade-off between performance and transparency is acceptable, as long as it is not too large. 
2. The system doesn't have to be an expert system: the important thing is just that it's transparent.
3. If it is impossible to create interpretable software for solving a particular task, then strong interpretability must also fail.


To clarify, the proposal is not (necessarily) to use an LLM to create an interpretable AI system that is isomorphic to the LLM -- their internal structure could be completely different. The key points are that the generated program is interpretable and trustworthy, and that it can solve some problem we are interested in. 

I mostly second Beren's reservations, but given that current models can already improve sorting algorithms in ways that didn't occur to humans (ref), I think it's plausible that they will prove useful in generating algorithms for automating interpretability and the like. E.g., some elaboration on ACDC, or ROME, or MEMIT.

Note that this proposal is not about automating interpretability.