This is not a "serious" model, nor do I think it is revolutionary in any way. I am an AI safety layperson, a run-of-the-mill software developer at Big Tech Firm™ who is aware of the general shape of AI safety issues, but not particularly read-up on the literature. My hope here is to refine my thinking and to potentially provide something that helps other laypeople think more clearly about current-paradigm AI.
I work with LLM "agents" a lot these days. I read IABIED the week it came out, and given my first-hand experience with LLMs, it fired off some thoughts about why I see the failure modes I do and how that relates to the wider AI safety discussion.
I work in testing, helping a normal software organization that doesn't train LLMs try to integrate and use them, so I see AIs exhibiting unaligned behavior writ small. Things like a coding agent that swears up and down that it fixed the bug you pointed out while not actually changing the relevant code at all. Or a chatbot that spins its wheels really hard after a dozen prompts. Or the ever-classic: I get a near-perfect response from the agent I'm developing, then notice that one of its tools is actually broken. So I fix it, and with the much better context, the agent does much worse.
While vexing, this actually rhymes a little with the software validation I worked on before LLMs were involved. Large systems with many interdependent parts become hard to predict and are often nondeterministic. Apparent fixes may actually break something somewhere else. And validating such a system is hard: you can't model its internals accurately enough, so you must rely on black-box testing. For a black box, you can't actually know that it's correct without testing every possible behavior, which of course is not practical, so you focus testing on the small subset of use-cases you expect matter most to users. Then you back-propagate the failure signals (i.e. diagnose and fix the issues).
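To make that concrete, here's a toy Python sketch of what black-box validation looks like in practice. Everything in it is hypothetical (the `system_under_test`, the test cases, the whole workflow); it just illustrates the idea of sampling a small, curated slice of the input space and feeding the failures back to whoever can fix them.

```python
# Toy illustration of black-box validation (all names are hypothetical).
# We can't inspect the system's internals, so we probe it with a small,
# curated set of use-cases and report failures as signals to diagnose and fix.

def system_under_test(request: str) -> str:
    """Stand-in for a large, opaque system we can only observe from the outside."""
    ...

# Exhaustive testing would mean enumerating every possible request -- impractical.
# Instead we curate the cases we believe matter most to users.
curated_cases = [
    ("login with valid credentials", "session created"),
    ("login with expired password", "password reset prompt"),
    ("checkout with empty cart", "helpful error message"),
]

failures = []
for request, expected in curated_cases:
    actual = system_under_test(request)
    if actual != expected:
        failures.append((request, expected, actual))

# "Back-propagating" the failure signal here just means diagnosing and fixing:
for request, expected, actual in failures:
    print(f"FAIL: {request!r} -> {actual!r}, expected {expected!r}")
```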
In my understanding, this is basically how the predominant training techniques work as well: they treat the model as a black box, use curated data to eval its performance, and back-propagate those signals until the model works well across its training distribution.
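For readers who haven't seen it, the loop I'm gesturing at looks roughly like this. It's a deliberately miniature, PyTorch-flavored sketch of supervised training with a tiny stand-in model and random data, not anything a frontier lab actually runs.

```python
# Minimal sketch of supervised training: curated data in, loss against
# human-chosen targets, gradients back-propagated into the weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))  # stand-in "black box"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Curated data: inputs paired with the behavior humans decided is "correct".
inputs = torch.randn(256, 16)
targets = torch.randint(0, 4, (256,))

for epoch in range(100):
    logits = model(inputs)           # treat the model as a black box: feed input, observe output
    loss = loss_fn(logits, targets)  # score the output against the curated expectations
    optimizer.zero_grad()
    loss.backward()                  # back-propagate the failure signal into the weights
    optimizer.step()

# The result: a model that works well across its training distribution --
# with no direct guarantee about anything outside it.
```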
My layperson's understanding of the MIRI position, from my testing angle, is this: since AI models are black boxes[1], we can't assume that they will be aligned (i.e. exhibit "correct" behavior), and we should assume that they won't be when operating outside of their training distribution (OOD), due to the extremely low probability of accidental alignment in such a large possibility space. We would likely be safe if the eval data were in fact complete for all possible use-cases, including superintelligent actions that we can't yet understand. That is of course not possible, so under the current training paradigm, we're f**ked.
But here's the thing I've been struggling to get my head around: when I observe OOD behavior, it by and large doesn't look like misalignment. It looks like stupidity. I think this is the reason people think the current paradigm won't get us to AGI. I realized I didn't have a good definition of AGI for myself, or an explanation of why current AIs fail the way they do, so I pondered both and arrived at what I present here. It's a layperson's model of why AI's current "stupidity" doesn't imply that the current paradigm can't get to AGI, why intelligence is not tied to alignment (i.e. we won't get alignment by default), and why the intelligence explosion is precisely the point where the two diverge.
I think the thing people have in mind with AGI is something that can "draw the owl" for any task it's given. Most of those tasks will not come with clear instructions or appear in its training data, but it should still be able to figure them out and succeed. But my understanding (and the fundamental assumption underpinning my model here) is that an AI model is only as good as its training. AIs have shown competence outside their training, but typically only on tasks at least adjacent to what they have been trained on. I expect an AI's performance to look roughly like this:
For tasks that are similar to its training, the AI performs well; it is "smart". The less similar a task is, the worse the AI performs. Far enough away, and it acts "dumb".
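Here's a toy way to see that curve, assuming you'll forgive a very loose analogy: fit a simple model on a narrow slice of a made-up "world" and watch its error grow as you test it farther from the training region. The `world` function and the distances are arbitrary choices of mine, not measurements of any real model.

```python
# Toy illustration: performance degrades with distance from the training data.
import numpy as np

rng = np.random.default_rng(0)

def world(x):
    """The 'true' behavior the model is trying to learn (hypothetical)."""
    return np.sin(x)

# Training data covers only a narrow region of the "world".
x_train = rng.uniform(-1.0, 1.0, size=200)
coeffs = np.polyfit(x_train, world(x_train), deg=5)

# Evaluate at increasing distance from the training region.
for center in [0.0, 1.0, 2.0, 4.0, 8.0]:
    x_test = rng.uniform(center - 0.5, center + 0.5, size=200)
    error = np.mean(np.abs(np.polyval(coeffs, x_test) - world(x_test)))
    print(f"distance ~{center:>4}: mean error {error:.3f}")

# In-distribution the fit looks "smart"; far enough out, the same model looks "dumb".
```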
Honestly, this just makes intuitive sense: consider how good you would be at a task for which you have little instruction or feedback. I don't think AIs are any different; just being smart enough doesn't give you the ability to magically understand the world through pure thought. So I do not expect the performance drop with distance from data to go away with AGI or even ASI. No matter how much compute an AI has access to, its power in the external world is still limited by data about that world, and the accessible universe is finite.
So if an AGI is still only as good as its training data, how can it get the kind of universal competence needed to consistently "draw the owl"? The same way humans do it: self-learning. It will go get its own training data.
Another way of saying this is that "general intelligence" isn't being "smart"; it's the ability to become "smart", i.e. good at any given possible task.
Becoming smart... that sounds like self-improvement, no? So the AI improves itself and we get an intelligence explosion, etc. etc. But since I'm talking about data limits, I think one major driver of the intelligence explosion will be a massive increase in available data. The AI goes from having only the training data that humans have created for it to going out and finding its own. It gets smarter by integrating more and more information about the world into itself (or into successor AIs, or into narrow AIs). Once it can do this fast enough and well enough to execute "draw the owl" tasks, we will perceive it as AGI.
If all tasks are effectively in-distribution, why is this dangerous? Because of what data is available to train on once humans are no longer directing the training, and what data isn't.
Humans lose control of the process because we no longer supervise the training. The data currently used to train LLMs (at least after pre-training) is heavily curated, and that curation implicitly (or explicitly) encodes human values. Thus the AI is aligned by default, in the sense that it's minimizing its loss against those values. But once we get to AGI, the machine is "figuring it out" with its own training data, derived from natural sources rather than from humans directly. And so it stops optimizing for human values, since they are no longer targeted by the training data.
It may actually develop sophisticated models of human values, but these will not be its goals; they will merely be instrumental to its goals. The goals themselves will be instrumentally biased extensions, into task space, of the values in its original human-supervised training data. Its performance on those values in these more complex tasks was never evaluated or incorporated into the model, so we should expect these extensions to be roughly as accurate as a scientific theory that was never validated against evidence: about as close to our actual desires as Galen's four humors are to germ theory. This is why I'm in the [Orthogonality](https://www.lesswrong.com/w/orthogonality-thesis) camp.
The most useful and impactful tasks are far away from the original training, so we should expect task competence but values incompetence on them. What we should expect is a blend of instrumental values and whatever bizarre, laughable misunderstandings of the original training values got extrapolated out that far. By the time we're dealing with ASI, not only are we far from a values-grounded AI, but the power balance has also shifted: it is actually we who are misaligned to its values, and as such we are inconvenient to its goals. There is a simple, final solution to this misalignment, and the ASI will almost certainly pursue it.
So where is this model wrong? TBH, I don't know, but I'd love to hear. I don't have a strong math background and haven't spent many hours understanding how LLMs work at a deep level or reading safety papers. That's where you come in. I expect some responses will contain sophisticated math and technical jargon that are outside of my layman's training distribution, but don't let that stop you. Please pick away.
One place where I don't think I'm wrong is in failing to account for interpretability research. True, if the AI isn't actually a black box, my model might break down, but I really, really doubt it. For the complex software systems I test for a living, we already have that interpretability: we have full visibility into every line of code in the system. That gives us more tools for testing (e.g. unit tests, static analysis), but it hardly solves the problem. While you can write tests and prove a relatively small function correct, that becomes computationally intractable for complex systems. And I know of no method for testing features that haven't been implemented, designed, or even conceived yet, which would be the closer analogue to validating AGI alignment before it's here. For an interpretability-based approach to solve alignment, it would need to solve or sidestep those problems in addition to exposing the LLM's internals. Note that this has not been achieved for software validation despite billions of dollars and millions of engineer-hours spent on it over the past few decades.
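To illustrate the gap I mean: you can exhaustively (or nearly exhaustively) check a tiny function, but there's no analogous move for the system it lives inside. The function below is a hypothetical, minimal example of mine, not anything from a real codebase.

```python
# You can brute-force confidence in a small, isolated function...
def clamp_byte(x: int) -> int:
    return max(0, min(255, x))

assert all(0 <= clamp_byte(x) <= 255 for x in range(-1000, 1000))  # feasible

# ...but you can't do the same for the system that function lives inside.
# A pipeline of even a few interacting components has an input space far too
# large to enumerate, so you're back to sampling a handful of use-cases --
# which is the black-box problem again, full source visibility or not.
```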
Looking forward to hearing people's takes and comments!
I interpret the IABIED authors' "anything like current techniques" qualification to refer to black-box models, though probably not exclusively. ↩︎