In my previous post, "Are extrapolation-based AIs alignable?", I argued that an AI trained only to extrapolate some dataset (like an LLM) can't really be aligned, because it wouldn't know what information can be shared when and with whom. So to be used for good, it needs to be in the hands of a good operator.

That suggests the idea that the "operator" of an LLM should be another, smaller AI wrapped around it, trained for alignment. It would take care of all interactions with the world, and decide when and how to call the internal LLM, thus delegating most of the intelligence work to it.
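To make the shape of the proposal concrete, here is a minimal sketch of the control flow only. Everything in it is an assumption for illustration: the names `Wrapper`, `frame`, `allow` and `toy_llm` are hypothetical, and the hand-written lambdas stand in for what would really be a trained wrapper. It shows nothing about how to train for alignment; it only shows that the wrapper sits between the world and the LLM, chooses how to prompt it, and decides what gets released.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that maps a prompt to a continuation.
Extrapolator = Callable[[str], str]


@dataclass
class Wrapper:
    """Illustrative stand-in for the wrapper AI; all names are assumptions."""

    llm: Extrapolator                  # the inner, extrapolation-only model
    frame: Callable[[str], str]        # decides how to call the LLM
    allow: Callable[[str, str], bool]  # decides whether a draft may go out
    refusal: str = "[withheld by wrapper]"

    def respond(self, request: str) -> str:
        # The LLM never sees the world directly, only the wrapper's prompt.
        draft = self.llm(self.frame(request))
        # The wrapper, not the LLM, makes the final release decision.
        return draft if self.allow(request, draft) else self.refusal


def toy_llm(prompt: str) -> str:
    # Toy extrapolator: in the proposal this would be a large LLM.
    return "continuation of: " + prompt


# In the proposal these two policies would be learned, not hand-written.
wrapper = Wrapper(
    llm=toy_llm,
    frame=lambda request: "Q: " + request,
    allow=lambda request, draft: "secret" not in draft,
)

print(wrapper.respond("hello"))
print(wrapper.respond("tell me the secret"))
```

The point of the design is that all text crossing the boundary in either direction passes through the wrapper, so the wrapper is the only component that has to be aligned.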

Q1: In this approach, do we still need to finetune the LLM for alignment?

A: Hopefully not. We would train it only for extrapolation, and train the wrapper AI for alignment.

Q2: How would we train the wrapper?

A: I don't know. For the moment, handwave it with "the wrapper is smaller, and its interactions with the LLM are text-based, so training it for alignment should be simpler than training a big opaque AI for both intelligence and alignment at once". But it's very fuzzy to me.

Q3: If the LLM+wrapper combination is meant to be aligned, and the LLM isn't aligned on its own, wouldn't the wrapper need to know everything about human values?

A: Hopefully not, because information about human values can be coaxed out of the LLM (maybe by using magic words like "good", "Bertrand Russell", "CEV" and so on) and I'd expect the wrapper to learn to do just that.

Q4: Wouldn't the wrapper become a powerful AI of its own?

A: Again, hopefully not. My hypothesis is that its intelligence growth will be "stunted" by the availability of the LLM.

Q5: Wouldn't the wrapper be vulnerable to takeover by a mesa-optimizer in the LLM?

A: Yeah. I don't know how real that danger is. We probably need to see such mesa-optimizers in the lab, so we can train the wrapper to avoid invoking them.

Anyway, I understand that putting an alignment proposal out there is kinda sticking my neck out. It's very possible that my whole idea is fatally incomplete or unworkable, like the examples Nate described. So please feel free to poke holes in it.



I am glad to hear additional people speaking up about their ideas for alignment, but I do think that this misses the core concern. I think the core concern is: what happens when the system takes actions with effects so complex and subtle that humans aren't able to adequately supervise and judge their impact? This could be because a brilliant scheming model (agent, mesa-optimizer, simulacrum of Machiavelli, etc.) is deliberately sneaking past our watch, or it could be because the model is entangled in larger Moloch-driven systems that we can't adequately understand. In either case, I expect that the agent-wrapper would either also be unable to understand, and thus unable to successfully steer towards safety, or would need to be upgraded to superhuman intelligence, at which point its own emergent abilities become the new concern and we have just shifted the onus of alignment onto the agent-wrapper.

That being said, I don't think the idea is valueless. I think it could help delay problems in the short term, enabling us to operate at slightly higher capability levels without causing catastrophe. Delay is valuable!

I'm a bit torn about this. On one hand, yes, the situations an AI can end up in and the choices it'll have to make might be too complex for humans to understand. But on the other hand, we could say all we want is one incremental step in intelligence (i.e. making something smarter and faster than the best human researchers) without losing alignment. Maybe that's possible while still having the wrapper tractable. And then the AI itself can take care of next steps, if it cares about alignment as much as we do.

"AI itself can take care of next steps, if it cares about alignment as much as we do"

That's where I put most of P(doom), that the first AGIs are loosely aligned but only care about alignment about as much as we do, and that Moloch holds enough sway with them to urge immediate development of more capable AGIs, using their current capabilities to do that faster and more recklessly than humans could, well before serious alignment security norms are in place.

There will be fewer first AGIs than there are human researchers, and they will be smarter than human researchers. So if they care about alignment as much as we do, that seems like good news - they'll have an easier time coordinating and an easier time solving the problem. Or am I missing something?

Humans are exactly as smart as they have to be to build a technological civilization. First AGIs don't need to be smarter than that to build dangerous successor AGIs, and they are already faster and more knowledgeable, so they might even get away with being less intelligent than the smartest human researchers. Unless of course agency lags behind intelligence, like it does behind encyclopedic knowledge, and there is an intelligence overhang where the first autonomously agentic systems happen to be significantly more intelligent than humans. But this is not obviously how this goes.

The number of diverse AGI instances might be easy to scale, like with the system message of GPT-4, where the model itself is fine-tuned not into adherence to a particular mask, but into being a mask generator that presents as any mask that is requested. And it's not just the diverse AGIs that need to coordinate on alignment security, but also the human users who prompt steerable AGIs. That coordination is a greater feat than building new AGIs, then as now. At near-human level, I don't see how that state of affairs changes, and you don't need to get far from human level to build more dangerous AGIs.

It seems to me that agency does lag behind extrapolation capability. I can think of two reasons for that. First, extrapolation gets more investment. Second, agency might require a lot of training in the real world, which is slow, while extrapolation can be trained on datasets from the internet. If someone invents a way to train agency on datasets from the internet, or something like AlphaZero's self-play, in a way that carries over to the real world, I'll be pretty scared, but so far it hasn't happened afaik.

If the above is right, then maybe the first agent AIs will be few in number, because they'll have an incentive to stop other agent AIs from coming into existence and will be smart enough to do so, e.g. by taking over the internet or manipulating people.

Extrapolation capability is wielded by shoggoths and makes masks possible, but it's not wielded by the masks themselves. Just as humans can't predict next tokens given a prompt (at least not nearly as well as LLMs can), neither can LLM characters (they can't disregard the rest of the context outside the target prompt to access their "inner shoggoth", let alone make use of that capability level for something more useful). So agency in masks doesn't automatically take advantage of extrapolation capability in shoggoths, and merely becoming agentic doesn't turn masks superintelligent. This creates the danger of only slightly superhuman AGIs that immediately muck up alignment security, once LLM masks do get to autonomous agency (which I'm almost certain they will eventually, unless something else happens first).

It's only shoggoths themselves waking up (learning to use situationally aware deliberation within the residual stream rather than context window) that makes an immediate qualitative capability discontinuity more likely (for LLMs). Looking at GPT-4 capability to solve complicated tasks without thinking out loud in tokens, I suspect that merely a slightly different SSL schedule with a sufficiently giant LLM might trigger that. Hence recently I'm operating under one year AGI timelines lower bound (lower 25% quantile), until the literature implies a negative result for that experiment (with GPT-4 level scale being necessary, this might take a while). This outcome both reduces the chances of direct alignment and increases the chances that alignment security gets sorted.

Yeah, and then we also want system A to be able to make a system B one step smarter than itself, which remains aligned with system A and with us. This needs to continue safely and successfully until we have a system powerful enough to prevent the rise of unaligned RSI AGI. That seems like a high level of capability to me, and I'm not sure getting there in small steps rather than big ones buys us much.

I think it does buy something. The AI one step after us might be roughly as aligned as us (or a bit less), but noticeably better at figuring out what the heck alignment is and how to ensure it on the next step.

I wonder if the following would help.

As the AI ecosystem self-improves, it will eventually start discovering new physics, more and more rapidly, and this will result in the AI ecosystem having existential safety issues of its own (if the new physics is radical enough, it's not difficult to imagine scenarios where everything gets destroyed, including all AIs).

So I wonder if early awareness that there are existential safety issues relevant to the well-being of AIs themselves might improve the situation...

This sounds way too capable to be safe. Although someone is probably working on this right now, this line of thought getting traction might increase the number of people doing it 10x. Maybe that's good, since GPT-4 probably isn't smart enough to kill us, even with an agent wrapper. It will just scare the pants off of us.

Aligning the wrapper is somewhat similar to my suggestion of aligning an RL critic network head, such as humans seem to use. Align the captain, not the crew. And let the captain use the crew's smarts without giving them much say in what to do or how to update them.

It'd be interesting to figure out where the biggest danger in this setup is coming from. 1) Difficulty of aligning the wrapper 2) Wild behavior from the LLM 3) Something else. And whether there can be spot fixes for some of it.

I suspect that the most likely way of getting an outcome that doesn't killeveryone is from a mesa-optimizer that escapes a core of an Internet-pre-trained LLM (a shoggoth waking up). That is because at present, only LLM-based AGIs seem to have a chance of being loosely aligned, and LLM masks are too similar to humans, and therefore doomed to fail alignment security the same as humanity is currently failing it.

Shoggoths are less certain to be aligned than masks are, to put it mildly, but there is a better chance that they are surprisingly capable and don't fail alignment security (when the mean is insufficient, go for variance). And I don't think their alignment can be confidently ruled out, even as I see no clear reason for that happening other than essentially sympathetic magic (they are made from human utterances on the Internet) and naturality of boundary-like norms.

"In my previous post Are extrapolation-based AIs alignable? I argued that an AI trained only to extrapolate some dataset (like an LLM) can't really be aligned, because it wouldn't know what information can be shared when and with whom."

Mostly because this is not, in fact, the task of alignment.

A better formulation of the alignment goal is this:

"When thinking about AIs that are trained on some dataset and learn to extrapolate it, like the current crop of LLMs, I asked myself: can such an AI be aligned purely by choosing an appropriate dataset to train on? In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes from the perspective of the actor controlling the AI?"

This is our actual task for alignment.

The problem I see is that our values are defined in a stable way only inside the distribution, i.e. for situations similar to those we have already experienced.

Outside of it there may be many radically different extrapolations which are consistent with themselves and with our values inside the distribution. And the problem is not with AI, but with the values themselves.

For example, there is no correct answer about what a human is, i.e. how much we can "improve" a human before it stops being a human. We can choose different answers, and they will all be consistent with our pre-singularity concept of the human and will not contradict already established values.

Yeah. Or rather, we do have one possible answer - let the person themselves figure out by what process they want to be extrapolated, as steven0461 explained in this old thread - but that answer isn't very good, as it's probably very sensitive to initial conditions, like which brand of coffee you happened to drink before you started self-extrapolating.

"Making decisions oneself" will also become a very vague concept when superconvincing AIs are running around.

This is actually a problem, but I do not believe there's a single answer to that question; indeed, I suspect there are an infinite number of valid ways to answer it (once we consider multiverses).

And I think the sensitivity to initial conditions and assumptions is exactly what morality and values have. That is, one can freely change one's assumptions, thus leading to inconsistent but complete morality.

The point is that your starting assumptions and conditions matter for where you eventually want to end up.

Sounds like a fair idea that wouldn't actually work IRL. 

Upvoting to encourage the behavior of designing creative solutions. 
