I think you get the point, but say OpenAI "trains" GPT-5 and it turns out to be so dangerous that it can persuade anybody of anything and it wants to destroy the world.

We're already screwed, right?  Who cares if they decide not to release it to the public?  Or like they can't "RLHF" it now, right?  It's already existentially dangerous?

I guess maybe I just don't understand how it works.  So if they "train" GPT-5, does that mean they literally have no idea what it will say or be like until the day that the training is done?  And then they are like "Hey what's up?" and they find out?


A language model itself is just a description of a mathematical function that maps input sequences to output probability distributions on the next token.

Most of the danger comes from evaluating a model on particular inputs (usually multiple times using autoregressive sampling) and hooking up those outputs to actuators in the real world (e.g. access to the internet or human eyes).
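To make the "function plus sampling loop" picture concrete, here is a minimal toy sketch. Nothing in it comes from a real model: the four-token vocabulary, the hand-written probability table, and the names `next_token_distribution` and `sample` are all made up for illustration. The point is only that the "model" is a pure function from a token sequence to a probability distribution, and that nothing happens until some outer loop evaluates it and does something with the output.

```python
import random

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["<eos>", "hello", "world", "!"]


def next_token_distribution(context):
    """Toy stand-in for a language model.

    Maps a token sequence to a probability distribution over VOCAB.
    This one only looks at the last token; a real model conditions on
    the whole context, but the input/output shape is the same.
    """
    last = context[-1] if context else None
    table = {
        None: [0.0, 1.0, 0.0, 0.0],     # empty context -> "hello"
        "hello": [0.0, 0.0, 0.9, 0.1],  # mostly "world", sometimes "!"
        "world": [0.2, 0.0, 0.0, 0.8],  # mostly "!", sometimes stop
        "!": [1.0, 0.0, 0.0, 0.0],      # always stop
    }
    return table.get(last, [1.0, 0.0, 0.0, 0.0])


def sample(prompt, max_new_tokens=10, seed=0):
    """Autoregressive sampling: evaluate, pick a token, append, repeat."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    # The function alone does nothing; the loop and whatever consumes
    # its output (here, print) are where outputs reach the world.
    print(sample([]))
```

In this framing, the "actuators" are whatever reads the return value of `sample`: a terminal, a human reader, or a tool that acts on the text.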

A sufficiently capable model might be dangerous if evaluated on almost any input, even in very restrictive environments, e.g. during training when no human is even looking at the outputs directly. Such models might exhibit more exotic undesirable behavior like gradient hacking or exploiting side channels. But my sense is that almost everyone training current SoTA models thinks these kinds of failure modes are pretty unlikely, if they think about them at all.

You can also evaluate a partially-trained model at any point during training, by prompting it with a series of increasingly complex questions and sampling longer and longer outputs. My guess is big labs have standard protocols for this, but that they're mainly focused on measuring capabilities of the current training checkpoint, and not on treating a few tokens from a heavily-sandboxed model evaluation as potentially dangerous.

Perhaps at some point we'll need to start treating humans who evaluate SoTA language model checkpoint outputs as part of the sandbox border, and think about how they can be contained if they come into contact with an actually-dangerous model capable of superhuman manipulation or brain hacking.

I notice I am confused by this. It seems implausible that an LLM can execute a devious x-risk plan in a single forward pass based on a wrong prompt.

Yeah, I don't think current LLM architectures, with ~100s of attention layers or whatever, are actually capable of anything like this.

But note that the whole plan doesn't necessarily need to fit in a single forward pass - just enough of it to figure out what the immediate next action is. If you're inside of a pre-deployment sandbox (or don't have enough situational awareness to tell), the immediate next action of any plan (devious or not) probably looks pretty much like "just output a plausible probability distribution on the next token given the current context and don't waste any layers thinking about your longer-term plans (if any) at all".

A single forward pass in current architectures is probably analogous to a single human thought, and most human thoughts are not going to be dangerous or devious in isolation, even if they're part of a larger chain of thoughts or planning process that adds up to deviousness under the right circumstances.

yes, this is correct. i believe we should solve alignment before building AI, rather than after. (in fact, alignment should be fundamental to the design of the AI, not a patch you apply after the fact)

Is there any way to do so given our current paradigm of pretraining and fine-tuning foundation models?

It's not clear, because we don't know what the solution might look like...

But there are certainly ways to improve the odds. For example, one could pretrain on heavily curated data (no atrocities, no betrayals, etc, etc). Additionally, one can use curricula like we teach children, starting with "age-appropriate" texts first.

Then if we succeed in interpretability, we might be able to monitor and adjust what's going on.

Here the remark of "alignment being fundamental" might come into play: we might figure out ways to replace Transformers with an architecture which is much easier to interpret.

All these are likely to be positive things, although without truly knowing a solution it's difficult to be sure...

Pretraining on curated data seems like a simple idea. Are there any papers exploring this?

I've reviewed someone's draft which suggests this for AI safety (I hope it will be made public soon).

But I've heard rumors that people are trying this... And even from what Janus is saying in the comments/answers to my question https://www.lesswrong.com/posts/tbJdxJMAiehewGpq2/impressions-from-base-gpt-4, I am getting a rather strong suspicion that GPT-4 pretraining has been using some data curation.

From Janus' two comments there I am getting an impression of a non-RLHF'd system which, nevertheless, tends to be much stronger than usual in its convictions (or, the virtual characters it creates tend to be stronger than usual in their convictions about the nature of their current reality). There might be multiple reasons for that, but some degree of data curation might be one of them.

I think you get the point, but say OpenAI "trains" GPT-5 and it turns out to be so dangerous that it can persuade anybody of anything and it wants to destroy the world.


Most dangerous scenarios are not as straightforward as GPT-5 having a clear goal to destroy the world.  If that's what the model does, it's also fairly straightforward to just delete the model.

Should the title be rephrased?  The body of the post essentially addresses the scenario where OpenAI discovers (possibly via red-teaming) that their model is dangerous, and asks where they go from there.  Which is a good question; it seems plausible that some naively think they could just tweak the model somewhat and make it non-dangerous, which seems to me a recipe for disaster once people start jailbreaking it, or get a copy and start un-tweaking it.  (A probably-better response would be to (a) lock it the fuck down [consider deleting it entirely], (b) possibly try to obfuscate the process that was used to create it, to obstruct others who'd try to reproduce your results, and (c) announce the danger, either to the world or specifically to law-enforcement types who'd work on preventing all AI labs from progressing any further.)

But the title asks "How do you red-team it to see whether it's safe", for which the straightforward answer is "Have people try to demonstrate dangerous capabilities with it".

I'm a beginner but I don't think this is right.

Pretend the model "wants" to kill us all.  You can't red-team it, can you?  Like imagine a model that wants to kill us all and you send me to see if it's dangerous.  If I come back and say "Yup, it's dangerous!  Shut it down" then it wasn't so dangerous in the first place.  

If I come back and say "No worries everyone-- it's fine!" and then three weeks later the world ends, then I guess it was dangerous.

There are various things that could happen and it depends on your estimate of likelihoods.  If we have today's level of safety measures, and a new model comes along with world-ending capabilities that we can't detect, then, yeah, we don't detect them and the model goes out and ends the world.

It's possible, though, that before that happens, they'll create a model that has dangerous (possibly world-ending, possibly just highly destructive) capabilities (like knowing how to cross-breed smallpox with COVID, or how to hack into most internet-attached computers and parlay that into all sorts of mayhem) but that isn't good at concealing them, and they'll detect this, and announce it to the world, which would hopefully say a collective "Oh fuck" and use this result to justify imposing majorly increased security and safety mandates for further AI development.

That would put us in a better position.  Then maybe further iteration on that would convince the relevant people "You can't make it safe; abandon this stuff, use only the previous generation for the economic value it provides, and go and forcibly prevent everyone in the world from making a model this capable."

With LLMs as they are, I do suspect that dangerous cybersecurity capabilities are fairly near-term, and I don't think near-term models are likely to hide those capabilities (though it may take some effort to elicit them).  So some portions of the above seem likely to me; others are much more of a gamble.  I'd say some version of the above has a ... 10-30% chance of saving us?

Yes! I agree. I've been saying for a while that one possible outcome is that a medium-danger model comes out that isn't that dangerous itself but convinces everyone that the potential for huge danger exists, and we shut things down.

We just "hope" that we will first get something that is dangerous but cannot overpower everyone, only trick some people, and then the rest will stop it. In your scenario, yes, we are screwed. That's what this forum is about, isn't it ;)