Stuart_Armstrong

The future of alignment if LLMs are a bubble

We might be in a generative AI bubble. There are many potential signs of this:
* Business investment in generative AI has had very low returns.
* Expert opinion is turning against LLMs, including among some of the early LLM promoters (I also get this message from personal conversations with...

Dec 23, 2025 · 48

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Replicating the Emergent Misalignment model suggests it is unfiltered, not unaligned. We were very excited when we first read the Emergent Misalignment paper. It seemed perfect for AI alignment. If there was a single 'misalignment' feature within LLMs, then we can do a lot with it – we can use...

Mar 18, 2025 · 81

Using Prompt Evaluation to Combat Bio-Weapon Research

With many thanks to Sasha Frangulov for comments and editing. Before publishing their o1-preview model system card on Sep 12, 2024, OpenAI tested the model on various safety benchmarks which they had constructed. These included benchmarks which aimed to evaluate whether the model could help with the development of Chemical,...

Feb 19, 2025 · 11

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for preventing Best-of-N jailbreaking.

Abstract

> Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective against all major large language models (LLMs). We...

Jan 31, 2025 · 16

Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example

Many AI alignment problems are problems of goal misgeneralisation[1]. The goal that we've given the AI, through labelled data, proxies, demonstrations, or other means, is valid in its training environment. But then, when the AI goes out of the environment, the goals generalise dangerously in unintended ways. As I've shown...

Nov 21, 2023 · 67

How toy models of ontology changes can be misleading

Before solving complicated problems (such as reaching a decision with thousands of variables and complicated dependencies) it helps to focus on solving simpler problems first (such as utility problems with three clear choices), and then gradually building up. After all, "learn to walk before you run", and with any skill,...

Oct 21, 2023 · 42

Different views of alignment have different consequences for imperfect methods

Almost any powerful AI, with almost any goal, will doom humanity. Hence alignment is often seen as a constraint on AI power: we must direct the AI’s optimisation power in a very narrow direction. If the AI is weak, then imperfect methods of alignment might be sufficient. But as the...

Sep 28, 2023 · 33

The AI in a box boxes you

Using GPT-Eliezer against ChatGPT Jailbreaking

Assessing Kurzweil predictions about 2019: the results

Just another day in utopia
