Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is my attempt at Eliezer's challenge.

My overall impression of this plan is that it is surprisingly good. Not that I am particularly surprised such a plan exists; if MIRI had created this plan, it would have been surprisingly bad. I think it is somewhat plausible that this plan could actually work, or at least that some steel-manned version could work if several unknown parameters are set to favorable values.

Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn’t, thus refining our ability to make AI systems safer and more aligned. Using scientific experiments, we study how alignment techniques scale and where they will break.

We tackle alignment problems both in our most capable AI systems as well as alignment problems that we expect to encounter on our path to AGI. Our main goal is to push current alignment ideas as far as possible, and to understand and document precisely how they can succeed or why they will fail.

You stand in a vast minefield, one end armed with popping balloons, the other end armed with some quantum vacuum decay weapon. In between there is a huge range: fireworks, conventional mines, nukes, antimatter and more.

Your plan is to wander the safer parts of the minefield, recording where mines go off, in the hope you spot a pattern that can lead you through the whole minefield. 

This plan is not hopelessly doomed. But it is risky. You definitely need a way to detect that dangerous terrain is approaching, and to stop before you get there. Your job isn't to charge ahead. It is to march back and forth over swaths of balloon-filled ground, carefully recording every balloon that pops, and scrutinizing the data for a pattern. Maybe you need to venture a little further into the space of firecrackers. But have a plan for where you stop, and an idea of what the danger signals would be.

It may be that there is no pattern in the mines. Or at least none you can discern. In which case, you don't venture further. You don't keep on, hoping that you will spot a pattern with just a few more large fireworks. You go back to base. And you hope that someone somewhere has been working on an airplane to fly over the minefield, and you can ask to help out with that.


We believe that even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself.

That is at least a possibility. A favorable but not totally implausible setting of those hidden dials. You probably need some new ideas, and if someone shows you a paper on, say, conservative learning (how to learn a classification boundary that is big enough to fit the datapoints and no bigger), be ready to read that paper.
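To make "big enough to fit the datapoints and no bigger" concrete, here is a minimal sketch of one textbook version of the idea: the tightest axis-aligned box around the positive examples. This is an illustration chosen for simplicity, not the method of any particular paper, and the function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Box = List[Tuple[float, float]]  # one (low, high) interval per dimension


def fit_tightest_box(positives: List[Point]) -> Box:
    """Return the smallest axis-aligned box containing every positive example."""
    if not positives:
        raise ValueError("need at least one positive example")
    dims = len(positives[0])
    return [
        (min(p[d] for p in positives), max(p[d] for p in positives))
        for d in range(dims)
    ]


def is_inside(box: Box, point: Point) -> bool:
    """Classify a point as positive only if it lies inside the fitted box."""
    return all(low <= x <= high for (low, high), x in zip(box, point))


# Toy data (hypothetical numbers): points already observed to be fine.
safe_points = [(1.0, 2.0), (2.0, 3.5), (1.5, 2.5)]
box = fit_tightest_box(safe_points)

print(is_inside(box, (1.2, 3.0)))  # within the observed range in every dimension
print(is_inside(box, (5.0, 3.0)))  # outside the observed range, so not claimed as safe
```

The learner never generalizes beyond what it has seen: anything outside the box is treated as unknown rather than assumed safe, which is the conservative behavior the paragraph above is asking for.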

Unaligned AGI could pose substantial risks to humanity and solving the AGI alignment problem could be so difficult that it will require all of humanity to work together. Therefore we are committed to openly sharing our alignment research when it’s safe to do so: We want to be transparent about how well our alignment techniques actually work in practice and we want every AGI developer to use the world’s best alignment techniques.

I do hope you have some procedure for deciding "when it's safe to do so". And ideally a way to share your results with a few other top labs, if you deem something safe enough to share with DeepMind or MIRI, but not safe enough to make public.

At a high-level, our approach to alignment research focuses on engineering a scalable training signal for very smart AI systems that is aligned with human intent. It has three main pillars:

  1. Training AI systems using human feedback
  2. Training AI systems to assist human evaluation
  3. Training AI systems to do alignment research

The methods you are trying are all known to fail at sufficiently high levels of intelligence. But if these are your only ideas, it is possible they get you far enough for GPT-5 to output a better idea. If you are going to delve into the hacky methods that might possibly drag you just far enough, here are some more.

  1. Try to find an "honesty vector" by comparing the model's activations when you know it's lying to its activations when you think it's probably truthful. Add that vector to the activations of any other calculation you want to be more honest. (Pick the really easy questions for the positive examples. Accidentally training towards only answering easy questions is far safer than accidentally training towards successful deception.)
  2. Use current crude interpretability tools to see what the AI is thinking about. If what it's thinking has nothing to do with what it is saying, this is a red flag.
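A rough sketch of the first suggestion, under the assumption that the comparison is done on hidden activations (the weights themselves are identical whether the model is lying or not). The direction is taken as a simple difference of means, which is one possible choice rather than an established recipe. All names and the 3-dimensional toy activations are hypothetical; a real model would have far larger vectors and would need some way to record them.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def honesty_direction(truthful: List[Vector], lying: List[Vector]) -> Vector:
    """Difference of means, pointing from 'lying' activations toward 'truthful' ones."""
    t_mean, l_mean = mean_vector(truthful), mean_vector(lying)
    return [t - l for t, l in zip(t_mean, l_mean)]


def steer(activation: Vector, direction: Vector, strength: float = 1.0) -> Vector:
    """Add the scaled direction to an activation vector."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Hypothetical activations recorded on really easy questions.
truthful_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
lying_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = honesty_direction(truthful_acts, lying_acts)
steered = steer([0.5, 0.5, 0.5], direction, strength=0.5)
print(direction, steered)
```

Whether a direction found this way actually tracks honesty, rather than question difficulty or some other confound, is exactly the kind of thing that would need to be checked empirically.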


There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t observe yet in current systems. Some of these problems we anticipate now and some of them will be entirely new.

As we trek further into the minefield, we expect to encounter entirely new kinds of explosives. You should be planning not to encounter them, at least for the really dangerous ones. 

We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.

It is at least plausible that a GPT-based approach can learn a superhumanly large pile of heuristics that can output interesting AI ideas. My understanding of existing AI output is that it is more like taking an existing article from the training data and asking humans to translate it into Chinese and back: new words for the existing ideas, but little more than random mutation in idea space. However, that could easily be wrong, or could change with further scaling.

As we make progress on this, our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work together with humans to ensure that their own successors are more aligned with humans.

Optimistic. I think this still deserves to be called a plan. It has a strong vein of wishlist running through it. 

We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research by themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.

If you can offload almost all the work of alignment, you can probably offload most of capabilities too. You have an AI you can just ask for "code for a superintelligence" and get back runnable code for a potentially unaligned superintelligence. Unless of course you have used some sort of fine-tuning to stop this, and the fine-tuning works better than it did for ChatGPT. This isn't in and of itself doom. If you get this far, you have constructed an artifact powerful enough to save or destroy the world. May you be careful with it.


Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.

Language models come with several great advantages, but also a great disadvantage. Language models automatically learn all the dark arts of persuasion and manipulation.

Suppose some programmer at OpenAI asks for "a highly convincing argument that the world is flat". The argument is indeed highly convincing. The programmer is convinced. Do you have a plan for this situation? My plan would be: Delete the language model, and any documents that you think were written by the language model. And any documents that might contain the programmer rephrasing the arguments in their own words. Send the programmer on at least 6 months paid leave, and psychological screening.

Suppose the programmer asks for "a highly convincing argument for why AI is totally safe". And is indeed convinced. Similar to before, except now the programmer is on definite leave, as in they definitely aren't coming back. And you tell everyone else enough of what happened that they won't be working elsewhere on anything relating to AI. Pay them full salary to sit at home knitting. This is probably the point where you want to do serious soul searching over why the question was put into such a dangerous model. And possibly the point where you want to give up as an AI research org and hope someone else can safely align AI. This mine blew your leg off. Going any further is suicide.

But why delete the model? Shouldn't we use such a powerful model for good? Like maybe ask it to generate reasons AI is really dangerous, and send them to some people. 

No. There are some spells so dark, no light wizard should ever cast them. 

I suppose I have to explain in detail why this is a bad idea, for the benefit of those who just don't get it. The AI has already demonstrated the ability to hack human brains, to load information uncorrelated with reality directly into a smart and functioning mind. 

Suppose you do get the AI to create such an argument. You send it to a few politicians who are making noises about alignment being a waste of public money. Magic this dark is harder to control than it is to unleash. The argument inevitably goes viral online. Now a large fraction of the population, including nearly every alignment researcher, has seen the argument. They are all convinced AI is dangerous because giant space penguins like the taste of AI, and will eat the earth if the earth has too many AIs on it. (Or some other nonsense that makes Scientology look sane in comparison.) 3000 years later, humanity has rebuilt from the rubble and sets out into space, on a holy quest to destroy all giant penguins. It can only be Murphy's law that the first aliens humanity comes across have a decidedly penguinoid appearance. Murphy's law and the fact that, all those years ago, some corner of GPT-5 had a surprisingly good grasp of astrobiology and had successfully predicted one of the most common body plans to evolve across the universe.

1 comment

Upvoted since I like how literally you went through the plan. I think we need to think about and criticize both the literal version of the plan and the way it intersects with reality.


The methods you are trying are all known to fail at sufficiently high levels of intelligence. But if these are your only ideas, it is possible they get you far enough for GPT-5 to output a better idea.

To me this seems like a key point that many other critiques, which focus on specific details, are missing.