I set out to review the OpenAI alignment plan, and my brain at some point diverged to modeling the humans behind the arguments instead of the actual arguments.
So behold! A simplified, first-pass Alignment Typology.
Why can't we all just agree?
There are a lot of disagreements in AI alignment. Some people don't see the problem, some think we'll be fine, some think we're doomed, and then different clusters of people have different ideas on how we should go about solving alignment. Thus I tried to sketch out my understanding of the key differences between the largest clusters of views on AI alignment. What emerged are roughly five clusters, sorted in order of optimism about the fate of humanity: the sceptics, the humanists, the empiricists, the rationalists, and the fatalists.
Sceptics don't expect AGI to show up in any relevant time frame.
Humanists think humanity will prevail fairly easily through coordination around alignment or just solving the problem directly.
Empiricists think the problem is hard, AGI will show up soon, and if we want to have any hope of solving it, then we need to iterate and take some necessary risk by making progress in capabilities as we go.
Rationalists think the problem is hard, AGI will show up soon, and we need to figure out as much as we can before making any capabilities progress.
Fatalists think we are doomed and we shouldn't even try (though some are quite happy about it).
Here is a table.
One of these
|||Sceptics||Humanists||Empiricists||Rationalists||Fatalists|
|Distance to AGI||high||-||low/med||low/med||-|
|Closeness to AGI required to Solve Alignment|
|Closeness to AGI resulting in unacceptable danger|
LessWrong is mostly populated by empiricists and rationalists. They agree alignment is a problem that can and should be solved. The key disagreement is on the methodology. While empiricists lean more heavily on gathering data and iterating on solutions, rationalists lean more heavily toward discovering theories and proofs to lower risk from AGI (and some people are a mix of the two). Just by shifting the weights of risk/reward on iteration and moving forward, you get two opposite approaches to doing alignment work.
How is this useful?
Personally, it helps me quickly get an idea of what clusters people are in and understand the likely arguments for their conclusions. However, a counterargument can be made that this just feeds into stereotyping and creating schisms, and I can't be sure that's untrue.
What do you think?
This may be so for the OpenAI alignment team's empirical researchers, but other empirical researchers note we can work on several topics to reduce risk without substantially advancing general capabilities. (As far as I can tell, they are not working on any of the following topics, rather focusing on an avenue to scalable oversight which, as instantiated, mostly serves to make models generally better at programming.)
Here are four example areas with minimal general capabilities externalities (descriptions taken from Open Problems in AI X-Risk):
Trojans - AI systems can contain “trojan” hazards. Trojaned models behave typically in most situations, but when specific secret situations are met, they reliably misbehave. For example, an AI agent could behave normally, but when given a special secret instruction, it could execute a coherent and destructive sequence of actions. In short, this area is about identifying hidden functionality embedded in models that could precipitate a treacherous turn. Work on detecting trojans does not improve general language model or image classifier accuracy, so the general capabilities externalities are moot.
Anomaly detection - This area is about detecting potential novel hazards such as unknown unknowns, unexpected rare events, or emergent phenomena. (This can be used for tripwires, detecting proxy gaming, detecting trojans, malicious actors, possibly for detecting emergent goals.) In anomaly detection, general capabilities externalities are easy to avoid.
Power Aversion - This area is about incentivizing models to avoid gaining more power than is necessary and analyzing how power trades off with reward. This area is deliberately about measuring and making sure highly instrumentally useful/general capabilities are controlled.
Honesty - Honest AI involves creating models that only output what they hold to be true. It also involves determining what models hold to be true, perhaps by analyzing their internal representations. Honesty is a narrower concept than truthfulness and is deliberately chosen to avoid capabilities externalities, since truthful AI is usually a combination of vanilla accuracy, calibration, and honesty goals. Optimizing vanilla accuracy is optimizing general capabilities. When working towards honesty rather than truthfulness, it is much easier to avoid capabilities externalities.
More general learning resources are at this course, and more discussion of safety vs capabilities is here (summarized in this video).
Thank you! I appreciate the in-depth comment.
Do you think any of these groups hold that all of the alignment problem can be solved without advancing capabilities?