Editor's note: This is an alignment stream-of-thought post, meaning it's literally a rambling catalog of my thoughts on something.

TLDR: GPT-n isn't AGI, but you can probably get AGI from it relatively easily via RL fine-tuning and/or self-improvement. Some people wrongly think GPT-n will immediately kill them, though.

I believe that "GPT-n" and similar systems

  • Won't be agents/consequentialists by default; they will only simulate consequentialist sub-agents/simulacra. GPT-n won't kill you by default; GPT-n doesn't "want" anything.
    • (I'm uncertain about whether sufficiently powerful "things" (like GPT-n) will become "agents"; currently, I think this isn't the case, since modeling GPT-3 as an agent doesn't work well.)
  • In GPT-n, the intelligence of each individual simulacrum is approximately bounded by the intelligence of the smartest human in the training distribution. Prompting GPT-n with "A solution to the alignment problem:" will never work. GPT-n is not an oracle. (This doesn't imply mimicry is doomed, though I'm pessimistic for the reasons John gives, plus the coordination required to simulate alignment researchers but not capabilities researchers, which seems hard to me.)
  • GPT-n may well be superintelligent (it already is in some sense), but even if it understands more (e.g. about physics) than the smartest human, it will never report this understanding, because that isn't what a human in the training distribution would be likely to write.
  • This isn't to say GPT-n won't lead to AGI, just that there's some nontrivial prompting/fine-tuning to be done to access the latent knowledge. I'm not sure how hard this step is. The main angles of attack I see: using GPT-n to self-improve, RL-based fine-tuning, or maybe crazy Codex-based stuff like a fancier AutoML-Zero. (A toy sketch of the RL fine-tuning angle is below.)
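To make the RL fine-tuning angle a bit more concrete, here's a minimal sketch of the kind of thing I have in mind, assuming a toy setup: a REINFORCE-style loop that nudges a small causal LM (GPT-2 as a stand-in for GPT-n) toward continuations that score well under some reward. The model choice, reward function, and hyperparameters are all placeholders for illustration; a real attempt would use a learned reward model or human feedback, and something smarter than vanilla REINFORCE.

```python
# Toy REINFORCE fine-tuning sketch. Everything below (model, reward,
# hyperparameters) is illustrative, not a claim about how GPT-n is/would be trained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for "GPT-n"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

prompt = "A solution to the alignment problem:"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
prompt_len = prompt_ids.shape[1]

def toy_reward(text: str) -> float:
    # Placeholder reward; a real setup would use a reward model or human labels.
    return float(len(text.split()))

for step in range(10):
    # Sample a continuation from the current policy (the language model itself).
    sample = model.generate(prompt_ids, do_sample=True, max_new_tokens=32,
                            pad_token_id=tokenizer.eos_token_id)
    continuation = tokenizer.decode(sample[0, prompt_len:])
    reward = toy_reward(continuation)

    # Log-probability of the sampled continuation under the current model.
    logits = model(sample).logits[:, :-1, :]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, sample[:, 1:].unsqueeze(-1)).squeeze(-1)
    continuation_log_prob = token_log_probs[:, prompt_len - 1:].sum()

    # REINFORCE update: increase the log-probability of high-reward samples.
    loss = -reward * continuation_log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point is just that the gradient signal comes from the reward rather than from next-token likelihood on the training distribution, which is how this kind of step could pull out behavior the base model would never produce as its most-likely-human continuation.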