Ha--I was about to type something very similar. The comment above is also ambiguous in particular: it could alternatively be read as coming from someone who has decided that nothing in the LLM reference class will ever be capable enough to be threatening or "incorrigible" in any meaningful sense, because of underspecified inductive biases that promote ad-hoc shortcut learning, and perhaps because of intrinsic limitations of human text as a medium to learn representations from. Someone might argue "Aha, but by TRAINING the AI to predict the next token or perform longform ...
An AI without a factorized motivation system (like an LLM) seems more immediately corrigible than an AI with one, albeit with a lower capability ceiling. Fine-tuned LLMs can accomplish tasks, and it's possible to anthropomorphize them, but the "value knowledge" upstream of their behaviors exists in an ad-hoc, fractured state and can be crudely manipulated--Emergent Misalignment and Subliminal Learning are nice examples. The LLM training process is like an industrialized process for growing a big, connectionist grammar of ab...