Ha--I was about to type something very similar. The comment above is also ambiguous in particular: it could alternatively be read as coming from someone who has decided that nothing in the LLM reference class will ever be capable enough to be threatening or "incorrigible" in any meaningful sense, because of underspecified inductive biases that promote ad-hoc shortcut learning, and perhaps because of intrinsic limitations of human text as a medium to learn representations from. Someone might argue "Aha, but by TRAINING the AI to predict the next token or perform longform ...
An AI without a factorized motivation system (like an LLM) seems more immediately corrigible than an AI with one, albeit with a lower capability ceiling. Fine-tuned LLMs can accomplish tasks, and it's possible to anthropomorphize them, but the "value knowledge" upstream of their behaviors exists in an ad-hoc, fractured state and can be crudely manipulated--Emergent Misalignment and Subliminal Learning are nice examples. The LLM training process is like an industrialized process for growing a big, connectionist grammar of ab...