mr-ubik

Could LLMs develop the type of self awareness you describe as part of their own training or RL-based fine-tuning? Many LLM do seem to have "awareness" of their existence and function (incidentally this could be evidenced by the model evals run by Anthropic). I assume a simple future setup could be auto-GPT-N with a prompt like "You are the CEO of Walmart, you want to make the company maximally profitable" in that scenario I would contend that the Agent could be easily aware of both its role and function and easily be attracted to that search space.
Could we detect deployed (and continually learning) agent entering these attractors?

mr-ubik3y

My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

Evolution can only optimize over our learning process and reward circuitry, not directly over our values or cognition. Moreover, robust alignment to IGF requires that you even have a concept of IGF in the first place. Ancestral humans never developed such a concept, so it was never useful for evolution to select for reward circuitry that would cause humans to form values around the IGF concept.

Another example may be lactose tolerance. First you need animal husbandry and dairy production, then you get selective pressure favoring those who can reliably process lactose, without the "concept of husbandry" there's no way for the optimizer to select for it.

LESSWRONG
LW

LESSWRONG
LW

mr-ubik

mr-ubik

mr-ubik

mr-ubik

mr-ubik

mr-ubik