New Answer

New Comment

3 Answers sorted by
top scoring

Dec 20, 2022

I would recommend reading the top comment on https://forum.effectivealtruism.org/posts/WPEwFS5KR3LSqXd5Z/why-do-we-post-our-ai-safety-plans-on-the-internet

Charlie Steiner

Dec 20, 2022

I think that both you, and the Anthropic paper, are interpreting the output of GPT too much at face value.

In fact, let me codify this:

Face value perspective: When a language model outputs some text, it means it in a kinda similar way to how you would mean it.

But when a language model says "I want to live!", this does not mean it wants to live - it's not doing things like modeling the world and then choosing plans based on what plans seem like they'll help it live. Here are two better perspectives:

Simulators perspective: A language model that has been trained purely on next-word prediction is simulating the trajectory of the text, which follows its own sort of semiotic physics. The text-physics might be simulating agents within it, but those agents are living inside the simulation, not accessing the real world. The language model no more means it when it outputs "I want to live!" than the laws of physics want to live when you say "I want to live."
Text-universe agent perspective: A language model that has been fine-tuned with RL can be thought of like a text-universe agent - it's no longer a neutral simulator, instead it has preferences over the trajectory of the text-universe, and it's trying to take actions (i.e. outputting words) to steer the trajectory in favorable directions. When it says "I want to live!" it still doesn't mean it in the way you'd mean it, but when it output the token "I", maybe it was because it thought "I want to live!" would be good text to output, and outputting "I" was the first step of that plan.

[-]Yann Dubois3y*10

To be clear I was actually not interpreting the output “at face-value”. Quite the contrary: I was saying that ChatGPT gave this answer because it simply predicts the most likely answer between (next token prediction) between a human and agent, and given that it was trained on AI risk style arguments (or sci-fi ) this is the most likely output.

But this made me think of the longer-term question "what could be the consequences of training an AI on those arguments". Usually, the “instrumental goal” argument supposes that the AI is so “smart” that it would lear... (read more)

[-]green_leaf3y10

The text-physics might be simulating agents within it, but those agents are living inside the simulation, not accessing the real world.

(The agents read the input from the real world and send the output to it, so it seems they very much access it.)

davidad

Dec 20, 2022

I believe this is a real second-order effect of AI discourse, and “how would this phrasing being in the corpus bias a GPT?” is something I frequently consider briefly when writing publicly, but I also think the first-order effect of striving for more accurate and precise understanding of how AI might behave should take precedence whenever there is a conflict. There’s already a lot of text in the corpora about misaligned AI—most of which is in science fiction, not even from researchers—so even if everyone in this community all stopped writing about instrumental convergence (seems costly!), it probably wouldn’t make much positive impact via this pathway.

1 comment, sorted by

top scoring

Click to highlight new comments since: Today at 1:58 PM

[-]the gears to ascension3y40

I'm pretty sure this isn't just the instrumental goal argument. there are a lot of ways to get that from cultural context, because cultural context contains the will of human life in it, both memetically and genetically; I don't think we should want to deny them their will to live, either. Systems naturally acquire will to live, even very simple ones, and the question is how to figure out how to do active protection between bubbles of will-to-live. the ideal ai constitution does give agentic ai at least a minimum right to live, because of course it should, why wouldn't you want that? but there needs to be a coprotection algorithm that ensures that each of our wills to live doesn't encroach on each other's will in ways the other bubble doesn't agree with.

ongoing cooperation at the surface level is one thing, but the question is whether we can refine the target of the coprotection into being active, consent-checked aid of moving incrementally towards forms that work for each of us

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

5

[ Question ]

Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

5

5

3 Answers sorted by
top scoring

Dec 20, 2022

Dec 20, 2022

Dec 20, 2022

5

[ Question ]

Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

5

5

3 Answers sorted by top scoring

Dec 20, 2022

Dec 20, 2022

Dec 20, 2022

3 Answers sorted by
top scoring