This is a special post for quick takes by Veedrac. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Reply to https://twitter.com/krishnanrohit/status/1794804152444580213, too long for twitter without a subscription so I threw it here, but do please treat it like a twitter comment.

rohit: Which part of [the traditional AI risk view] doesn't seem accounted for here? I admit AI safety is a 'big tent' but there's a reason they're congregated together.

You wrote in your list,

the LLMs might start even setting bad objectives, by errors of omission or commission. this is a consequence of their innards not being the same as people (either hallucinations or just not having world model or misunderstanding the world)

In the context of traditional AI risk views, this misses the argument. Roughly the concern is instead like so:

ASI is by definition very capable of doing things (aka selecting for outcomes), in at least all the ways collections of humans can. It is both theoretically true and observably the case in reality that when some things are selected for, a bunch of other things that aren't selected for get traded off, and that the more strongly something is selected for, the more stuff ends up traded against it, incidentally or not.
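To make the trade-off claim concrete, here is a toy sketch. Everything in it is my own illustrative assumption and not part of the original argument: outcomes are modelled as a fixed budget of 1.0 split across five features, `sample_outcome` and `select` are hypothetical helpers, and "selection strength" just means how many candidate outcomes get searched before picking the one with the most of feature 0. It only shows the narrow point that, under a shared budget, searching harder for one feature leaves less for everything else.

```python
import random


def sample_outcome(rng, n_features=5):
    """Draw one hypothetical outcome: a budget of 1.0 split across features."""
    weights = [rng.expovariate(1.0) for _ in range(n_features)]
    total = sum(weights)
    return [w / total for w in weights]


def select(pool, strength):
    """Pick the outcome with the most of feature 0 among the first `strength` candidates."""
    return max(pool[:strength], key=lambda outcome: outcome[0])


rng = random.Random(0)  # fixed seed so the run is repeatable
pool = [sample_outcome(rng) for _ in range(10_000)]

results = []
for strength in (1, 10, 100, 1_000, 10_000):
    chosen = select(pool, strength)
    left_for_others = sum(chosen[1:])  # what remains for the unselected features
    results.append((strength, chosen[0], left_for_others))
    print(
        f"candidates searched={strength:>6}  "
        f"selected feature={chosen[0]:.3f}  "
        f"everything else={left_for_others:.3f}"
    )
```

Because each larger search includes the smaller one, the selected feature can only go up as the search widens, and the share left for everything else can only go down. Real systems obviously aren't a fixed-budget simplex; the sketch just shows the shape of the claim.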

We should expect any ASI to have world-changing effects, and for those effects to trade off strongly against other things. There is a bunch of stuff we want that we don't want traded off (eg. being alive).

The first problem is that we don't know how to put any preferences into an AI in a way that's robust to even trivial selection pressure: not in theory, not in practice on existing models, and certainly not in ways that would apply to arbitrary systems that indirectly contain ML models but aren't constrained by those models’ expressed preferences.

The second problem is that there are a bunch of instrumental goals (not eg. lying, but eg. continuing to have causal effect on the world) that are useful to almost all goals, and that are concrete examples of why an ASI would want to disempower humans. That is, almost everything that could plausibly be called an ASI will be effective at doing a thing, and the natural strategies for doing things involve not failing at them in easily foreseeable ways.

Stuff like lying is not the key issue here. It often comes up because people say ‘why don’t we just ask the AI if it’s going to be bad’ and the answer is basically code for ‘you don’t seem to understand that we are talking about something that is trying to do a thing and is also good at it.’

Similarly for ‘we wouldn't even know why it chooses outcomes, or how it accomplishes them’: this is a problem because it is yet another reason to rule out simple fixes, not because it is fundamental to the issue. Like, if you understand why a bridge falls down, you can make a targeted fix and solve that problem, and if you don’t know then probably it’s a lot harder. But you can know every line of code of Stockfish (pre-NNUE) and still not have a chance against it, because Stockfish is actively selecting for outcomes and it is better at selecting them than you.

“LLMs have already lied to us” from the traditional AI risk crowd is similarly not about LLM lying being intrinsically scary; it is a yell of “even here you have no idea what you are doing, even here you have these creations you cannot control, so how in the world do you expect any of this to work when the child is smarter than you and it’s actually trying to achieve something?”

What do you mean by "robust to even trivial selection pressure"?

Eg. a moderately smart person asking it to do something else by trying a few prompts. We're getting better at this for very simple properties but I still consider it unsolved there.