Quick takes by No77e, 18th Feb 2023

If LLM simulacra resemble humans but are misaligned, that doesn't bode well for s-risk.

The Waluigi effect also seems bad for s-risk: "Optimize for pleasure, ..." flips to "Optimize for suffering, ...".

An optimistic way to frame inner alignment is that gradient descent already hits a very narrow target in goal-space, and we just need one last push.

A pessimistic way to frame inner misalignment is that gradient descent already hits a very narrow target in goal-space, and therefore s-risk could be large.

This community has developed a bunch of good tools for helping resolve disagreements, such as double cruxing. It's a waste that they haven't been systematically deployed for the MIRI conversations. Those conversations could have ended up being more productive, and we could've walked away with a succinct and precise understanding of where the disagreements are and why.

We should implement Paul Christiano's debate game with alignment researchers as the debaters instead of ML systems.
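
Concretely, the round structure might look like the sketch below: two debaters alternate arguments on a shared transcript, and a judge picks a winner. This is only an illustrative skeleton under my own assumptions, not the setup from the original debate paper; the Transcript class and the debater/judge callables are hypothetical stand-ins, and the debater slots could just as well be filled by human researchers as by ML systems.

```python
# Minimal sketch of one debate-game round. The interfaces here are hypothetical;
# debaters and judge are plain callables, so humans can play either role.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Transcript:
    question: str
    answers: Tuple[str, str]  # the two positions being defended
    arguments: List[Tuple[int, str]] = field(default_factory=list)  # (debater index, argument)


def run_debate(
    question: str,
    answers: Tuple[str, str],
    debaters: Tuple[Callable[[Transcript], str], Callable[[Transcript], str]],
    judge: Callable[[Transcript], int],
    n_rounds: int = 3,
) -> int:
    """Alternate arguments between the two debaters, then let the judge pick a winner (0 or 1)."""
    transcript = Transcript(question, answers)
    for _ in range(n_rounds):
        for i, debater in enumerate(debaters):
            transcript.arguments.append((i, debater(transcript)))
    return judge(transcript)


# Example with stub debaters and a judge who always favors the second answer.
winner = run_debate(
    "Is proposal X safe?",
    ("Yes", "No"),
    (lambda t: "argument for Yes", lambda t: "argument for No"),
    judge=lambda t: 1,
)
```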

If you try to write a reward function, or a loss function, that captures human values, that seems hopeless.

But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that's less hopeless.

The difference is the one between constructing something and recognizing it, or between proving and checking, or between producing and criticizing, and so on...
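
As a toy illustration of that asymmetry (my own example, not from the post), consider subset-sum: constructing a subset that hits a target requires searching over subsets, while recognizing a proposed subset as correct is a single sum.

```python
# Constructing vs. recognizing: brute-force search is exponential in the worst
# case, while verifying a candidate answer is trivial.

from itertools import combinations
from typing import Optional, Sequence, Tuple


def construct_subset(nums: Sequence[int], target: int) -> Optional[Tuple[int, ...]]:
    """Search over all subsets for one that sums to target -- expensive."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return subset
    return None


def recognize_subset(subset: Sequence[int], target: int) -> bool:
    """Checking a proposed answer is a single sum -- cheap."""
    return sum(subset) == target


print(construct_subset([3, 34, 4, 12, 5, 2], 9))  # (4, 5) -- found by searching
print(recognize_subset((4, 5), 9))                # True -- verified instantly
```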

As a failure mode of specification gaming, agents might modify their own goals. 

As a convergent instrumental goal, agents want to prevent their goals from being modified.

I think I know how to resolve this apparent contradiction, but I'd like to see other people's opinions about it.

Why shouldn't this work? What's the epistemic failure mode being pointed at here?