Nearest unblocked strategy versus learning patches

I agree with you; this is an old post that I don't really agree with any more.

Morally underdefined situations can be deadly

The things avoided seem like they increase, not decrease, risk.

Yes, but it might be that the means needed to avoid them - maybe heavy-handed AI interventions? - could be even more dangerous.

Morally underdefined situations can be deadly

The second one (though I think there is some overlap with the first).

Morally underdefined situations can be deadly

It is indeed our moral intuitions that are underdefined, not the states of the universe.

The Goldbach conjecture is probably correct; so was Fermat's last theorem

You need to add some assumptions to make it work. For example, I believe the following works:

"In second order arithmetic, we can prove that NP1 implies NF, where NP1 is the statement 'there exists no first order proof of the conjecture' and NF is the statement 'the conjecture isn't false'."

Research Agenda v0.9: Synthesising a human's preferences into a utility function

Because our preferences are inconsistent, and if an AI announces what our true preferences are, we're likely to react by saying "no! No machine will tell me what my preferences are. My true preferences are different, in subtle ways".

General alignment plus human values, or alignment via human values?

Thanks for developing the argument. This is very useful.

The key point seems to be whether we can develop an AI that can successfully behave as a low-impact AI - not an "on balance, things are ok" AI, but a genuinely low-impact AI that ensures we don't move towards a world where our preferences might be ambiguous or underdefined.

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?

General alignment plus human values, or alignment via human values?

The successor problem is important, but it assumes we have the values already.

I'm imagining algorithms designing successors with imperfect values (that they know to be imperfect). It's a somewhat different problem (though solving the classical successor problem is also important).

General alignment plus human values, or alignment via human values?

My thought is that when deciding to take a morally neutral act with tradeoffs, the AI needs to be able to balance the positives and negatives to reach a reasonably acceptable tradeoff, and hence needs to know both the positive and the negative human values to achieve that.
