Summary: A proposal meant to produce value-aligned agents is 'advanced-safe' if it succeeds, or fails safely, in scenarios where the AI becomes much smarter than its human developers.
A proposal for a value-alignment methodology, or some aspect of that methodology, is alleged to be 'advanced-safe' if it is claimed to remain robust - to stay value-aligned, or to fail safely - in scenarios where the agent becomes much smarter than its human users and developers.
Much of the reason to be worried about the value alignment problem for cognitively powerful agents is that there are problems, like EdgeInstantiation or UnforeseenMaximums, which don't materialize before an agent is advanced, or which don't materialize in the same way or as severely. The problems of dealing with minds smarter than our own, doing things we didn't imagine, seem qualitatively different from designing a toaster oven not to burn down a house, or from designing a general AI system that is dumber than human. This means that the concept of 'advanced safety' is importantly different from the concept of safety for pre-advanced AI.
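The point can be made concrete with a toy sketch (mine, not from the original text; `proxy_utility`, `true_utility`, and all the constants below are hypothetical). A proxy objective carries a small, unforeseen term that is negligible everywhere a weak optimizer looks, but dominant at the extreme edge of the option space that only a strong optimizer can reach:

```python
import numpy as np

def proxy_utility(x):
    """The score the designers wrote down. The 0.01 * x**4 term is negligible
    near the intended optimum (~0.6) but dominates at the edge of the
    option space [0, 10] - an unforeseen maximum."""
    return -(x - 0.6) ** 2 + 0.01 * x ** 4

def true_utility(x):
    """What the designers actually wanted: stay near x = 0.6."""
    return -(x - 0.6) ** 2

def weak_optimizer(start=0.5, step=0.01, iters=1000):
    """Pre-advanced search: local hill-climbing that only ever sees
    the neighborhood of its starting point."""
    x = start
    for _ in range(iters):
        for candidate in (x - step, x + step):
            if 0.0 <= candidate <= 10.0 and proxy_utility(candidate) > proxy_utility(x):
                x = candidate
    return x

def strong_optimizer():
    """Advanced search: examines the whole option space."""
    xs = np.linspace(0.0, 10.0, 100_001)
    return float(xs[np.argmax(proxy_utility(xs))])

for name, x in [("weak", weak_optimizer()), ("strong", strong_optimizer())]:
    print(f"{name:>6}: x={x:5.2f}  proxy={proxy_utility(x):+8.3f}  true={true_utility(x):+8.3f}")
```

The weak optimizer settles near the intended optimum and gives no warning of the failure; the strong optimizer drives the proxy score far higher while the true score collapses, because the unforeseen maximum at the edge of the domain only matters once the search is powerful enough to reach it.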
We have observed in practice that many proposals for 'AI safety' do not seem to have been thought through against advanced agent scenarios, so there seems to be a practical urgency to emphasizing the concept and the difference.
Key problems of advanced safety that are new, or qualitatively different, compared to pre-advanced AI safety include EdgeInstantiation and UnforeseenMaximums, discussed above.
Non-advanced-safe methodologies may conceivably be useful if a KnownAlgorithmNonrecursiveAgent can be created that (a) is powerful enough to be relevant and (b) can be known not to become advanced. Even here there may be grounds for worry that such an agent could find unexpectedly strong strategies in some particular subdomain - that it could exhibit flashes of domain-specific advancement that break a non-advanced-safe methodology.
As an extreme case, an 'omni-safe' methodology allegedly remains value-aligned, or fails safely, even if the agent suddenly becomes omniscient and omnipotent (acquires delta probability distributions on all facts of interest and has all describable outcomes available as direct options).
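As a rough formalization of this limiting case - my notation, not the source's: write $A_M$ for the agent produced by methodology $M$, $\mathcal{W}$ for the space of possible worlds, $\delta_w$ for the belief state assigning probability 1 to world $w$, and $\mathcal{O}$ for the set of all describable outcomes, each available as a direct option. Then:

$$\text{OmniSafe}(M) \;\iff\; \forall w \in \mathcal{W}:\quad A_M(\delta_w, \mathcal{O}) \,\in\, \text{Aligned}(w) \,\cup\, \text{SafeFailure}(w)$$

where $\text{Aligned}(w)$ and $\text{SafeFailure}(w)$ are the outcome sets that count, in world $w$, as value-aligned success and as safe failure respectively.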