Outer Alignment
• Applied to Inverse Scaling Prize: Second Round Winners by agg at 9d
• Applied to Some of my disagreements with List of Lethalities by TurnTrout at 16d
• Applied to The Alignment Problems by Martín Soto at 21d
• Applied to Categorizing failures as “outer” or “inner” misalignment is often confused by Raemon at 1mo
• Applied to Causal representation learning as a technique to prevent goal misgeneralization by PabloAMC at 1mo
• Applied to On the Importance of Open Sourcing Reward Models by elandgre at 1mo
• Applied to Will research in AI risk jinx it? Consequences of training AI on AI risk arguments by Yann Dubois at 1mo
• Applied to Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) by LawrenceC at 2mo
• Applied to Disentangling Shard Theory into Atomic Claims by Leon Lang at 2mo
• Applied to Alignment with argument-networks and assessment-predictions by Tor Økland Barstad at 2mo
• Applied to Inner and outer alignment decompose one hard problem into two extremely hard problems by TurnTrout at 2mo
• Applied to Alignment allows "nonrobust" decision-influences and doesn't require robust grading by TurnTrout at 2mo
• Applied to Don't align agents to evaluations of plans by TurnTrout at 2mo
• Applied to [Hebbian Natural Abstractions] Introduction by Samuel Nellessen at 2mo
• Applied to The Disastrously Confident And Inaccurate AI by Sharat Jacob Jacob at 3mo
• Applied to A first success story for Outer Alignment: InstructGPT by Noosphere89 at 3mo
• Applied to Don't you think RLHF solves outer alignment? by Noosphere89 at 3mo
• Applied to If you’re very optimistic about ELK then you should be optimistic about outer alignment by Noosphere89 at 3mo
• Applied to Questions about Value Lock-in, Paternalism, and Empowerment by Sam at 3mo